In Hadoop's LZ4 codec, the compressed data is prefixed with the 4-byte
original (uncompressed) data length (big-endian) followed by the 4-byte
compressed data length (big-endian).  For details, see the JIRA ticket:

    https://issues.apache.org/jira/browse/HADOOP-12990
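
The 8-byte framing described above can be sketched as follows (a minimal
illustration in Python; the helper names are mine, not Hadoop's or Arrow's):

```python
import struct

def make_hadoop_prefix(original_len, compressed_len):
    # Hadoop's LZ4 framing: 4-byte big-endian uncompressed length,
    # then 4-byte big-endian compressed length, then the raw LZ4 block.
    return struct.pack(">II", original_len, compressed_len)

def parse_hadoop_prefix(buf):
    # Read the two big-endian 32-bit lengths from the start of the buffer.
    original_len, compressed_len = struct.unpack_from(">II", buf, 0)
    return original_len, compressed_len

prefix = make_hadoop_prefix(1000, 600)
assert len(prefix) == 8
assert parse_hadoop_prefix(prefix) == (1000, 600)
```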

Because of this framing, an LZ4-compressed chunk (e.g. a Parquet column
data page) produced by code using the Hadoop LZ4 codec (e.g. the
apache/parquet-mr project) cannot be parsed by Arrow.

This commit proposes a fix that detects whether the first 8 bytes match
the Hadoop format and, if so, decompresses starting after the prefix.  I
already confirmed the fix works in the JIRA ticket:

    https://issues.apache.org/jira/browse/PARQUET-1241
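
The detection step can be sketched like this (a hedged Python sketch of
the idea, not Arrow's actual C++ implementation; the function name and
the exact consistency checks are my assumptions):

```python
import struct

def maybe_strip_hadoop_prefix(buf, expected_uncompressed_len):
    # Heuristic: treat the buffer as Hadoop-framed only if the two
    # big-endian lengths in the first 8 bytes are consistent with it,
    # i.e. the stored uncompressed length matches what the caller
    # expects and the stored compressed length accounts for exactly
    # the rest of the buffer.  Otherwise assume a raw LZ4 block.
    if len(buf) >= 8:
        original_len, compressed_len = struct.unpack(">II", buf[:8])
        if (original_len == expected_uncompressed_len
                and compressed_len == len(buf) - 8):
            return buf[8:]  # decompress from after the prefix
    return buf  # plain LZ4 block: decompress the whole buffer

# Hypothetical Hadoop-framed buffer: 8-byte prefix + 4-byte "payload".
framed = struct.pack(">II", 100, 4) + b"DATA"
assert maybe_strip_hadoop_prefix(framed, 100) == b"DATA"
# A buffer whose prefix does not check out is returned unchanged.
assert maybe_strip_hadoop_prefix(b"DATA", 100) == b"DATA"
```

Since a raw LZ4 block could in principle begin with bytes that happen to
satisfy these checks, the consistency test is a heuristic rather than a
guarantee; in practice the combined length constraints make a false
positive unlikely.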

Signed-off-by: Alex Wang <[email protected]>

[ Full content available at: https://github.com/apache/arrow/pull/2479 ]