kiszk commented on a change in pull request #7789:
URL: https://github.com/apache/arrow/pull/7789#discussion_r464803006
##########
File path: cpp/src/arrow/util/compression_lz4.cc
##########
@@ -349,6 +351,96 @@ class Lz4Codec : public Codec {
const char* name() const override { return "lz4_raw"; }
};
+// ----------------------------------------------------------------------
+// Lz4 Hadoop "raw" codec implementation
+
+class Lz4HadoopCodec : public Lz4Codec {
+ public:
+  Result<int64_t> Decompress(int64_t input_len, const uint8_t* input,
+                             int64_t output_buffer_len,
+                             uint8_t* output_buffer) override {
+    // The following variables only make sense if the parquet file being read was
+    // compressed using the Hadoop Lz4Codec.
+    //
+    // Parquet files written with the Hadoop Lz4Codec contain at the beginning
+    // of the input buffer two uint32_t's representing (in this order) expected
+    // decompressed size in bytes and expected compressed size in bytes.
+    //
+    // The Hadoop Lz4Codec source code can be found here:
+    // https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-nativetask/src/main/native/src/codec/Lz4Codec.cc
Review comment:
Thank you for adding the link. Got it. While the original code uses
`bswap`, the intention seems to be to write the int32 in big-endian format.
The issue is tracked at https://issues.apache.org/jira/browse/HADOOP-11505.
The current code for this part looks good.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]