Re: hadoop LZ4 incompatible with open source LZ4
@Wes Okay, I think I figured out why I could not read the LZ4-encoded parquet file generated by parquet-mr. It turns out hadoop LZ4 has its own framing format. I summarized the details in the JIRA ticket you posted: https://issues.apache.org/jira/browse/PARQUET-1241

Thanks,
Alex Wang

On Tue, 7 Aug 2018 at 12:13, Wes McKinney wrote:
> hi Alex,
>
> here's one thread I remember about this
>
> https://github.com/dask/fastparquet/issues/314#issuecomment-371629605
>
> and a relevant unresolved JIRA
>
> https://issues.apache.org/jira/browse/PARQUET-1241
>
> The first step to resolving this issue is to reconcile what mode of
> LZ4 the Parquet format is supposed to be using
>
> - Wes
>
> On Tue, Aug 7, 2018 at 2:10 PM, ALeX Wang wrote:
> > Hi Wes,
> >
> > Just to share my understanding: Arrow downloads lz4 from
> > https://github.com/lz4/lz4 (via export
> > LZ4_STATIC_LIB=$ARROW_EP/lz4_ep-prefix/src/lz4_ep/lib/liblz4.a), so it
> > is using the LZ4_FRAMED codec. But hadoop is not using framed lz4, so I'll
> > see if I can implement a CodecFactory handler for LZ4_FRAMED in parquet-mr.
> >
> > Thanks,
> >
> > On Tue, 7 Aug 2018 at 08:50, Wes McKinney wrote:
> >
> >> hi Alex,
> >>
> >> No, if you look at the implementation in
> >> https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/compression_lz4.cc#L32
> >> it is not using the same LZ4 compression style that Hadoop is using;
> >> realistically we need to add a bunch of options to Lz4Codec to be able
> >> to select what we want (or add an LZ4_FRAMED codec). I'll have to dig in
> >> my e-mail to find the prior thread
> >>
> >> - Wes
> >>
> >> On Tue, Aug 7, 2018 at 11:45 AM, ALeX Wang wrote:
> >> > Hi Wes,
> >> >
> >> > Are you talking about this?
> >> > http://mail-archives.apache.org/mod_mbox/arrow-issues/201805.mbox/%3cjira.13158639.1526013615000.61149.1526228880...@atlassian.jira%3E
> >> >
> >> > I tried to compile with the latest arrow, which contains this fix, and
> >> > still encountered the corruption error.
> >> >
> >> > Also, we tried to read the file using pyparquet and spark; neither worked.
> >> >
> >> > Thanks,
> >> > Alex Wang
> >> >
> >> > On Tue, 7 Aug 2018 at 08:37, Wes McKinney wrote:
> >> >
> >> >> hi Alex,
> >> >>
> >> >> I think there was an e-mail thread or JIRA about this, would have to
> >> >> dig it up. LZ4 compression was originally underspecified (has that
> >> >> been fixed?) and we aren't using the correct compressor/decompressor
> >> >> options in parquet-cpp at the moment. If you have time to dig in and
> >> >> fix it, it would be much appreciated. Note that the LZ4 code lives in
> >> >> Apache Arrow
> >> >>
> >> >> - Wes
> >> >>
> >> >> On Tue, Aug 7, 2018 at 11:10 AM, ALeX Wang wrote:
> >> >> > Hi,
> >> >> >
> >> >> > I would like to kindly confirm my observation.
> >> >> >
> >> >> > We use parquet-mr (java) to generate a parquet file with LZ4
> >> >> > compression. To do this we have to compile/install the hadoop native
> >> >> > library, which provides the LZ4 codec.
> >> >> >
> >> >> > However, the generated parquet file is not recognizable by
> >> >> > parquet-cpp. I encountered the following error when using the
> >> >> > `tools/parquet_reader` binary:
> >> >> >
> >> >> > ```
> >> >> > Parquet error: Arrow error: IOError: Corrupt Lz4 compressed data.
> >> >> > ```
> >> >> >
> >> >> > Further searching online got me to this JIRA ticket:
> >> >> > https://issues.apache.org/jira/browse/HADOOP-12990
> >> >> >
> >> >> > So, since hadoop LZ4 is incompatible with open-source LZ4, parquet-mr
> >> >> > LZ4 is not compatible with parquet-cpp?
> >> >> >
> >> >> > Thanks,
> >> >> > --
> >> >> > Alex Wang,
> >> >> > Open vSwitch developer

--
Alex Wang,
Open vSwitch developer
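[Editor's note] The framing incompatibility discussed in the thread above comes down to an 8-byte header. A minimal sketch of the Hadoop-style layout as described here (the function names are mine, not Hadoop's or Arrow's; the payload is assumed to already be an LZ4 block):

```cpp
// Sketch of the Hadoop LZ4 framing discussed above: each compressed chunk is
// prefixed with the original data length and the compressed data length, both
// as 4-byte big-endian integers. Plain LZ4 block data has no such prefix.
#include <cstdint>
#include <vector>

// Append a 32-bit value in big-endian byte order.
static void PutBigEndian32(std::vector<uint8_t>* out, uint32_t v) {
  out->push_back(static_cast<uint8_t>(v >> 24));
  out->push_back(static_cast<uint8_t>(v >> 16));
  out->push_back(static_cast<uint8_t>(v >> 8));
  out->push_back(static_cast<uint8_t>(v));
}

// Read a 32-bit big-endian value from a byte buffer.
static uint32_t GetBigEndian32(const uint8_t* p) {
  return (static_cast<uint32_t>(p[0]) << 24) | (static_cast<uint32_t>(p[1]) << 16) |
         (static_cast<uint32_t>(p[2]) << 8) | static_cast<uint32_t>(p[3]);
}

// Wrap an (already LZ4-block-compressed) payload in the Hadoop-style header.
std::vector<uint8_t> WrapHadoopLz4(const std::vector<uint8_t>& compressed,
                                   uint32_t original_len) {
  std::vector<uint8_t> framed;
  PutBigEndian32(&framed, original_len);
  PutBigEndian32(&framed, static_cast<uint32_t>(compressed.size()));
  framed.insert(framed.end(), compressed.begin(), compressed.end());
  return framed;
}
```

A reader expecting a plain LZ4 block hands these 8 header bytes to the decompressor as if they were compressed data, which is exactly the "Corrupt Lz4 compressed data" failure mode reported above.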
[jira] [Comment Edited] (PARQUET-1241) Use LZ4 frame format
[ https://issues.apache.org/jira/browse/PARQUET-1241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16574328#comment-16574328 ]

Alex Wang edited comment on PARQUET-1241 at 8/9/18 5:57 AM:

Hi, [~wesmckinn] and I discussed this issue on the parquet-cpp mailing list. Upon further investigation I found from the hadoop JIRA ticket (https://issues.apache.org/jira/browse/HADOOP-12990) that the hadoop LZ4 format prefixes the compressed data with the *original data length (big-endian)* and then the *compressed data length (big-endian)*.

Using gdb on the *parquet_reader* binary while reading a parquet-mr (1.10.0 release) generated parquet file (with LZ4 compression), I could confirm that the compressed column page buffer indeed has the 8-byte prefix.

{noformat}
# From gdb:
Breakpoint 1, arrow::Lz4Codec::Decompress (this=0xc1d5e0, input_len=109779,
    input=0x7665442e "", output_len=155352, output_buffer=0x7624b040 "")
    at /opt/parquet-cpp/arrow_ep-prefix/src/arrow_ep/cpp/src/arrow/util/compression_lz4.cc:36
36        static_cast<int>(input_len), static_cast<int>(output_len));
Missing separate debuginfos, use: debuginfo-install glibc-2.17-222.el7.x86_64 libgcc-4.8.5-28.el7_5.1.x86_64 libstdc++-4.8.5-28.el7_5.1.x86_64
(gdb) p/x *((uint32_t*)input+1)
$3 = 0xcbac0100

# From python (convert to little endian):
>>> 0x0001accb
109771
{noformat}

Note that *109771 = 109779 - 8*. If I skip the first 8 bytes and then decompress, I get the correct column values.

It seems to me that hadoop is unlikely to change this format (it has been there since 2011), so I'd like to propose a change like the one below, which tries to detect the hadoop LZ4 format when the initial decompression attempt fails:

{noformat}
diff --git a/cpp/src/arrow/util/compression_lz4.cc b/cpp/src/arrow/util/compression_lz4.cc
index 23a5c39..feeb124 100644
--- a/cpp/src/arrow/util/compression_lz4.cc
+++ b/cpp/src/arrow/util/compression_lz4.cc
@@ -22,6 +22,7 @@
 #include <lz4.h>

 #include "arrow/status.h"
+#include "arrow/util/bit-util.h"
 #include "arrow/util/macros.h"

 namespace arrow {
@@ -35,6 +36,19 @@ Status Lz4Codec::Decompress(int64_t input_len, const uint8_t* input, int64_t out
       reinterpret_cast<const char*>(input), reinterpret_cast<char*>(output_buffer),
       static_cast<int>(input_len), static_cast<int>(output_len));
   if (decompressed_size < 0) {
+    // For the hadoop lz4 compression format, the compressed data is prefixed
+    // with the original data length (big-endian) and then the compressed data
+    // length (big-endian).
+    //
+    // If the prefix matches the format, try to decompress from 'input + 8'.
+    if (BitUtil::FromBigEndian(*(reinterpret_cast<const uint32_t*>(input) + 1)) ==
+        input_len - 8) {
+      decompressed_size = LZ4_decompress_safe(
+          reinterpret_cast<const char*>(input) + 8,
+          reinterpret_cast<char*>(output_buffer),
+          static_cast<int>(input_len) - 8, static_cast<int>(output_len));
+      if (decompressed_size >= 0) {
+        return Status::OK();
+      }
+    }
     return Status::IOError("Corrupt Lz4 compressed data.");
   }
   return Status::OK();
{noformat}
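[Editor's note] The prefix check in the proposed patch can be exercised in isolation. A standalone sketch of the same heuristic (the function names here are mine, not Arrow's):

```cpp
#include <cstddef>
#include <cstdint>

// Read a 32-bit big-endian value from a byte buffer.
static uint32_t ReadBigEndian32(const uint8_t* p) {
  return (static_cast<uint32_t>(p[0]) << 24) | (static_cast<uint32_t>(p[1]) << 16) |
         (static_cast<uint32_t>(p[2]) << 8) | static_cast<uint32_t>(p[3]);
}

// Heuristic from the proposed patch: a hadoop-framed LZ4 buffer carries an
// 8-byte prefix whose second big-endian word (the compressed data length)
// equals the total buffer length minus 8.
bool LooksLikeHadoopLz4(const uint8_t* input, size_t input_len) {
  if (input_len < 8) return false;
  return ReadBigEndian32(input + 4) == input_len - 8;
}
```

With the numbers from the gdb session above (total input length 109779, second word 0x0001accb = 109771 = 109779 - 8), the heuristic matches, so the decompressor would retry from `input + 8`.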
[jira] [Commented] (PARQUET-1241) Use LZ4 frame format
[ https://issues.apache.org/jira/browse/PARQUET-1241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16574328#comment-16574328 ]

Alex Wang commented on PARQUET-1241:
------------------------------------

> Use LZ4 frame format
> --------------------
>
>                 Key: PARQUET-1241
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1241
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-cpp, parquet-format
>            Reporter: Lawrence Chan
>            Priority: Major
>
> The parquet-format spec doesn't currently specify whether lz4-compressed data
> should be framed or not. We should choose one and make it explicit in the
> spec, as they are not inter-operable. After some discussions with others [1],
> we think it would be beneficial to use the framed format, which adds a small
> header in exchange for more self-contained decompression as well as a richer
> feature set (checksums, parallel decompression, etc).
> The current arrow implementation compresses using the lz4 block format, and
> this would need to be updated when we add the spec clarification.
> If backwards compatibility is a concern, I would suggest adding an additional
> LZ4_FRAMED compression type, but that may be more noise than anything.
> [1] https://github.com/dask/fastparquet/issues/314

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
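[Editor's note] The issue above contrasts the LZ4 frame format with the raw block format. Unlike a raw block (or a Hadoop-framed one), a framed stream is cheap to recognize: it starts with the 4-byte magic number 0x184D2204 stored little-endian. A minimal sketch (not part of Arrow or parquet-cpp):

```cpp
#include <cstddef>
#include <cstdint>

// An LZ4 frame begins with the magic number 0x184D2204, stored little-endian,
// i.e. the byte sequence 04 22 4D 18. Raw LZ4 blocks and hadoop's framing have
// no such marker, which is why the formats are not interchangeable and why a
// reader cannot reliably guess which one it has been handed.
bool HasLz4FrameMagic(const uint8_t* data, size_t len) {
  return len >= 4 && data[0] == 0x04 && data[1] == 0x22 && data[2] == 0x4D &&
         data[3] == 0x18;
}
```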
[jira] [Commented] (PARQUET-1352) [CPP] Trying to write an arrow table with structs to a parquet file
[ https://issues.apache.org/jira/browse/PARQUET-1352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16573720#comment-16573720 ]

Wes McKinney commented on PARQUET-1352:
---------------------------------------

You can either contribute to the nested data support project (which is ongoing in https://github.com/apache/parquet-cpp/pull/462) or wait for other people to do it. I hope it gets done by the end of 2018.

> [CPP] Trying to write an arrow table with structs to a parquet file
> -------------------------------------------------------------------
>
>                 Key: PARQUET-1352
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1352
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-cpp
>    Affects Versions: cpp-1.4.0
>            Reporter: Dragan Markovic
>            Priority: Major
>
> Relevant issue: [https://github.com/apache/arrow/issues/2287]
>
> I'm creating a struct with the following schema in arrow:
> https://pastebin.com/Cc8nreBP
>
> When I try to convert that table to a .parquet file, the file gets created
> with a valid schema (the one I posted above) and then throws this exception:
> "NotImplemented: Level generation for Struct not supported yet".
>
> Here's the code: [https://ideone.com/DJkKUF]
>
> Is there any way to write an arrow table of structs to a .parquet file in cpp?

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
Re: hadoop LZ4 incompatible with open source LZ4
Hi Wes,

Thanks again for the pointers. During the investigation I noticed this possible bug: the private variables 'min_' and 'max_' in the 'TypedRowGroupStatistics' class are not initialized in the constructor.

I got an abort while trying to read a column using 'parquet_reader', and a gdb breakpoint at the constructor shows that those variables are not initialized:

```
Breakpoint 1, parquet::TypedRowGroupStatistics >::TypedRowGroupStatistics (
    this=0xbafd48, schema=0xc17080, encoded_min="", encoded_max="",
    num_values=0, null_count=39160, distinct_count=0, has_min_max=true,
    pool=0xbab8e0) at /opt/gpdbbuild/parquet-cpp/src/parquet/statistics.cc:74
74        if (!encoded_min.empty()) {
(gdb) p min_
$37 = 116
(gdb) p max_
$38 = false
```

Since both 'encoded_min' and 'encoded_max' are empty strings, the members are never set. So, if this is a valid issue, I'm proposing the following fix:

```
diff --git a/src/parquet/statistics.cc b/src/parquet/statistics.cc
index ea7f783..5d61bc9 100644
--- a/src/parquet/statistics.cc
+++ b/src/parquet/statistics.cc
@@ -65,6 +65,8 @@ TypedRowGroupStatistics::TypedRowGroupStatistics(
     : pool_(pool),
       min_buffer_(AllocateBuffer(pool_, 0)),
       max_buffer_(AllocateBuffer(pool_, 0)) {
+  using T = typename DType::c_type;
+
   IncrementNumValues(num_values);
   IncrementNullCount(null_count);
   IncrementDistinctCount(distinct_count);
@@ -73,9 +75,13 @@ TypedRowGroupStatistics::TypedRowGroupStatistics(
   if (!encoded_min.empty()) {
     PlainDecode(encoded_min, &min_);
+  } else {
+    min_ = T();
   }
   if (!encoded_max.empty()) {
     PlainDecode(encoded_max, &max_);
+  } else {
+    max_ = T();
   }
   has_min_max_ = has_min_max;
 }
```

Thanks,

On Tue, 7 Aug 2018 at 12:13, Wes McKinney wrote:
> hi Alex,
>
> here's one thread I remember about this
>
> https://github.com/dask/fastparquet/issues/314#issuecomment-371629605
>
> and a relevant unresolved JIRA
>
> https://issues.apache.org/jira/browse/PARQUET-1241
>
> The first step to resolving this issue is to reconcile what mode of
> LZ4 the Parquet format is supposed to be using
>
> - Wes
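[Editor's note] The fix proposed above relies on C++ value-initialization: `T()` yields zero for arithmetic types, so assigning it gives the statistics members a deterministic default instead of indeterminate garbage like the `$37 = 116` seen in gdb. A simplified, standalone model of the pattern (this is not the actual parquet-cpp class; `MiniStats` and `Decode` are illustrative names):

```cpp
#include <string>

// Simplified model of the statistics fix: when no encoded value is present,
// fall back to value-initialization (T() is zero for arithmetic types)
// instead of leaving the member indeterminate.
template <typename T>
struct MiniStats {
  T min_;
  T max_;

  MiniStats(const std::string& encoded_min, const std::string& encoded_max) {
    if (!encoded_min.empty()) {
      min_ = Decode(encoded_min);  // stand-in for PlainDecode()
    } else {
      min_ = T();                  // deterministic default
    }
    if (!encoded_max.empty()) {
      max_ = Decode(encoded_max);
    } else {
      max_ = T();
    }
  }

  // Hypothetical decoder: interprets only the first byte, for illustration.
  static T Decode(const std::string& s) { return static_cast<T>(s[0]); }
};
```

Without the `else` branches, reading `min_`/`max_` when both encoded strings are empty is undefined behavior, which is consistent with the abort described above.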
Date and time for next Parquet sync
Hi All,

It has been a while since we had a Parquet sync, so I'd like to propose that we have one next week, on August 15th, at 6pm CET / 9am PST. I'll send a meeting invite with the details soon; let me know if this time is not suitable for you!

Since the last sync there are a couple of topics to discuss, such as:
- Status of Parquet encryption
- Release of a new minor version, and the scope of the new release
- Bloom filters
- Moving Java-specific code from parquet-format to parquet-mr
- parquet.thrift usage best practices in different language bindings (Java, C++, Python, Rust)
- LZ4 incompatibility

The agenda is open for suggestions.

Regards,
Nandor
[jira] [Created] (PARQUET-1373) Encryption key management tools
Gidon Gershinsky created PARQUET-1373:
-----------------------------------------

             Summary: Encryption key management tools
                 Key: PARQUET-1373
                 URL: https://issues.apache.org/jira/browse/PARQUET-1373
             Project: Parquet
          Issue Type: New Feature
          Components: parquet-mr
            Reporter: Gidon Gershinsky
            Assignee: Gidon Gershinsky

Parquet Modular Encryption ([PARQUET-1178|https://issues.apache.org/jira/browse/PARQUET-1178]) provides an API that accepts keys, arbitrary key metadata, and key retrieval callbacks, which makes it possible to implement basically any key management policy on top of it. This Jira will add tools that implement a set of best-practice elements for key management. This is not an end-to-end key management solution, but rather a set of components that might simplify the design and development of one. For example, the tools will cover:
 * modification of key metadata inside existing Parquet files
 * support for re-keying that doesn't require modification of Parquet files

Parquet will not mandate the use of these tools. Users will be able to continue working with the basic API to create any custom key management solution that addresses their security requirements. If it helps, they can also utilize some or all of these tools.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)