@Wes Okay, I think I figured out why I could not read LZ4-encoded Parquet files generated by parquet-mr.
Turns out Hadoop LZ4 has its own framing format. I summarized the details in the JIRA ticket you posted: https://issues.apache.org/jira/browse/PARQUET-1241

Thanks,
Alex Wang

On Tue, 7 Aug 2018 at 12:13, Wes McKinney <wesmck...@gmail.com> wrote:
> hi Alex,
>
> here's one thread I remember about this
>
> https://github.com/dask/fastparquet/issues/314#issuecomment-371629605
>
> and a relevant unresolved JIRA
>
> https://issues.apache.org/jira/browse/PARQUET-1241
>
> The first step to resolving this issue is to reconcile what mode of
> LZ4 the Parquet format is supposed to be using
>
> - Wes
>
> On Tue, Aug 7, 2018 at 2:10 PM, ALeX Wang <ee07b...@gmail.com> wrote:
> > Hi Wes,
> >
> > Just to share my understanding:
> >
> > In Arrow, my understanding is that it downloads lz4 from
> > https://github.com/lz4/lz4 (via export
> > LZ4_STATIC_LIB=$ARROW_EP/lz4_ep-prefix/src/lz4_ep/lib/liblz4.a), so it
> > is using the LZ4_FRAMED codec. But Hadoop is not using framed LZ4, so
> > I'll see if I can implement a CodecFactory handler for LZ4_FRAMED in
> > parquet-mr.
> >
> > Thanks,
> >
> > On Tue, 7 Aug 2018 at 08:50, Wes McKinney <wesmck...@gmail.com> wrote:
> >
> >> hi Alex,
> >>
> >> No, if you look at the implementation in
> >> https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/compression_lz4.cc#L32
> >> it is not using the same LZ4 compression style that Hadoop is using;
> >> realistically we need to add a bunch of options to Lz4Codec to be able
> >> to select what we want (or add an LZ4_FRAMED codec). I'll have to dig
> >> through my e-mail to find the prior thread
> >>
> >> - Wes
> >>
> >> On Tue, Aug 7, 2018 at 11:45 AM, ALeX Wang <ee07b...@gmail.com> wrote:
> >> > Hi Wes,
> >> >
> >> > Are you talking about this?
> >> >
> >> > http://mail-archives.apache.org/mod_mbox/arrow-issues/201805.mbox/%3cjira.13158639.1526013615000.61149.1526228880...@atlassian.jira%3E
> >> >
> >> > I tried to compile with the latest Arrow, which contains this fix,
> >> > and still encountered the corruption error.
> >> >
> >> > Also, we tried to read the file using pyparquet and Spark; neither
> >> > worked either.
> >> >
> >> > Thanks,
> >> > Alex Wang
> >> >
> >> > On Tue, 7 Aug 2018 at 08:37, Wes McKinney <wesmck...@gmail.com> wrote:
> >> >
> >> >> hi Alex,
> >> >>
> >> >> I think there was an e-mail thread or JIRA about this; I would have
> >> >> to dig it up. LZ4 compression was originally underspecified (has
> >> >> that been fixed?) and we aren't using the correct
> >> >> compressor/decompressor options in parquet-cpp at the moment. If you
> >> >> have time to dig in and fix it, it would be much appreciated. Note
> >> >> that the LZ4 code lives in Apache Arrow.
> >> >>
> >> >> - Wes
> >> >>
> >> >> On Tue, Aug 7, 2018 at 11:10 AM, ALeX Wang <ee07b...@gmail.com> wrote:
> >> >> > Hi,
> >> >> >
> >> >> > I would like to kindly confirm my observation.
> >> >> >
> >> >> > We use parquet-mr (Java) to generate Parquet files with LZ4
> >> >> > compression. To do this we have to compile/install the Hadoop
> >> >> > native library, which provides the LZ4 codec.
> >> >> >
> >> >> > However, the generated Parquet file is not recognizable by
> >> >> > parquet-cpp. I encountered the following error when using the
> >> >> > `tools/parquet_reader` binary:
> >> >> >
> >> >> > ```
> >> >> > Parquet error: Arrow error: IOError: Corrupt Lz4 compressed data.
> >> >> > ```
> >> >> >
> >> >> > Further searching online got me to this JIRA ticket:
> >> >> > https://issues.apache.org/jira/browse/HADOOP-12990
> >> >> >
> >> >> > So, since Hadoop LZ4 is incompatible with the open-source LZ4
> >> >> > format, parquet-mr LZ4 is not compatible with parquet-cpp?
> >> >> >
> >> >> > Thanks,
> >> >> > --
> >> >> > Alex Wang,
> >> >> > Open vSwitch developer

--
Alex Wang,
Open vSwitch developer
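For readers following along: the incompatibility discussed above comes down to framing. As a rough illustration (not parquet-mr's or Arrow's actual code; the layout is as described in HADOOP-12990 and PARQUET-1241), Hadoop's block codec prepends two 4-byte big-endian lengths to each raw LZ4 block, whereas the open-source LZ4 frame format begins with a fixed magic number. A decoder expecting one layout will see the other as corrupt data:

```python
import struct

def parse_hadoop_lz4_frame(buf: bytes):
    """Split one Hadoop-framed LZ4 chunk into (uncompressed_len, raw_block).

    Assumed layout (per HADOOP-12990): 4-byte big-endian uncompressed
    length, 4-byte big-endian compressed length, then the raw LZ4 block.
    """
    if len(buf) < 8:
        raise ValueError("buffer too short for Hadoop LZ4 header")
    uncompressed_len, compressed_len = struct.unpack(">II", buf[:8])
    block = buf[8:8 + compressed_len]
    if len(block) != compressed_len:
        raise ValueError("truncated LZ4 block")
    return uncompressed_len, block

def looks_like_lz4_frame(buf: bytes) -> bool:
    # The open-source LZ4 frame format starts with magic 0x184D2204,
    # stored little-endian; Hadoop's framing never produces these bytes
    # (they would imply an implausibly large uncompressed length).
    return buf[:4] == b"\x04\x22\x4d\x18"
```

The `parse_hadoop_lz4_frame` helper is hypothetical, just to make the byte layout concrete; the raw block it returns would still need a raw (block-mode) LZ4 decompressor, not a frame-mode one.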