Re: hadoop LZ4 incompatible with open source LZ4

ALeX Wang Tue, 07 Aug 2018 11:10:24 -0700

Hi Wes,

Just to share my understanding:


In Arrow, lz4 is downloaded from https://github.com/lz4/lz4 (via export
LZ4_STATIC_LIB=$ARROW_EP/lz4_ep-prefix/src/lz4_ep/lib/liblz4.a), so my
understanding is that it is using the LZ4_FRAMED codec.  Hadoop, however,
is not using framed lz4.  So I'll see if I can implement a CodecFactory
handle for LZ4_FRAMED in parquet-mr.
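
For reference, here is a rough Python sketch (stdlib only) of how the two
layouts could be told apart.  It assumes the Hadoop Lz4Codec writes a 4-byte
big-endian uncompressed length, then a 4-byte big-endian compressed length,
then a raw LZ4 block, with exactly one compressed chunk per header.  That
layout and the function names are my assumptions for illustration, not
something taken from the Arrow or parquet-mr code.  The magic number
0x184D2204 (stored little-endian) is the one defined by the LZ4 frame format.

```python
import struct

# LZ4 frame magic number 0x184D2204, stored little-endian on disk.
LZ4_FRAME_MAGIC = b"\x04\x22\x4d\x18"


def looks_like_lz4_frame(buf: bytes) -> bool:
    """Return True if buf starts with the LZ4 frame magic number."""
    return buf[:4] == LZ4_FRAME_MAGIC


def split_hadoop_lz4(buf: bytes):
    """Split a buffer in the assumed Hadoop layout into its blocks.

    Assumed layout, repeated until the buffer is exhausted:
      4 bytes  big-endian uncompressed length
      4 bytes  big-endian compressed length
      N bytes  raw LZ4 block (N = compressed length)

    Returns a list of (uncompressed_length, raw_block_bytes) tuples.
    The raw blocks are NOT decompressed here.
    """
    blocks = []
    pos = 0
    while pos < len(buf):
        if len(buf) - pos < 8:
            raise ValueError("truncated header at offset %d" % pos)
        uncompressed_len, compressed_len = struct.unpack_from(">II", buf, pos)
        pos += 8
        end = pos + compressed_len
        if end > len(buf):
            raise ValueError("block at offset %d runs past end of buffer" % pos)
        blocks.append((uncompressed_len, buf[pos:end]))
        pos = end
    return blocks
```

Each raw block returned by split_hadoop_lz4 would still have to be passed to
an LZ4 block decompressor.  A reader that expects a different layout will not
find what it is looking for at offset 0, which would be consistent with the
"Corrupt Lz4 compressed data" error.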

Thanks,


On Tue, 7 Aug 2018 at 08:50, Wes McKinney <[email protected]> wrote:

> hi Alex,
>
> No, if you look at the implementation in
>
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/compression_lz4.cc#L32
> it is not using the same LZ4 compression style that Hadoop is using;
> realistically we need to add a bunch of options to Lz4Codec to be able
> to select what we want (or add LZ4_FRAMED codec). I'll have to dig in
> my e-mail to find the prior thread
>
> - Wes
>
> On Tue, Aug 7, 2018 at 11:45 AM, ALeX Wang <[email protected]> wrote:
> > Hi Wes,
> >
> > Are you talking about this?
> >
> > http://mail-archives.apache.org/mod_mbox/arrow-issues/201805.mbox/%[email protected]%3E
> >
> > I tried to compile with the latest arrow, which contains this fix, and
> > still encountered the corruption error.
> >
> > Also, we tried to read the file using pyparquet and spark; that did not
> > work either.
> >
> > Thanks,
> > Alex Wang,
> >
> >
> > On Tue, 7 Aug 2018 at 08:37, Wes McKinney <[email protected]> wrote:
> >
> >> hi Alex,
> >>
> >> I think there was an e-mail thread or JIRA about this, would have to
> >> dig it up. LZ4 compression was originally underspecified (has that
> >> been fixed?) and we aren't using the correct compressor/decompressor
> >> options in parquet-cpp at the moment. If you have time to dig in and
> >> fix it, it would be much appreciated. Note that the LZ4 code lives in
> >> Apache Arrow.
> >>
> >> - Wes
> >>
> >> On Tue, Aug 7, 2018 at 11:10 AM, ALeX Wang <[email protected]> wrote:
> >> > Hi,
> >> >
> >> > Would like to kindly confirm my observation,
> >> >
> >> > We use parquet-mr (java) to generate parquet files with LZ4
> >> > compression. To do this we have to compile/install the hadoop native
> >> > library, which provides the LZ4 codec.
> >> >
> >> > However, the generated parquet file is not recognizable by
> >> > parquet-cpp.  I encountered the following error when using the
> >> > `tools/parquet_reader` binary:
> >> >
> >> > ```
> >> > Parquet error: Arrow error: IOError: Corrupt Lz4 compressed data.
> >> > ```
> >> >
> >> > Searching online further led me to this JIRA ticket:
> >> > https://issues.apache.org/jira/browse/HADOOP-12990
> >> >
> >> > So, since hadoop LZ4 is incompatible with open source LZ4, is
> >> > parquet-mr lz4 not compatible with parquet-cpp?
> >> >
> >> > Thanks,
> >> > --
> >> > Alex Wang,
> >> > Open vSwitch developer
> >>
> >
> >
> > --
> > Alex Wang,
> > Open vSwitch developer
>


-- 
Alex Wang,
Open vSwitch developer
