hi Alex,

here's one thread I remember about this

https://github.com/dask/fastparquet/issues/314#issuecomment-371629605

and a relevant unresolved JIRA

https://issues.apache.org/jira/browse/PARQUET-1241

The first step to resolving this issue is to reconcile which mode of
LZ4 the Parquet format is supposed to be using.
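
To make the distinction between the candidate modes concrete, here is a
minimal Python sketch that guesses which LZ4 flavor a buffer uses from its
leading bytes. It is an illustration only, not code from Arrow or parquet-mr.
The function name `guess_lz4_flavor` is hypothetical. The frame magic number
(0x184D2204) comes from the LZ4 frame format. The Hadoop-style prefix (two
big-endian 32-bit sizes ahead of a raw block) is an assumption about how the
Hadoop codec lays out its output, and the check is only a heuristic.

```python
import struct

# LZ4 frame format magic number 0x184D2204, stored little-endian on disk.
LZ4_FRAME_MAGIC = b"\x04\x22\x4d\x18"


def guess_lz4_flavor(buf: bytes) -> str:
    """Heuristically classify an LZ4-compressed buffer by its leading bytes."""
    if buf[:4] == LZ4_FRAME_MAGIC:
        return "frame"
    if len(buf) >= 8:
        # ASSUMPTION: Hadoop-style output starts with two big-endian uint32
        # values: the uncompressed size, then the compressed size of the
        # raw LZ4 block that follows.
        uncompressed_len, compressed_len = struct.unpack(">II", buf[:8])
        if uncompressed_len > 0 and 0 < compressed_len <= len(buf) - 8:
            return "hadoop-prefixed block"
    # A raw LZ4 block has no header, so this is the fallback.
    return "raw block (or unknown)"
```

A raw block can start with bytes that happen to look like a plausible prefix,
so a result of "hadoop-prefixed block" is a hint rather than proof.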

- Wes


On Tue, Aug 7, 2018 at 2:10 PM, ALeX Wang <ee07b...@gmail.com> wrote:
> Hi Wes,
>
> Just to share my understanding,
>
> In Arrow, my understanding is that it downloads lz4 from
> https://github.com/lz4/lz4 (via export
> LZ4_STATIC_LIB=$ARROW_EP/lz4_ep-prefix/src/lz4_ep/lib/liblz4.a), so it is
> using the LZ4_FRAMED codec. But Hadoop is not using framed LZ4, so I'll
> see if I can implement a CodecFactory handle for LZ4_FRAMED in parquet-mr.
>
> Thanks,
>
>
> On Tue, 7 Aug 2018 at 08:50, Wes McKinney <wesmck...@gmail.com> wrote:
>
>> hi Alex,
>>
>> No, if you look at the implementation in
>>
>> https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/compression_lz4.cc#L32
>> it is not using the same LZ4 compression style that Hadoop is using;
>> realistically we need to add a bunch of options to Lz4Codec to be able
>> to select what we want (or add an LZ4_FRAMED codec). I'll have to dig
>> through my e-mail to find the prior thread.
>>
>> - Wes
>>
>> On Tue, Aug 7, 2018 at 11:45 AM, ALeX Wang <ee07b...@gmail.com> wrote:
>> > Hi Wes,
>> >
>> > Are you talking about this?
>> >
>> > http://mail-archives.apache.org/mod_mbox/arrow-issues/201805.mbox/%3cjira.13158639.1526013615000.61149.1526228880...@atlassian.jira%3E
>> >
>> > I tried compiling with the latest Arrow, which contains this fix, and still
>> > encountered the corruption error.
>> >
>> > We also tried to read the file using pyparquet and Spark; neither worked.
>> >
>> >
>> > Thanks,
>> > Alex Wang,
>> >
>> >
>> > On Tue, 7 Aug 2018 at 08:37, Wes McKinney <wesmck...@gmail.com> wrote:
>> >
>> >> hi Alex,
>> >>
>> >> I think there was an e-mail thread or JIRA about this; I would have to
>> >> dig it up. LZ4 compression was originally underspecified (has that
>> >> been fixed?) and we aren't using the correct compressor/decompressor
>> >> options in parquet-cpp at the moment. If you have time to dig in and
>> >> fix it, it would be much appreciated. Note that the LZ4 code lives in
>> >> Apache Arrow.
>> >>
>> >> - Wes
>> >>
>> >> On Tue, Aug 7, 2018 at 11:10 AM, ALeX Wang <ee07b...@gmail.com> wrote:
>> >> > Hi,
>> >> >
>> >> > Would like to kindly confirm my observation,
>> >> >
>> >> > We use parquet-mr (Java) to generate Parquet files with LZ4 compression.
>> >> > To do this we have to compile/install the Hadoop native library, which
>> >> > provides the LZ4 codec.
>> >> >
>> >> > However, the generated Parquet file is not readable by parquet-cpp. I
>> >> > encountered the following error when using the `tools/parquet_reader`
>> >> > binary:
>> >> >
>> >> > ```
>> >> > Parquet error: Arrow error: IOError: Corrupt Lz4 compressed data.
>> >> > ```
>> >> >
>> >> > Further searching online led me to this JIRA ticket:
>> >> > https://issues.apache.org/jira/browse/HADOOP-12990
>> >> >
>> >> > So, since Hadoop's LZ4 is incompatible with the open-source LZ4, is
>> >> > parquet-mr's LZ4 output not compatible with parquet-cpp?
>> >> >
>> >> > Thanks,
>> >> > --
>> >> > Alex Wang,
>> >> > Open vSwitch developer
>> >>
>> >
>> >
>> > --
>> > Alex Wang,
>> > Open vSwitch developer
>>
>
>
> --
> Alex Wang,
> Open vSwitch developer
