Hi Wes,
Thanks again for the pointers. While investigating I noticed a possible bug:
the private members 'min_' and 'max_' of the 'TypedRowGroupStatistics' class
are not initialized in the constructor.
I got an abort while trying to read a column using 'parquet_reader', and a
gdb breakpoint at the constructor shows that those members are
uninitialized:
```
Breakpoint 1, parquet::TypedRowGroupStatistics<parquet::DataType<(parquet::Type::type)0> >::TypedRowGroupStatistics (
    this=0xbafd48, schema=0xc17080, encoded_min="", encoded_max="",
    num_values=0, null_count=39160, distinct_count=0,
    has_min_max=true, pool=0xbab8e0 <arrow::default_memory_pool()::default_memory_pool_>)
    at /opt/gpdbbuild/parquet-cpp/src/parquet/statistics.cc:74
74        if (!encoded_min.empty()) {
(gdb) p min_
$37 = 116
(gdb) p max_
$38 = false
```
Since both 'encoded_min' and 'encoded_max' are empty strings, 'min_' and
'max_' are never assigned...
So, if this is a valid issue, I'm proposing the following fix:
```
diff --git a/src/parquet/statistics.cc b/src/parquet/statistics.cc
index ea7f783..5d61bc9 100644
--- a/src/parquet/statistics.cc
+++ b/src/parquet/statistics.cc
@@ -65,6 +65,8 @@ TypedRowGroupStatistics<DType>::TypedRowGroupStatistics(
     : pool_(pool),
       min_buffer_(AllocateBuffer(pool_, 0)),
       max_buffer_(AllocateBuffer(pool_, 0)) {
+  using T = typename DType::c_type;
+
   IncrementNumValues(num_values);
   IncrementNullCount(null_count);
   IncrementDistinctCount(distinct_count);
@@ -73,9 +75,13 @@ TypedRowGroupStatistics<DType>::TypedRowGroupStatistics(
 
   if (!encoded_min.empty()) {
     PlainDecode(encoded_min, &min_);
+  } else {
+    min_ = T();
   }
   if (!encoded_max.empty()) {
     PlainDecode(encoded_max, &max_);
+  } else {
+    max_ = T();
   }
   has_min_max_ = has_min_max;
 }
```
Thanks,
On Tue, 7 Aug 2018 at 12:13, Wes McKinney <[email protected]> wrote:
> hi Alex,
>
> here's one thread I remember about this
>
> https://github.com/dask/fastparquet/issues/314#issuecomment-371629605
>
> and a relevant unresolved JIRA
>
> https://issues.apache.org/jira/browse/PARQUET-1241
>
> The first step to resolving this issue is to reconcile what mode of
> LZ4 the Parquet format is supposed to be using
>
> - Wes
>
>
> On Tue, Aug 7, 2018 at 2:10 PM, ALeX Wang <[email protected]> wrote:
> > Hi Wes,
> >
> > Just to share my understanding,
> >
> > In Arrow, my understanding is that it downloads the lz4 from
> > https://github.com/lz4/lz4 (via export
> > LZ4_STATIC_LIB=$ARROW_EP/lz4_ep-prefix/src/lz4_ep/lib/liblz4.a). So it is
> > using the LZ4_FRAMED codec. But hadoop is not using framed lz4. So I'll
> > see if I could implement a CodecFactory handle for LZ4_FRAMED in
> > parquet-mr,
> >
> > Thanks,
> >
> >
> > On Tue, 7 Aug 2018 at 08:50, Wes McKinney <[email protected]> wrote:
> >
> >> hi Alex,
> >>
> >> No, if you look at the implementation in
> >>
> >> https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/compression_lz4.cc#L32
> >> it is not using the same LZ4 compression style that Hadoop is using;
> >> realistically we need to add a bunch of options to Lz4Codec to be able
> >> to select what we want (or add LZ4_FRAMED codec). I'll have to dig in
> >> my e-mail to find the prior thread
> >>
> >> - Wes
> >>
> >> On Tue, Aug 7, 2018 at 11:45 AM, ALeX Wang <[email protected]> wrote:
> >> > Hi Wes,
> >> >
> >> > Are you talking about this ?
> >> >
> >> > http://mail-archives.apache.org/mod_mbox/arrow-issues/201805.mbox/%[email protected]%3E
> >> >
> >> > I tried to compile with the latest arrow which contains this fix and
> >> > still encountered the corruption error.
> >> >
> >> > Also, we tried to read the file using pyparquet, and spark, did not
> >> > work either,
> >> >
> >> > Thanks,
> >> > Alex Wang,
> >> >
> >> >
> >> > On Tue, 7 Aug 2018 at 08:37, Wes McKinney <[email protected]> wrote:
> >> >
> >> >> hi Alex,
> >> >>
> >> >> I think there was an e-mail thread or JIRA about this, would have to
> >> >> dig it up. LZ4 compression was originally underspecified (has that
> >> >> been fixed) and we aren't using the correct compressor/decompressor
> >> >> options in parquet-cpp at the moment. If you have time to dig in and
> >> >> fix it, it would be much appreciated. Note that the LZ4 code lives in
> >> >> Apache Arrow
> >> >>
> >> >> - Wes
> >> >>
> >> >> On Tue, Aug 7, 2018 at 11:10 AM, ALeX Wang <[email protected]> wrote:
> >> >> > Hi,
> >> >> >
> >> >> > Would like to kindly confirm my observation,
> >> >> >
> >> >> > We use parquet-mr (java) to generate a parquet file with LZ4
> >> >> > compression. To do this we have to compile/install the hadoop
> >> >> > native library which provides the LZ4 codec.
> >> >> >
> >> >> > However, the generated parquet file is not recognizable by
> >> >> > parquet-cpp. I encountered the following error when using the
> >> >> > `tools/parquet_reader` binary,
> >> >> >
> >> >> > ```
> >> >> > Parquet error: Arrow error: IOError: Corrupt Lz4 compressed data.
> >> >> > ```
> >> >> >
> >> >> > Further search online got me to this JIRA ticket:
> >> >> > https://issues.apache.org/jira/browse/HADOOP-12990
> >> >> >
> >> >> > So, since hadoop LZ4 is incompatible with open source, parquet-mr
> >> >> > lz4 is not compatible with parquet-cpp?
> >> >> >
> >> >> > Thanks,
> >> >> > --
> >> >> > Alex Wang,
> >> >> > Open vSwitch developer
> >> >>
> >> >
> >> >
> >> > --
> >> > Alex Wang,
> >> > Open vSwitch developer
> >>
> >
> >
> > --
> > Alex Wang,
> > Open vSwitch developer
>
--
Alex Wang,
Open vSwitch developer