Re: hadoop LZ4 incompatible with open source LZ4

2018-08-08 Thread ALeX Wang
@Wes

Okay, I think I figured out why I could not read the LZ4-encoded parquet file
generated by parquet-mr.

It turns out hadoop LZ4 has its own framing format.

I summarized details in the JIRA ticket you posted:
https://issues.apache.org/jira/browse/PARQUET-1241

Thanks,
Alex Wang,

On Tue, 7 Aug 2018 at 12:13, Wes McKinney  wrote:

> hi Alex,
>
> here's one thread I remember about this
>
> https://github.com/dask/fastparquet/issues/314#issuecomment-371629605
>
> and a relevant unresolved JIRA
>
> https://issues.apache.org/jira/browse/PARQUET-1241
>
> The first step to resolving this issue is to reconcile what mode of
> LZ4 the Parquet format is supposed to be using
>
> - Wes
>
>
> On Tue, Aug 7, 2018 at 2:10 PM, ALeX Wang  wrote:
> > Hi Wes,
> >
> > Just to share my understanding,
> >
> > In Arrow, my understanding is that it downloads the lz4 from
> > https://github.com/lz4/lz4 (via export
> > LZ4_STATIC_LIB=$ARROW_EP/lz4_ep-prefix/src/lz4_ep/lib/liblz4.a).  So it
> is
> > using the LZ4_FRAMED codec.  But hadoop is not using framed lz4.  So i'll
> > see if I could implement a CodecFactory handle for LZ4_FRAMED in
> parquet-mr,
> >
> > Thanks,
> >
> >
> > On Tue, 7 Aug 2018 at 08:50, Wes McKinney  wrote:
> >
> >> hi Alex,
> >>
> >> No, if you look at the implementation in
> >>
> >>
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/compression_lz4.cc#L32
> >> it is not using the same LZ4 compression style that Hadoop is using;
> >> realistically we need to add a bunch of options to Lz4Codec to be able
> >> to select what we want (or add LZ4_FRAMED codec). I'll have to dig in
> >> my e-mail to find the prior thread
> >>
> >> - Wes
> >>
> >> On Tue, Aug 7, 2018 at 11:45 AM, ALeX Wang  wrote:
> >> > Hi Wes,
> >> >
> >> > Are you talking about this ?
> >> >
> >>
> http://mail-archives.apache.org/mod_mbox/arrow-issues/201805.mbox/%3cjira.13158639.1526013615000.61149.1526228880...@atlassian.jira%3E
> >> >
> >> > I tried to compile with the latest arrow which contain this fix and
> still
> >> > encountered the corruption error.
> >> >
> >> > Also, we tried to read the file using pyparquet, and spark, did not
> work
> >> > either,
> >> >
> >> > Thanks,
> >> > Alex Wang,
> >> >
> >> >
> >> > On Tue, 7 Aug 2018 at 08:37, Wes McKinney 
> wrote:
> >> >
> >> >> hi Alex,
> >> >>
> >> >> I think there was an e-mail thread or JIRA about this, would have to
> >> >> dig it up. LZ4 compression was originally underspecified (has that
> >> >> been fixed) and we aren't using the correct compressor/decompressor
> >> >> options in parquet-cpp at the moment. If you have time to dig in and
> >> >> fix it, it would be much appreciated. Note that the LZ4 code lives in
> >> >> Apache Arrow
> >> >>
> >> >> - Wes
> >> >>
> >> >> On Tue, Aug 7, 2018 at 11:10 AM, ALeX Wang 
> wrote:
> >> >> > Hi,
> >> >> >
> >> >> > Would like to kindly confirm my observation,
> >> >> >
> >> >> > We use parquet-mr (java) to generate parquet file with LZ4
> >> compression.
> >> >> To
> >> >> > do this we have to compile/install hadoop native library with
> provides
> >> >> LZ4
> >> >> > codec.
> >> >> >
> >> >> > However, the generated parquet file, is not recognizable by
> >> >> parquet-cpp.  I
> >> >> > encountered following error when using the `tools/parquet_reader`
> >> binary,
> >> >> >
> >> >> > ```
> >> >> > Parquet error: Arrow error: IOError: Corrupt Lz4 compressed data.
> >> >> > ```
> >> >> >
> >> >> > Further search online get me to this JIRA ticket:
> >> >> > https://issues.apache.org/jira/browse/HADOOP-12990
> >> >> >
> >> >> > So, since hadoop LZ4 is incompatible with open source, parquet-mr
> lz4
> >> is
> >> >> > not compatible with parquet-cpp?
> >> >> >
> >> >> > Thanks,
> >> >> > --
> >> >> > Alex Wang,
> >> >> > Open vSwitch developer
> >> >>
> >> >
> >> >
> >> > --
> >> > Alex Wang,
> >> > Open vSwitch developer
> >>
> >
> >
> > --
> > Alex Wang,
> > Open vSwitch developer
>


-- 
Alex Wang,
Open vSwitch developer


[jira] [Comment Edited] (PARQUET-1241) Use LZ4 frame format

2018-08-08 Thread Alex Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16574328#comment-16574328
 ] 

Alex Wang edited comment on PARQUET-1241 at 8/9/18 5:57 AM:


Hi, [~wesmckinn] and I discussed this issue on the parquet-cpp mailing list.

Upon further investigation, I found from the hadoop Jira ticket 
(https://issues.apache.org/jira/browse/HADOOP-12990) that the hadoop LZ4 format 
prefixes the compressed data with the *original data length (big-endian)* and then 
the *compressed data length (big-endian)*. 

By stepping through the *parquet_reader* binary in gdb while reading a parquet-mr 
(1.10.0 release) generated parquet file (with LZ4 compression), I could confirm that 
the compressed column page buffer indeed has the 8-byte prefix. 
{noformat}
# From gdb:
Breakpoint 1, arrow::Lz4Codec::Decompress (this=0xc1d5e0, input_len=109779,
    input=0x7665442e "", output_len=155352, output_buffer=0x7624b040 "")
    at /opt/parquet-cpp/arrow_ep-prefix/src/arrow_ep/cpp/src/arrow/util/compression_lz4.cc:36
36          static_cast<int>(input_len), static_cast<int>(output_len));
Missing separate debuginfos, use: debuginfo-install glibc-2.17-222.el7.x86_64
libgcc-4.8.5-28.el7_5.1.x86_64 libstdc++-4.8.5-28.el7_5.1.x86_64
(gdb) p/x *((uint32_t*)input+1)
$3 = 0xcbac0100

# From python (convert to little endian):
>>> 0x0001accb
109771
{noformat}
The result is *109771 = 109779 - 8*.  And if I skip the first 8 bytes and then 
decompress, I get the correct column values.
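
To make the 8-byte prefix concrete, here is a small standalone sketch (my own 
illustration with hypothetical helper names, not parquet-cpp or hadoop code) that 
decodes one hadoop-style LZ4 block by reading the two big-endian length words and 
passing only the payload to LZ4_decompress_safe. It assumes liblz4 from 
https://github.com/lz4/lz4 and a single compressed chunk per block, which matches 
the buffer examined above:
{noformat}
#include <lz4.h>

#include <cstdint>
#include <stdexcept>
#include <vector>

// Read a 32-bit big-endian integer, as the hadoop Lz4Codec writes it.
static uint32_t ReadBigEndian32(const uint8_t* p) {
  return (static_cast<uint32_t>(p[0]) << 24) | (static_cast<uint32_t>(p[1]) << 16) |
         (static_cast<uint32_t>(p[2]) << 8) | static_cast<uint32_t>(p[3]);
}

// Decompress one hadoop-framed LZ4 block laid out as:
//   [4 bytes original length, BE][4 bytes compressed length, BE][raw LZ4 block]
std::vector<uint8_t> DecompressHadoopLz4Block(const uint8_t* input, size_t input_len) {
  if (input_len < 8) {
    throw std::runtime_error("buffer too small for hadoop LZ4 header");
  }
  const uint32_t original_len = ReadBigEndian32(input);
  const uint32_t compressed_len = ReadBigEndian32(input + 4);
  if (compressed_len != input_len - 8) {
    throw std::runtime_error("compressed length prefix does not match buffer size");
  }

  std::vector<uint8_t> output(original_len);
  const int n = LZ4_decompress_safe(reinterpret_cast<const char*>(input + 8),
                                    reinterpret_cast<char*>(output.data()),
                                    static_cast<int>(compressed_len),
                                    static_cast<int>(original_len));
  if (n < 0 || static_cast<uint32_t>(n) != original_len) {
    throw std::runtime_error("corrupt hadoop LZ4 block");
  }
  return output;
}
{noformat}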

Since it seems to me that hadoop is unlikely to change this format (it has been 
there since 2011), I'd like to propose changes like the ones below, which try to 
identify the hadoop LZ4 format if the initial decompression attempt fails:
{noformat}
diff --git a/cpp/src/arrow/util/compression_lz4.cc b/cpp/src/arrow/util/compression_lz4.cc
index 23a5c39..feeb124 100644
--- a/cpp/src/arrow/util/compression_lz4.cc
+++ b/cpp/src/arrow/util/compression_lz4.cc
@@ -22,6 +22,7 @@
 #include <lz4.h>
 
 #include "arrow/status.h"
+#include "arrow/util/bit-util.h"
 #include "arrow/util/macros.h"
 
 namespace arrow {
@@ -35,6 +36,19 @@ Status Lz4Codec::Decompress(int64_t input_len, const uint8_t* input, int64_t output_len,
       reinterpret_cast<const char*>(input), reinterpret_cast<char*>(output_buffer),
       static_cast<int>(input_len), static_cast<int>(output_len));
   if (decompressed_size < 0) {
+    // For the hadoop lz4 compression format, the compressed data is prefixed
+    // with the original data length (big-endian) and then the compressed data
+    // length (big-endian).
+    //
+    // If the prefix could match the format, try to decompress from 'input + 8'.
+    if (BitUtil::FromBigEndian(*(reinterpret_cast<const uint32_t*>(input) + 1)) == input_len - 8) {
+      decompressed_size = LZ4_decompress_safe(
+          reinterpret_cast<const char*>(input) + 8, reinterpret_cast<char*>(output_buffer),
+          static_cast<int>(input_len) - 8, static_cast<int>(output_len));
+      if (decompressed_size >= 0) {
+        return Status::OK();
+      }
+    }
     return Status::IOError("Corrupt Lz4 compressed data.");
   }
   return Status::OK();
{noformat}
 


> Use LZ4 frame format
> 
>
> Key: PARQUET-1241
> URL: https://issues.apache.org/jira/browse/PARQUET-1241
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp, parquet-format
>Reporter: Lawrence Chan
>Priority: Major
>
> The parquet-format spec doesn't currently specify whether lz4-compressed data 
> should be framed or not. We should choose one and make it explicit in the 
> spec, as they are not inter-operable. After some discussions with others [1], 
> we think it would be beneficial to use the framed format, which adds a small 
> header in exchange for more self-contained decompression as well as a richer 
> feature set (checksums, parallel decompression, etc).
> The current arrow implementation compresses using the lz4 block format, and 
> this would need to be updated when we add the spec clarification.
> If backwards compatibility is a concern, I would suggest adding an additional 
> LZ4_FRAMED compression type, but that may be more noise than anything.
> [1] https://github.com/dask/fastparquet/issues/314
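
For illustration of the framed alternative this ticket proposes, below is a minimal 
sketch (my own, not Arrow or Parquet code) using lz4frame.h from 
https://github.com/lz4/lz4. A frame produced this way starts with the 0x184D2204 
magic and can optionally carry the content size and checksums, which is what makes 
decompression self-contained:
{noformat}
#include <lz4frame.h>

#include <iostream>
#include <vector>

int main() {
  const char src[] = "hello hello hello hello hello";

  // Worst-case frame size for this input with default preferences (nullptr).
  const size_t bound = LZ4F_compressFrameBound(sizeof(src), nullptr);
  std::vector<char> dst(bound);

  // Compress the whole buffer into a single self-describing LZ4 frame.
  const size_t frame_size =
      LZ4F_compressFrame(dst.data(), dst.size(), src, sizeof(src), nullptr);
  if (LZ4F_isError(frame_size)) {
    std::cerr << "compression failed: " << LZ4F_getErrorName(frame_size) << std::endl;
    return 1;
  }
  std::cout << "frame of " << frame_size << " bytes, beginning with the 0x184D2204 magic"
            << std::endl;
  return 0;
}
{noformat}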



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1352) [CPP] Trying to write an arrow table with structs to a parquet file

2018-08-08 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16573720#comment-16573720
 ] 

Wes McKinney commented on PARQUET-1352:
---

Either you can contribute to the nested data support project (which is ongoing 
in https://github.com/apache/parquet-cpp/pull/462) or wait for other people to 
do it. I hope it gets done by the end of 2018

> [CPP] Trying to write an arrow table with structs to a parquet file
> ---
>
> Key: PARQUET-1352
> URL: https://issues.apache.org/jira/browse/PARQUET-1352
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Affects Versions: cpp-1.4.0
>Reporter: Dragan Markovic
>Priority: Major
>
> Relevant issue:[https://github.com/apache/arrow/issues/2287]
>  
> I'm creating a struct with the following schema in arrow: 
> https://pastebin.com/Cc8nreBP
>  
> When I try to convert that table to a .parquet file, the file gets created 
> with a valid schema (the one I posted above) and then throws this exception: 
> "lemented: Level generation for Struct not supported yet".
>  
> Here's the code: [https://ideone.com/DJkKUF]
>  
> Is there any way to write arrow table of structs to a .parquet file in cpp? 
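
As a point of reference, the minimal sketch below (a hypothetical two-field schema, 
not the one from the pastebin above) shows what declaring a struct column with the 
Arrow C++ API looks like; it is the subsequent write of such a table through the 
parquet-cpp Arrow layer that currently stops with the NotImplemented error quoted above:
{noformat}
#include <arrow/api.h>

#include <iostream>
#include <memory>

int main() {
  // Hypothetical struct column: a "record" struct holding an int64 and a string.
  auto struct_type = arrow::struct_(
      {arrow::field("id", arrow::int64()), arrow::field("name", arrow::utf8())});
  auto schema = arrow::schema({arrow::field("record", struct_type)});

  // Building a table with this schema works fine in Arrow; writing it to
  // Parquet is the step that is not supported yet.
  std::cout << schema->ToString() << std::endl;
  return 0;
}
{noformat}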



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: hadoop LZ4 incompatible with open source LZ4

2018-08-08 Thread ALeX Wang
Hi Wes,

Thanks again for the pointers.  While investigating, I noticed this
possible bug:

The private member variables 'min_' and 'max_' in the 'TypedRowGroupStatistics'
class are not initialized in the constructor.

I got an abort while trying to read a column using 'parquet_reader', and a gdb
breakpoint at the constructor shows that those variables are indeed not
initialized.

```
Breakpoint 1, parquet::TypedRowGroupStatistics<...>::TypedRowGroupStatistics (
    this=0xbafd48, schema=0xc17080, encoded_min="", encoded_max="",
    num_values=0, null_count=39160, distinct_count=0,
    has_min_max=true, pool=0xbab8e0)
    at /opt/gpdbbuild/parquet-cpp/src/parquet/statistics.cc:74
74          if (!encoded_min.empty()) {
(gdb) p min_
$37 = 116
(gdb) p max_
$38 = false
```
And since both 'encoded_min' and 'encoded_max' are empty strings, the members are
never set...


So, if this is a valid issue, I'm proposing the following fix:
```
diff --git a/src/parquet/statistics.cc b/src/parquet/statistics.cc
index ea7f783..5d61bc9 100644
--- a/src/parquet/statistics.cc
+++ b/src/parquet/statistics.cc
@@ -65,6 +65,8 @@ TypedRowGroupStatistics<DType>::TypedRowGroupStatistics(
     : pool_(pool),
       min_buffer_(AllocateBuffer(pool_, 0)),
       max_buffer_(AllocateBuffer(pool_, 0)) {
+  using T = typename DType::c_type;
+
   IncrementNumValues(num_values);
   IncrementNullCount(null_count);
   IncrementDistinctCount(distinct_count);
@@ -73,9 +75,13 @@ TypedRowGroupStatistics<DType>::TypedRowGroupStatistics(

   if (!encoded_min.empty()) {
     PlainDecode(encoded_min, &min_);
+  } else {
+    min_ = T();
   }
   if (!encoded_max.empty()) {
     PlainDecode(encoded_max, &max_);
+  } else {
+    max_ = T();
   }
   has_min_max_ = has_min_max;
 }
```
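
As a side note on the proposed fix (this is just an illustrative standalone snippet
with made-up names, not parquet-cpp code): assigning T() value-initializes the
member, which yields zero for primitive c_type values, whereas leaving the members
untouched keeps them indeterminate, which is what gdb showed above:
```
#include <iostream>

// Stand-in for the statistics object; the real class template lives in parquet-cpp.
struct Stats {
  double min_;  // deliberately not initialized by any constructor
  double max_;
};

int main() {
  Stats s;            // min_/max_ hold indeterminate values here
  s.min_ = double();  // what `min_ = T();` does in the proposed patch: value-init to 0
  s.max_ = double();
  std::cout << s.min_ << " " << s.max_ << std::endl;  // prints "0 0"
  return 0;
}
```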

Thanks,

On Tue, 7 Aug 2018 at 12:13, Wes McKinney  wrote:

> hi Alex,
>
> here's one thread I remember about this
>
> https://github.com/dask/fastparquet/issues/314#issuecomment-371629605
>
> and a relevant unresolved JIRA
>
> https://issues.apache.org/jira/browse/PARQUET-1241
>
> The first step to resolving this issue is to reconcile what mode of
> LZ4 the Parquet format is supposed to be using
>
> - Wes
>
>
> On Tue, Aug 7, 2018 at 2:10 PM, ALeX Wang  wrote:
> > Hi Wes,
> >
> > Just to share my understanding,
> >
> > In Arrow, my understanding is that it downloads the lz4 from
> > https://github.com/lz4/lz4 (via export
> > LZ4_STATIC_LIB=$ARROW_EP/lz4_ep-prefix/src/lz4_ep/lib/liblz4.a).  So it
> is
> > using the LZ4_FRAMED codec.  But hadoop is not using framed lz4.  So i'll
> > see if I could implement a CodecFactory handle for LZ4_FRAMED in
> parquet-mr,
> >
> > Thanks,
> >
> >
> > On Tue, 7 Aug 2018 at 08:50, Wes McKinney  wrote:
> >
> >> hi Alex,
> >>
> >> No, if you look at the implementation in
> >>
> >>
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/compression_lz4.cc#L32
> >> it is not using the same LZ4 compression style that Hadoop is using;
> >> realistically we need to add a bunch of options to Lz4Codec to be able
> >> to select what we want (or add LZ4_FRAMED codec). I'll have to dig in
> >> my e-mail to find the prior thread
> >>
> >> - Wes
> >>
> >> On Tue, Aug 7, 2018 at 11:45 AM, ALeX Wang  wrote:
> >> > Hi Wes,
> >> >
> >> > Are you talking about this ?
> >> >
> >>
> http://mail-archives.apache.org/mod_mbox/arrow-issues/201805.mbox/%3cjira.13158639.1526013615000.61149.1526228880...@atlassian.jira%3E
> >> >
> >> > I tried to compile with the latest arrow which contain this fix and
> still
> >> > encountered the corruption error.
> >> >
> >> > Also, we tried to read the file using pyparquet, and spark, did not
> work
> >> > either,
> >> >
> >> > Thanks,
> >> > Alex Wang,
> >> >
> >> >
> >> > On Tue, 7 Aug 2018 at 08:37, Wes McKinney 
> wrote:
> >> >
> >> >> hi Alex,
> >> >>
> >> >> I think there was an e-mail thread or JIRA about this, would have to
> >> >> dig it up. LZ4 compression was originally underspecified (has that
> >> >> been fixed) and we aren't using the correct compressor/decompressor
> >> >> options in parquet-cpp at the moment. If you have time to dig in and
> >> >> fix it, it would be much appreciated. Note that the LZ4 code lives in
> >> >> Apache Arrow
> >> >>
> >> >> - Wes
> >> >>
> >> >> On Tue, Aug 7, 2018 at 11:10 AM, ALeX Wang 
> wrote:
> >> >> > Hi,
> >> >> >
> >> >> > Would like to kindly confirm my observation,
> >> >> >
> >> >> > We use parquet-mr (java) to generate parquet file with LZ4
> >> compression.
> >> >> To
> >> >> > do this we have to compile/install hadoop native library with
> provides
> >> >> LZ4
> >> >> > codec.
> >> >> >
> >> >> > However, the generated parquet file, is not recognizable by
> >> >> parquet-cpp.  I
> >> >> > encountered following error when using the `tools/parquet_reader`
> >> binary,
> >> >> >
> >> >> > ```
> >> >> > Parquet error: Arrow error: IOError: Corrupt Lz4 compressed data.
> >> >> > ```
> >> >> >
> >> >> > Further search online get me to this JIRA ticket:
> >> >> > 

Date and time for next Parquet sync

2018-08-08 Thread Nandor Kollar
Hi All,

It has been a while since we had a Parquet sync, so I'd like to propose
having one next week, on August 15th, at 6 pm CET / 9 am PST.

I'll send a meeting invite with the details soon; let me know if this time
is not suitable for you!

Since the last sync there are a couple of topics to discuss, such as:
- Status of Parquet encryption
- Release a new minor version, scope of the new release
- Bloom filters
- Move Java specific code from parquet-format to parquet-mr
- parquet.thrift usage best practices in different language bindings (Java,
C++, Python, Rust)
- LZ4 incompatibility

The agenda is open for suggestions.

Regards,
Nandor


[jira] [Created] (PARQUET-1373) Encryption key management tools

2018-08-08 Thread Gidon Gershinsky (JIRA)
Gidon Gershinsky created PARQUET-1373:
-

 Summary: Encryption key management tools 
 Key: PARQUET-1373
 URL: https://issues.apache.org/jira/browse/PARQUET-1373
 Project: Parquet
  Issue Type: New Feature
  Components: parquet-mr
Reporter: Gidon Gershinsky
Assignee: Gidon Gershinsky


Parquet Modular Encryption 
([PARQUET-1178|https://issues.apache.org/jira/browse/PARQUET-1178]) provides an 
API that accepts keys, arbitrary key metadata and key retrieval callbacks, 
which allows one to implement basically any key management policy on top of it. 
This Jira will add tools that implement a set of best-practice elements for key 
management. This is not an end-to-end key management solution, but rather a set of 
components that might simplify the design and development of one.

For example, the tools will cover
 * modification of key metadata inside existing Parquet files.
 * support for re-keying that doesn't require modification of Parquet files.

 

Parquet will not mandate the use of these tools. Users will be able to continue 
working with the basic API to create any custom key management solution that 
addresses their security requirements. If it helps, they can also utilize some or 
all of these tools.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)