[jira] [Commented] (PARQUET-1241) Use LZ4 frame format

2018-08-19 Thread Alex Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16585290#comment-16585290
 ] 

Alex Wang commented on PARQUET-1241:


Thanks a lot [~jonathan.underw...@gmail.com] for the clarification. I did not 
mean to cross-post; I saw there was a discussion about how the Hadoop codec 
works.

 

If needed, I could create another ticket for "parquet-mr using Hadoop codec not 
compatible with arrow/cpp codec".

> Use LZ4 frame format
> 
>
> Key: PARQUET-1241
> URL: https://issues.apache.org/jira/browse/PARQUET-1241
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp, parquet-format
>Reporter: Lawrence Chan
>Priority: Major
>
> The parquet-format spec doesn't currently specify whether lz4-compressed data 
> should be framed or not. We should choose one and make it explicit in the 
> spec, as they are not inter-operable. After some discussions with others [1], 
> we think it would be beneficial to use the framed format, which adds a small 
> header in exchange for more self-contained decompression as well as a richer 
> feature set (checksums, parallel decompression, etc).
> The current arrow implementation compresses using the lz4 block format, and 
> this would need to be updated when we add the spec clarification.
> If backwards compatibility is a concern, I would suggest adding an additional 
> LZ4_FRAMED compression type, but that may be more noise than anything.
> [1] https://github.com/dask/fastparquet/issues/314



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (PARQUET-1241) Use LZ4 frame format

2018-08-19 Thread Alex Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16585285#comment-16585285
 ] 

Alex Wang edited comment on PARQUET-1241 at 8/19/18 11:26 PM:
--

[~wesmckinn] sorry for the delayed reply,

 

-I'd like to add an lz4-hadoop (framed) format to Arrow, which aligns with my work 
interests.-  For the official LZ4 frame format, I'd like to help with that as well, 
but it depends on my work schedule.

 

Sorry, on second thought I meant adding an LZ4 compressor (which uses the 
open-source lz4-java library) to parquet-mr, like SnappyCompressor.java: 
[https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/codec/SnappyCompressor.java]

 

The reason is that even if I added a new Hadoop LZ4 codec to arrow/cpp, 
parquet-mr would still write the Hadoop LZ4 format and set the compression type 
to LZ4 in the file's metadata.

 


was (Author: ee07b291):
[~wesmckinn] sorry for the delayed reply,

 

I'd like to add an lz4-hadoop (framed) format to Arrow, which aligns with my work 
interests.  For the official LZ4 frame format, I'd like to help with that as well, 
but it depends on my work schedule.



[jira] [Commented] (PARQUET-1241) Use LZ4 frame format

2018-08-19 Thread Alex Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16585285#comment-16585285
 ] 

Alex Wang commented on PARQUET-1241:


[~wesmckinn] sorry for the delayed reply,

 

I'd like to add an lz4-hadoop (framed) format to Arrow, which aligns with my work 
interests.  For the official LZ4 frame format, I'd like to help with that as well, 
but it depends on my work schedule.



[jira] [Commented] (PARQUET-1241) Use LZ4 frame format

2018-08-13 Thread Alex Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579277#comment-16579277
 ] 

Alex Wang commented on PARQUET-1241:


Kindly check on this ticket.

The current Arrow implementation uses non-framed LZ4, so we could either change 
Arrow to detect the Hadoop LZ4 format or add an LZ4 implementation to parquet-mr 
like SnappyCompressor, which is based on `org.xerial.snappy.Snappy`.

 



[jira] [Comment Edited] (PARQUET-1241) Use LZ4 frame format

2018-08-09 Thread Alex Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16574328#comment-16574328
 ] 

Alex Wang edited comment on PARQUET-1241 at 8/9/18 9:13 PM:


Hi, [~wesmckinn] and I discussed this issue on the parquet-cpp mailing list.

Upon further investigation, I found from the Hadoop JIRA ticket 
(https://issues.apache.org/jira/browse/HADOOP-12990) that the Hadoop LZ4 format 
prefixes the compressed data with the *original data length (big-endian)* followed 
by the *compressed data length (big-endian)*.
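
For illustration only, here is a minimal standalone sketch (my own hypothetical 
helper, not the Arrow or Hadoop code) of how a reader could strip this 8-byte 
prefix before handing the payload to liblz4, assuming a single LZ4 block follows 
the two length words:
{noformat}
// Hypothetical helper: decode the assumed Hadoop prefix, then decompress the
// raw LZ4 block that follows it.
#include <lz4.h>

#include <cstdint>
#include <vector>

// Reads a 32-bit big-endian integer from 'p'.
static uint32_t ReadBigEndian32(const uint8_t* p) {
  return (static_cast<uint32_t>(p[0]) << 24) | (static_cast<uint32_t>(p[1]) << 16) |
         (static_cast<uint32_t>(p[2]) << 8) | static_cast<uint32_t>(p[3]);
}

std::vector<char> DecompressHadoopLz4(const uint8_t* input, int64_t input_len) {
  if (input_len < 8) return {};                                // no room for the prefix
  const uint32_t original_len = ReadBigEndian32(input);        // uncompressed size
  const uint32_t compressed_len = ReadBigEndian32(input + 4);  // LZ4 payload size
  if (compressed_len != input_len - 8) return {};              // prefix does not match
  std::vector<char> output(original_len);
  // The raw LZ4 block starts right after the two 4-byte length words.
  const int n = LZ4_decompress_safe(reinterpret_cast<const char*>(input) + 8,
                                    output.data(), static_cast<int>(compressed_len),
                                    static_cast<int>(output.size()));
  if (n < 0) output.clear();  // corrupt data
  return output;
}
{noformat}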

Via gdb into the *parquet_reader* binary while reading a parquet-mr (1.10.0 
release) generated parquet file (with LZ4 compression), I could confirm that 
the compressed column page buffer indeed has the 8-byte prefix. 
{noformat}
# From gdb:
Breakpoint 1, arrow::Lz4Codec::Decompress (this=0xc1d5e0, input_len=109779, 
input=0x7665442e "", output_len=155352, output_buffer=0x7624b040 "")
at 
/opt/parquet-cpp/arrow_ep-prefix/src/arrow_ep/cpp/src/arrow/util/compression_lz4.cc:36
36       static_cast<int>(input_len), static_cast<int>(output_len));
Missing separate debuginfos, use: debuginfo-install glibc-2.17-222.el7.x86_64 
libgcc-4.8.5-28.el7_5.1.x86_64 libstdc++-4.8.5-28.el7_5.1.x86_64
(gdb) p/x *((uint32_t*)input+1)
$3 = 0xcbac0100

# From python (convert to little endian):
>>> 0x0001accb
109771 {noformat}
The result: *109771 = 109779 - 8*. And if I skip the first 8 bytes before 
decompressing, I get the correct column values.

It seems to me that Hadoop is unlikely to change this format (it has been there 
since 2011), so I'd like to propose a change like the one below, which tries to 
detect the Hadoop LZ4 format if the initial attempt fails:

 
{noformat}
diff --git a/cpp/src/arrow/util/compression_lz4.cc b/cpp/src/arrow/util/compression_lz4.cc
index 23a5c39..359066e 100644
--- a/cpp/src/arrow/util/compression_lz4.cc
+++ b/cpp/src/arrow/util/compression_lz4.cc
@@ -22,6 +22,7 @@
 #include <lz4.h>
 
 #include "arrow/status.h"
+#include "arrow/util/bit-util.h"
 #include "arrow/util/macros.h"
 
 namespace arrow {
@@ -31,13 +32,30 @@ namespace arrow {
 
 Status Lz4Codec::Decompress(int64_t input_len, const uint8_t* input, int64_t output_len,
                             uint8_t* output_buffer) {
-  int64_t decompressed_size = LZ4_decompress_safe(
+  int64_t decompressed_size;
+
+  // For hadoop lz4 compression format, the compressed data is prefixed
+  // with original data length (big-endian) and then compressed data
+  // length (big-endian).
+  //
+  // If the prefix could match the format, try to decompress from 'input + 8'.
+  if (BitUtil::FromBigEndian(*(reinterpret_cast<const uint32_t*>(input))) == output_len
+      && BitUtil::FromBigEndian(*(reinterpret_cast<const uint32_t*>(input) + 1)) == input_len - 8) {
+    decompressed_size = LZ4_decompress_safe(
+        reinterpret_cast<const char*>(input) + 8, reinterpret_cast<char*>(output_buffer),
+        static_cast<int>(input_len) - 8, static_cast<int>(output_len));
+    if (decompressed_size >= 0) {
+      return Status::OK();
+    }
+  // For normal lz4 compression, decompress the entire 'input'.
+  } else {
+    decompressed_size = LZ4_decompress_safe(
       reinterpret_cast<const char*>(input), reinterpret_cast<char*>(output_buffer),
       static_cast<int>(input_len), static_cast<int>(output_len));
-  if (decompressed_size < 0) {
-    return Status::IOError("Corrupt Lz4 compressed data.");
+    if (decompressed_size >= 0) {
+      return Status::OK();
   }
-  return Status::OK();
+  return Status::IOError("Corrupt Lz4 compressed data.");
 }
 
 int64_t Lz4Codec::MaxCompressedLen(int64_t input_len, 
{noformat}
Also, I think it would be useful to have a separate codec type for the LZ4 frame format.
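
For reference, a rough sketch of what such a dedicated LZ4-frame codec would build 
on: one-shot compression with the official lz4frame API (this is my own hedged 
illustration, not parquet-cpp code). The frame carries its own magic number and 
header, so a reader can recognize it without any out-of-band length prefix.
{noformat}
// Hedged sketch: compress one buffer with the official LZ4 *frame* format
// using <lz4frame.h>.
#include <lz4frame.h>

#include <cstdio>
#include <vector>

std::vector<char> CompressLz4Frame(const char* src, size_t src_len) {
  // Worst-case frame size for this input with default preferences.
  const size_t bound = LZ4F_compressFrameBound(src_len, /*preferencesPtr=*/nullptr);
  std::vector<char> dst(bound);
  const size_t written =
      LZ4F_compressFrame(dst.data(), dst.size(), src, src_len, nullptr);
  if (LZ4F_isError(written)) {
    std::fprintf(stderr, "lz4 frame error: %s\n", LZ4F_getErrorName(written));
    return {};
  }
  dst.resize(written);  // the frame starts with the magic number 0x184D2204
  return dst;
}
{noformat}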


was (Author: ee07b291):
Hi, I and [~wesmckinn] discussed the issue on parquet-cpp mailing list,

And upon further investigation I found from the hadoop Jira ticket 
(https://issues.apache.org/jira/browse/HADOOP-12990) that hadoop LZ4 format 
prefixes the compressed data with *original data length (big-endian)* and then 
*compressed data length (big-endian)*. 

Via gdb into the *parquet_reader* binary while reading a parquet-mr (1.10.0 
release) generated parquet file (with LZ4 compression), I could confirm that 
the compressed column page buffer indeed has the 8-byte prefix. 
{noformat}
# From gdb:
Breakpoint 1, arrow::Lz4Codec::Decompress (this=0xc1d5e0, input_len=109779, 
input=0x7665442e "", output_len=155352, output_buffer=0x7624b040 "")
at 
/opt/parquet-cpp/arrow_ep-prefix/src/arrow_ep/cpp/src/arrow/util/compression_lz4.cc:36
36 static_cast(input_len), static_cast(output_len));
Missing separate debuginfos, use: debuginfo-install glibc-2.17-222.el7.x86_64 
libgcc-4.8.5-28.el7_5.1.x86_64 libstdc++-4.8.5-28.el7_5.1.x86_64
(gdb) p/x *((uint32_t*)input+1)
$3 = 0xcbac0100

# From python (convert to little endian):
>>> 0x0001accb
109771 {noformat}
The result *109771 = 109779 - 8*.  And if I skipped the first 8 bytes and 
decompress, I could get correct column values.

[jira] [Comment Edited] (PARQUET-1241) Use LZ4 frame format

2018-08-09 Thread Alex Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16574328#comment-16574328
 ] 

Alex Wang edited comment on PARQUET-1241 at 8/9/18 6:03 AM:


Hi, [~wesmckinn] and I discussed this issue on the parquet-cpp mailing list.

Upon further investigation, I found from the Hadoop JIRA ticket 
(https://issues.apache.org/jira/browse/HADOOP-12990) that the Hadoop LZ4 format 
prefixes the compressed data with the *original data length (big-endian)* followed 
by the *compressed data length (big-endian)*.

Via gdb into the *parquet_reader* binary while reading a parquet-mr (1.10.0 
release) generated parquet file (with LZ4 compression), I could confirm that 
the compressed column page buffer indeed has the 8-byte prefix. 
{noformat}
# From gdb:
Breakpoint 1, arrow::Lz4Codec::Decompress (this=0xc1d5e0, input_len=109779, 
input=0x7665442e "", output_len=155352, output_buffer=0x7624b040 "")
at 
/opt/parquet-cpp/arrow_ep-prefix/src/arrow_ep/cpp/src/arrow/util/compression_lz4.cc:36
36       static_cast<int>(input_len), static_cast<int>(output_len));
Missing separate debuginfos, use: debuginfo-install glibc-2.17-222.el7.x86_64 
libgcc-4.8.5-28.el7_5.1.x86_64 libstdc++-4.8.5-28.el7_5.1.x86_64
(gdb) p/x *((uint32_t*)input+1)
$3 = 0xcbac0100

# From python (convert to little endian):
>>> 0x0001accb
109771 {noformat}
The result: *109771 = 109779 - 8*. And if I skip the first 8 bytes before 
decompressing, I get the correct column values.

It seems to me that Hadoop is unlikely to change this format (it has been there 
since 2011), so I'd like to propose a change like the one below, which tries to 
detect the Hadoop LZ4 format if the initial attempt fails:
{noformat}
diff --git a/cpp/src/arrow/util/compression_lz4.cc b/cpp/src/arrow/util/compression_lz4.cc
index 23a5c39..feeb124 100644
--- a/cpp/src/arrow/util/compression_lz4.cc
+++ b/cpp/src/arrow/util/compression_lz4.cc
@@ -22,6 +22,7 @@
 #include <lz4.h>
 
 #include "arrow/status.h"
+#include "arrow/util/bit-util.h"
 #include "arrow/util/macros.h"
 
 namespace arrow {
@@ -35,6 +36,19 @@ Status Lz4Codec::Decompress(int64_t input_len, const uint8_t* input, int64_t out
       reinterpret_cast<const char*>(input), reinterpret_cast<char*>(output_buffer),
       static_cast<int>(input_len), static_cast<int>(output_len));
   if (decompressed_size < 0) {
+    // For hadoop lz4 compression format, the compressed data is prefixed
+    // with original data length (big-endian) and then compressed data
+    // length (big-endian).
+    //
+    // If the prefix could match the format, try to decompress from 'input + 8'.
+    if (BitUtil::FromBigEndian(*(reinterpret_cast<const uint32_t*>(input) + 1)) == input_len - 8) {
+      decompressed_size = LZ4_decompress_safe(
+          reinterpret_cast<const char*>(input) + 8, reinterpret_cast<char*>(output_buffer),
+          static_cast<int>(input_len) - 8, static_cast<int>(output_len));
+      if (decompressed_size >= 0) {
+        return Status::OK();
+      }
+    }
     return Status::IOError("Corrupt Lz4 compressed data.");
   }
   return Status::OK();{noformat}
 

Also, I think it would be useful to have a separate codec type for the LZ4 frame format.


was (Author: ee07b291):
Hi, I and [~wesmckinn] discussed the issue on parquet-cpp mailing list,

And upon further investigation I found from the hadoop Jira ticket 
(https://issues.apache.org/jira/browse/HADOOP-12990) that hadoop LZ4 format 
prefixes the compressed data with *original data length (big-endian)* and then 
*compressed data length (big-endian)*. 

Via gdb into the *parquet_reader* binary while reading a parquet-mr (1.10.0 
release) generated parquet file (with LZ4 compression), I could confirm that 
the compressed column page buffer indeed has the 8-byte prefix. 
{noformat}
# From gdb:
Breakpoint 1, arrow::Lz4Codec::Decompress (this=0xc1d5e0, input_len=109779, 
input=0x7665442e "", output_len=155352, output_buffer=0x7624b040 "")
at 
/opt/parquet-cpp/arrow_ep-prefix/src/arrow_ep/cpp/src/arrow/util/compression_lz4.cc:36
36 static_cast(input_len), static_cast(output_len));
Missing separate debuginfos, use: debuginfo-install glibc-2.17-222.el7.x86_64 
libgcc-4.8.5-28.el7_5.1.x86_64 libstdc++-4.8.5-28.el7_5.1.x86_64
(gdb) p/x *((uint32_t*)input+1)
$3 = 0xcbac0100

# From python (convert to little endian):
>>> 0x0001accb
109771 {noformat}
The result *109771 = 109779 - 8*.  And if I skipped the first 8 bytes and 
decompress, I could get correct column values.

Seems to me that hadoop will not likely change this format (been there since 
2011), I'd like to propose changes like below, which tries to identify hadoop 
LZ4 format if the initial try failed:
{noformat}
diff --git a/cpp/src/arrow/util/compression_lz4.cc 
b/cpp/src/arrow/util/compression_lz4.cc
index 23a5c39..feeb124 100644
--- a/cpp/src/arrow/util/compression_lz4.cc
+++ b/cpp/src/arrow/util/compression_lz4.cc
@@ -22,6 +22,7 @@
 #include 

 #include "arrow/status.h"
+#include "arrow/util

Re: hadoop LZ4 incompatible with open source LZ4

2018-08-08 Thread ALeX Wang
@Wes

Okay, I think I figured out why I could not read the LZ4-encoded parquet file
generated by parquet-mr.

It turns out Hadoop LZ4 has its own framing format.

I summarized the details in the JIRA ticket you posted:
https://issues.apache.org/jira/browse/PARQUET-1241

Thanks,
Alex Wang,

On Tue, 7 Aug 2018 at 12:13, Wes McKinney  wrote:

> hi Alex,
>
> here's one thread I remember about this
>
> https://github.com/dask/fastparquet/issues/314#issuecomment-371629605
>
> and a relevant unresolved JIRA
>
> https://issues.apache.org/jira/browse/PARQUET-1241
>
> The first step to resolving this issue is to reconcile what mode of
> LZ4 the Parquet format is supposed to be using
>
> - Wes
>
>
> On Tue, Aug 7, 2018 at 2:10 PM, ALeX Wang  wrote:
> > Hi Wes,
> >
> > Just to share my understanding,
> >
> > In Arrow, my understanding is that it downloads the lz4 from
> > https://github.com/lz4/lz4 (via export
> > LZ4_STATIC_LIB=$ARROW_EP/lz4_ep-prefix/src/lz4_ep/lib/liblz4.a).  So it
> is
> > using the LZ4_FRAMED codec.  But hadoop is not using framed lz4.  So i'll
> > see if I could implement a CodecFactory handle for LZ4_FRAMED in
> parquet-mr,
> >
> > Thanks,
> >
> >
> > On Tue, 7 Aug 2018 at 08:50, Wes McKinney  wrote:
> >
> >> hi Alex,
> >>
> >> No, if you look at the implementation in
> >>
> >>
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/compression_lz4.cc#L32
> >> it is not using the same LZ4 compression style that Hadoop is using;
> >> realistically we need to add a bunch of options to Lz4Codec to be able
> >> to select what we want (or add LZ4_FRAMED codec). I'll have to dig in
> >> my e-mail to find the prior thread
> >>
> >> - Wes
> >>
> >> On Tue, Aug 7, 2018 at 11:45 AM, ALeX Wang  wrote:
> >> > Hi Wes,
> >> >
> >> > Are you talking about this ?
> >> >
> >>
> http://mail-archives.apache.org/mod_mbox/arrow-issues/201805.mbox/%3cjira.13158639.1526013615000.61149.1526228880...@atlassian.jira%3E
> >> >
> >> > I tried to compile with the latest arrow which contain this fix and
> still
> >> > encountered the corruption error.
> >> >
> >> > Also, we tried to read the file using pyparquet, and spark, did not
> work
> >> > either,
> >> >
> >> > Thanks,
> >> > Alex Wang,
> >> >
> >> >
> >> > On Tue, 7 Aug 2018 at 08:37, Wes McKinney 
> wrote:
> >> >
> >> >> hi Alex,
> >> >>
> >> >> I think there was an e-mail thread or JIRA about this, would have to
> >> >> dig it up. LZ4 compression was originally underspecified (has that
> >> >> been fixed) and we aren't using the correct compressor/decompressor
> >> >> options in parquet-cpp at the moment. If you have time to dig in and
> >> >> fix it, it would be much appreciated. Note that the LZ4 code lives in
> >> >> Apache Arrow
> >> >>
> >> >> - Wes
> >> >>
> >> >> On Tue, Aug 7, 2018 at 11:10 AM, ALeX Wang 
> wrote:
> >> >> > Hi,
> >> >> >
> >> >> > Would like to kindly confirm my observation,
> >> >> >
> >> >> > We use parquet-mr (java) to generate parquet file with LZ4
> >> compression.
> >> >> To
> >> >> > do this we have to compile/install hadoop native library with
> provides
> >> >> LZ4
> >> >> > codec.
> >> >> >
> >> >> > However, the generated parquet file, is not recognizable by
> >> >> parquet-cpp.  I
> >> >> > encountered following error when using the `tools/parquet_reader`
> >> binary,
> >> >> >
> >> >> > ```
> >> >> > Parquet error: Arrow error: IOError: Corrupt Lz4 compressed data.
> >> >> > ```
> >> >> >
> >> >> > Further search online get me to this JIRA ticket:
> >> >> > https://issues.apache.org/jira/browse/HADOOP-12990
> >> >> >
> >> >> > So, since hadoop LZ4 is incompatible with open source, parquet-mr
> lz4
> >> is
> >> >> > not compatible with parquet-cpp?
> >> >> >
> >> >> > Thanks,
> >> >> > --
> >> >> > Alex Wang,
> >> >> > Open vSwitch developer
> >> >>
> >> >
> >> >
> >> > --
> >> > Alex Wang,
> >> > Open vSwitch developer
> >>
> >
> >
> > --
> > Alex Wang,
> > Open vSwitch developer
>


-- 
Alex Wang,
Open vSwitch developer


[jira] [Comment Edited] (PARQUET-1241) Use LZ4 frame format

2018-08-08 Thread Alex Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16574328#comment-16574328
 ] 

Alex Wang edited comment on PARQUET-1241 at 8/9/18 5:57 AM:


Hi, [~wesmckinn] and I discussed this issue on the parquet-cpp mailing list.

Upon further investigation, I found from the Hadoop JIRA ticket 
(https://issues.apache.org/jira/browse/HADOOP-12990) that the Hadoop LZ4 format 
prefixes the compressed data with the *original data length (big-endian)* followed 
by the *compressed data length (big-endian)*.

Via gdb into the *parquet_reader* binary while reading a parquet-mr (1.10.0 
release) generated parquet file (with LZ4 compression), I could confirm that 
the compressed column page buffer indeed has the 8-byte prefix. 
{noformat}
# From gdb:
Breakpoint 1, arrow::Lz4Codec::Decompress (this=0xc1d5e0, input_len=109779, 
input=0x7665442e "", output_len=155352, output_buffer=0x7624b040 "")
at 
/opt/parquet-cpp/arrow_ep-prefix/src/arrow_ep/cpp/src/arrow/util/compression_lz4.cc:36
36       static_cast<int>(input_len), static_cast<int>(output_len));
Missing separate debuginfos, use: debuginfo-install glibc-2.17-222.el7.x86_64 
libgcc-4.8.5-28.el7_5.1.x86_64 libstdc++-4.8.5-28.el7_5.1.x86_64
(gdb) p/x *((uint32_t*)input+1)
$3 = 0xcbac0100

# From python (convert to little endian):
>>> 0x0001accb
109771 {noformat}
The result: *109771 = 109779 - 8*. And if I skip the first 8 bytes before 
decompressing, I get the correct column values.

It seems to me that Hadoop is unlikely to change this format (it has been there 
since 2011), so I'd like to propose a change like the one below, which tries to 
detect the Hadoop LZ4 format if the initial attempt fails:
{noformat}
diff --git a/cpp/src/arrow/util/compression_lz4.cc b/cpp/src/arrow/util/compression_lz4.cc
index 23a5c39..feeb124 100644
--- a/cpp/src/arrow/util/compression_lz4.cc
+++ b/cpp/src/arrow/util/compression_lz4.cc
@@ -22,6 +22,7 @@
 #include <lz4.h>
 
 #include "arrow/status.h"
+#include "arrow/util/bit-util.h"
 #include "arrow/util/macros.h"
 
 namespace arrow {
@@ -35,6 +36,19 @@ Status Lz4Codec::Decompress(int64_t input_len, const uint8_t* input, int64_t out
       reinterpret_cast<const char*>(input), reinterpret_cast<char*>(output_buffer),
       static_cast<int>(input_len), static_cast<int>(output_len));
   if (decompressed_size < 0) {
+    // For hadoop lz4 compression format, the compressed data is prefixed
+    // with original data length (big-endian) and then compressed data
+    // length (big-endian).
+    //
+    // If the prefix could match the format, try to decompress from 'input + 8'.
+    if (BitUtil::FromBigEndian(*(reinterpret_cast<const uint32_t*>(input) + 1)) == input_len - 8) {
+      decompressed_size = LZ4_decompress_safe(
+          reinterpret_cast<const char*>(input) + 8, reinterpret_cast<char*>(output_buffer),
+          static_cast<int>(input_len) - 8, static_cast<int>(output_len));
+      if (decompressed_size >= 0) {
+        return Status::OK();
+      }
+    }
     return Status::IOError("Corrupt Lz4 compressed data.");
   }
   return Status::OK();{noformat}
 


was (Author: ee07b291):
Hi, I and [~wesmckinn] discussed the issue on parquet-cpp mailing list,

And upon further investigation I found from the hadoop Jira ticket 
(https://issues.apache.org/jira/browse/HADOOP-12990) that hadoop LZ4 format 
prefixes the compressed data with *original data length (big-endian)* and then 
*compressed data length (big-endian)*. 

Via gdb into the *parquet_reader* binary while reading a parquet-mr (1.10.0 
release) generated parquet file (with LZ4 compression), I could confirm that 
the compressed column page buffer indeed has the 8-byte prefix. 
{noformat}
# From gdb:
Breakpoint 1, arrow::Lz4Codec::Decompress (this=0xc1d5e0, input_len=109779, 
input=0x7665442e "", output_len=155352, output_buffer=0x7624b040 "")
at 
/opt/parquet-cpp/arrow_ep-prefix/src/arrow_ep/cpp/src/arrow/util/compression_lz4.cc:36
36 static_cast(input_len), static_cast(output_len));
Missing separate debuginfos, use: debuginfo-install glibc-2.17-222.el7.x86_64 
libgcc-4.8.5-28.el7_5.1.x86_64 libstdc++-4.8.5-28.el7_5.1.x86_64
(gdb) p/x *((uint32_t*)input+1)
$3 = 0xcbac0100

# From python (convert to little endian):
>>> 0x0001accb
109771 {noformat}
The result *109771 = 109779 - 8*.  And if I skipped the first 8 bytes and 
decompress, I could get correct column values.

Seems to me that hadoop will not likely change this format (been there since 
2011), I'd like to propose changes like below:
{noformat}
diff --git a/cpp/src/arrow/util/compression_lz4.cc 
b/cpp/src/arrow/util/compression_lz4.cc
index 23a5c39..feeb124 100644
--- a/cpp/src/arrow/util/compression_lz4.cc
+++ b/cpp/src/arrow/util/compression_lz4.cc
@@ -22,6 +22,7 @@
 #include 

 #include "arrow/status.h"
+#include "arrow/util/bit-util.h"
 #include "arrow/util/macros.h"

 namespace arrow {
@@ -35,6 +36,19 @@ Status Lz4Codec::Decompress(int64_t input_l

[jira] [Comment Edited] (PARQUET-1241) Use LZ4 frame format

2018-08-08 Thread Alex Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16574328#comment-16574328
 ] 

Alex Wang edited comment on PARQUET-1241 at 8/9/18 5:56 AM:


Hi, [~wesmckinn] and I discussed this issue on the parquet-cpp mailing list.

Upon further investigation, I found from the Hadoop JIRA ticket 
(https://issues.apache.org/jira/browse/HADOOP-12990) that the Hadoop LZ4 format 
prefixes the compressed data with the *original data length (big-endian)* followed 
by the *compressed data length (big-endian)*.

Via gdb into the *parquet_reader* binary while reading a parquet-mr (1.10.0 
release) generated parquet file (with LZ4 compression), I could confirm that 
the compressed column page buffer indeed has the 8-byte prefix. 
{noformat}
# From gdb:
Breakpoint 1, arrow::Lz4Codec::Decompress (this=0xc1d5e0, input_len=109779, 
input=0x7665442e "", output_len=155352, output_buffer=0x7624b040 "")
at 
/opt/parquet-cpp/arrow_ep-prefix/src/arrow_ep/cpp/src/arrow/util/compression_lz4.cc:36
36       static_cast<int>(input_len), static_cast<int>(output_len));
Missing separate debuginfos, use: debuginfo-install glibc-2.17-222.el7.x86_64 
libgcc-4.8.5-28.el7_5.1.x86_64 libstdc++-4.8.5-28.el7_5.1.x86_64
(gdb) p/x *((uint32_t*)input+1)
$3 = 0xcbac0100

# From python (convert to little endian):
>>> 0x0001accb
109771 {noformat}
The result: *109771 = 109779 - 8*. And if I skip the first 8 bytes before 
decompressing, I get the correct column values.

It seems to me that Hadoop is unlikely to change this format (it has been there 
since 2011), so I'd like to propose a change like the one below:
{noformat}
diff --git a/cpp/src/arrow/util/compression_lz4.cc b/cpp/src/arrow/util/compression_lz4.cc
index 23a5c39..feeb124 100644
--- a/cpp/src/arrow/util/compression_lz4.cc
+++ b/cpp/src/arrow/util/compression_lz4.cc
@@ -22,6 +22,7 @@
 #include <lz4.h>
 
 #include "arrow/status.h"
+#include "arrow/util/bit-util.h"
 #include "arrow/util/macros.h"
 
 namespace arrow {
@@ -35,6 +36,19 @@ Status Lz4Codec::Decompress(int64_t input_len, const uint8_t* input, int64_t out
       reinterpret_cast<const char*>(input), reinterpret_cast<char*>(output_buffer),
       static_cast<int>(input_len), static_cast<int>(output_len));
   if (decompressed_size < 0) {
+    // For hadoop lz4 compression format, the compressed data is prefixed
+    // with original data length (big-endian) and then compressed data
+    // length (big-endian).
+    //
+    // If the prefix could match the format, try to decompress from 'input + 8'.
+    if (BitUtil::FromBigEndian(*(reinterpret_cast<const uint32_t*>(input) + 1)) == input_len - 8) {
+      decompressed_size = LZ4_decompress_safe(
+          reinterpret_cast<const char*>(input) + 8, reinterpret_cast<char*>(output_buffer),
+          static_cast<int>(input_len) - 8, static_cast<int>(output_len));
+      if (decompressed_size >= 0) {
+        return Status::OK();
+      }
+    }
     return Status::IOError("Corrupt Lz4 compressed data.");
   }
   return Status::OK();{noformat}
 


was (Author: ee07b291):
Hi, I and [~wesmckinn] discussed the issue on parquet-cpp mailing list,

 

And upon further investigation I found from the hadoop Jira ticket 
(https://issues.apache.org/jira/browse/HADOOP-12990) that hadoop LZ4 format 
prefixes the compressed data with *original data length (big-endian)* and then 
*compressed data length (big-endian)*.

 

Via gdb into the *parquet_reader* binary while reading a parquet-mr (1.10.0 
release) generated parquet file (with LZ4 compression), I could confirm that 
the compressed column page buffer indeed has the 8-byte prefix. 
{noformat}
# From gdb:
Breakpoint 1, arrow::Lz4Codec::Decompress (this=0xc1d5e0, input_len=109779, 
input=0x7665442e "", output_len=155352, output_buffer=0x7624b040 "")
at 
/opt/parquet-cpp/arrow_ep-prefix/src/arrow_ep/cpp/src/arrow/util/compression_lz4.cc:36
36 static_cast(input_len), static_cast(output_len));
Missing separate debuginfos, use: debuginfo-install glibc-2.17-222.el7.x86_64 
libgcc-4.8.5-28.el7_5.1.x86_64 libstdc++-4.8.5-28.el7_5.1.x86_64
(gdb) p/x *((uint32_t*)input+1)
$3 = 0xcbac0100

# From python (convert to little endian):
>>> 0x0001accb
109771 {noformat}
The result *109771 = 109779 - 8*.  And if I skipped the first 8 bytes and 
decompress, I could get correct column values.


 Seems to me that hadoop will not likely change this format (been there since 
2011), I'd like to propose changes like below:
  
{noformat}
diff --git a/cpp/src/arrow/util/compression_lz4.cc 
b/cpp/src/arrow/util/compression_lz4.cc
index 23a5c39..feeb124 100644
--- a/cpp/src/arrow/util/compression_lz4.cc
+++ b/cpp/src/arrow/util/compression_lz4.cc
@@ -22,6 +22,7 @@
 #include 

 #include "arrow/status.h"
+#include "arrow/util/bit-util.h"
 #include "arrow/util/macros.h"

 namespace arrow {
@@ -35,6 +36,19 @@ Status Lz4Codec::Decompress(int64_t input_len, const 
uint8_t* input, int64_t out
 reinterpret_cast(input),

[jira] [Comment Edited] (PARQUET-1241) Use LZ4 frame format

2018-08-08 Thread Alex Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16574328#comment-16574328
 ] 

Alex Wang edited comment on PARQUET-1241 at 8/9/18 5:56 AM:


Hi, [~wesmckinn] and I discussed this issue on the parquet-cpp mailing list.

Upon further investigation, I found from the Hadoop JIRA ticket 
(https://issues.apache.org/jira/browse/HADOOP-12990) that the Hadoop LZ4 format 
prefixes the compressed data with the *original data length (big-endian)* followed 
by the *compressed data length (big-endian)*.

 

Via gdb into the *parquet_reader* binary while reading a parquet-mr (1.10.0 
release) generated parquet file (with LZ4 compression), I could confirm that 
the compressed column page buffer indeed has the 8-byte prefix. 
{noformat}
# From gdb:
Breakpoint 1, arrow::Lz4Codec::Decompress (this=0xc1d5e0, input_len=109779, 
input=0x7665442e "", output_len=155352, output_buffer=0x7624b040 "")
at 
/opt/parquet-cpp/arrow_ep-prefix/src/arrow_ep/cpp/src/arrow/util/compression_lz4.cc:36
36       static_cast<int>(input_len), static_cast<int>(output_len));
Missing separate debuginfos, use: debuginfo-install glibc-2.17-222.el7.x86_64 
libgcc-4.8.5-28.el7_5.1.x86_64 libstdc++-4.8.5-28.el7_5.1.x86_64
(gdb) p/x *((uint32_t*)input+1)
$3 = 0xcbac0100

# From python (convert to little endian):
>>> 0x0001accb
109771 {noformat}
The result: *109771 = 109779 - 8*. And if I skip the first 8 bytes before 
decompressing, I get the correct column values.

It seems to me that Hadoop is unlikely to change this format (it has been there 
since 2011), so I'd like to propose a change like the one below:
  
{noformat}
diff --git a/cpp/src/arrow/util/compression_lz4.cc b/cpp/src/arrow/util/compression_lz4.cc
index 23a5c39..feeb124 100644
--- a/cpp/src/arrow/util/compression_lz4.cc
+++ b/cpp/src/arrow/util/compression_lz4.cc
@@ -22,6 +22,7 @@
 #include <lz4.h>
 
 #include "arrow/status.h"
+#include "arrow/util/bit-util.h"
 #include "arrow/util/macros.h"
 
 namespace arrow {
@@ -35,6 +36,19 @@ Status Lz4Codec::Decompress(int64_t input_len, const uint8_t* input, int64_t out
       reinterpret_cast<const char*>(input), reinterpret_cast<char*>(output_buffer),
       static_cast<int>(input_len), static_cast<int>(output_len));
   if (decompressed_size < 0) {
+    // For hadoop lz4 compression format, the compressed data is prefixed
+    // with original data length (big-endian) and then compressed data
+    // length (big-endian).
+    //
+    // If the prefix could match the format, try to decompress from 'input + 8'.
+    if (BitUtil::FromBigEndian(*(reinterpret_cast<const uint32_t*>(input) + 1)) == input_len - 8) {
+      decompressed_size = LZ4_decompress_safe(
+          reinterpret_cast<const char*>(input) + 8, reinterpret_cast<char*>(output_buffer),
+          static_cast<int>(input_len) - 8, static_cast<int>(output_len));
+      if (decompressed_size >= 0) {
+        return Status::OK();
+      }
+    }
     return Status::IOError("Corrupt Lz4 compressed data.");
   }
   return Status::OK();{noformat}
 


was (Author: ee07b291):
Hi, I and [~wesmckinn] discussed the issue on parquet-cpp mailing list,

 

And upon further investigation I found from the hadoop Jira ticket 
(https://issues.apache.org/jira/browse/HADOOP-12990) that hadoop LZ4 format 
prefixes the compressed data with *original data length (big-endian)* and then 
*compressed data length (big-endian)*.

 

Via gdb into the *parquet_reader* binary while reading a parquet-mr (1.10.0 
release) generated parquet file (with LZ4 compression), I could confirm that 
the compressed column page buffer indeed has the 8-byte prefix.

 

 
{noformat}
# From gdb:
Breakpoint 1, arrow::Lz4Codec::Decompress (this=0xc1d5e0, input_len=109779, 
input=0x7665442e "", output_len=155352, output_buffer=0x7624b040 "")
at 
/opt/parquet-cpp/arrow_ep-prefix/src/arrow_ep/cpp/src/arrow/util/compression_lz4.cc:36
36 static_cast(input_len), static_cast(output_len));
Missing separate debuginfos, use: debuginfo-install glibc-2.17-222.el7.x86_64 
libgcc-4.8.5-28.el7_5.1.x86_64 libstdc++-4.8.5-28.el7_5.1.x86_64
(gdb) p/x *((uint32_t*)input+1)
$3 = 0xcbac0100

# From python (convert to little endian):
>>> 0x0001accb
109771 {noformat}
 

The result *109771 = 109779 - 8*.  And if I skipped the first 8 bytes and 
decompress, I could get correct column values.

 

 
 
Seems to me that hadoop will not likely change this format (been there since 
2011), I'd like to propose changes like below:
 
{noformat}
diff --git a/cpp/src/arrow/util/compression_lz4.cc 
b/cpp/src/arrow/util/compression_lz4.cc
index 23a5c39..feeb124 100644
--- a/cpp/src/arrow/util/compression_lz4.cc
+++ b/cpp/src/arrow/util/compression_lz4.cc
@@ -22,6 +22,7 @@
 #include 

 #include "arrow/status.h"
+#include "arrow/util/bit-util.h"
 #include "arrow/util/macros.h"

 namespace arrow {
@@ -35,6 +36,19 @@ Status Lz4Codec::Decompress(int64_t input_len, const 
uint8_t* input, int64_t out
 r

[jira] [Commented] (PARQUET-1241) Use LZ4 frame format

2018-08-08 Thread Alex Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16574328#comment-16574328
 ] 

Alex Wang commented on PARQUET-1241:


Hi, [~wesmckinn] and I discussed this issue on the parquet-cpp mailing list.

Upon further investigation, I found from the Hadoop JIRA ticket 
(https://issues.apache.org/jira/browse/HADOOP-12990) that the Hadoop LZ4 format 
prefixes the compressed data with the *original data length (big-endian)* followed 
by the *compressed data length (big-endian)*.

 

Via gdb into the *parquet_reader* binary while reading a parquet-mr (1.10.0 
release) generated parquet file (with LZ4 compression), I could confirm that 
the compressed column page buffer indeed has the 8-byte prefix.

 

 
{noformat}
# From gdb:
Breakpoint 1, arrow::Lz4Codec::Decompress (this=0xc1d5e0, input_len=109779, 
input=0x7665442e "", output_len=155352, output_buffer=0x7624b040 "")
at 
/opt/parquet-cpp/arrow_ep-prefix/src/arrow_ep/cpp/src/arrow/util/compression_lz4.cc:36
36       static_cast<int>(input_len), static_cast<int>(output_len));
Missing separate debuginfos, use: debuginfo-install glibc-2.17-222.el7.x86_64 
libgcc-4.8.5-28.el7_5.1.x86_64 libstdc++-4.8.5-28.el7_5.1.x86_64
(gdb) p/x *((uint32_t*)input+1)
$3 = 0xcbac0100

# From python (convert to little endian):
>>> 0x0001accb
109771 {noformat}
 

The result: *109771 = 109779 - 8*. And if I skip the first 8 bytes before 
decompressing, I get the correct column values.

It seems to me that Hadoop is unlikely to change this format (it has been there 
since 2011), so I'd like to propose a change like the one below:
 
{noformat}
diff --git a/cpp/src/arrow/util/compression_lz4.cc b/cpp/src/arrow/util/compression_lz4.cc
index 23a5c39..feeb124 100644
--- a/cpp/src/arrow/util/compression_lz4.cc
+++ b/cpp/src/arrow/util/compression_lz4.cc
@@ -22,6 +22,7 @@
 #include <lz4.h>
 
 #include "arrow/status.h"
+#include "arrow/util/bit-util.h"
 #include "arrow/util/macros.h"
 
 namespace arrow {
@@ -35,6 +36,19 @@ Status Lz4Codec::Decompress(int64_t input_len, const uint8_t* input, int64_t out
       reinterpret_cast<const char*>(input), reinterpret_cast<char*>(output_buffer),
       static_cast<int>(input_len), static_cast<int>(output_len));
   if (decompressed_size < 0) {
+    // For hadoop lz4 compression format, the compressed data is prefixed
+    // with original data length (big-endian) and then compressed data
+    // length (big-endian).
+    //
+    // If the prefix could match the format, try to decompress from 'input + 8'.
+    if (BitUtil::FromBigEndian(*(reinterpret_cast<const uint32_t*>(input) + 1)) == input_len - 8) {
+      decompressed_size = LZ4_decompress_safe(
+          reinterpret_cast<const char*>(input) + 8, reinterpret_cast<char*>(output_buffer),
+          static_cast<int>(input_len) - 8, static_cast<int>(output_len));
+      if (decompressed_size >= 0) {
+        return Status::OK();
+      }
+    }
     return Status::IOError("Corrupt Lz4 compressed data.");
   }
   return Status::OK();{noformat}
 



Re: hadoop LZ4 incompatible with open source LZ4

2018-08-08 Thread ALeX Wang
Hi Wes,

Thanks again for the pointers.  During investigation I noticed this
possible bug,

The private variables 'min_' and 'max_' in the 'TypedRowGroupStatistics' class are
not initialized in the constructor.

I got an abort while trying to read a column using 'parquet_reader', and a gdb
breakpoint at the constructor shows that those variables are not initialized.

```
Breakpoint 1, parquet::TypedRowGroupStatistics<...>::TypedRowGroupStatistics (
    this=0xbafd48, schema=0xc17080, encoded_min="", encoded_max="",
    num_values=0, null_count=39160, distinct_count=0, has_min_max=true, pool=0xbab8e0)
    at /opt/gpdbbuild/parquet-cpp/src/parquet/statistics.cc:74
74        if (!encoded_min.empty()) {
(gdb) p min_
$37 = 116
(gdb) p max_
$38 = false
```
And since both 'encoded_min' and 'encoded_max' are empty strings, min_ and max_ are
never set...
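
For illustration, here is a minimal standalone repro of this class of bug
(hypothetical names, not the parquet-cpp class): a member that is only
conditionally assigned in the constructor stays indeterminate when the condition
is false, which matches what the gdb session above shows for min_ and max_.
```
#include <iostream>
#include <string>

struct Stats {
  explicit Stats(const std::string& encoded_min) {
    if (!encoded_min.empty()) {
      min_ = 42;  // only assigned when an encoded value is present
    }             // otherwise min_ is left indeterminate
  }
  int min_;  // no default member initializer and not always set in the constructor
};

int main() {
  Stats s("");                  // empty encoded_min, so min_ is never assigned
  std::cout << s.min_ << "\n";  // reading it is undefined behavior
  return 0;
}
```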


So, if this is a valid issue, I'm proposing the following fix:
```
diff --git a/src/parquet/statistics.cc b/src/parquet/statistics.cc
index ea7f783..5d61bc9 100644
--- a/src/parquet/statistics.cc
+++ b/src/parquet/statistics.cc
@@ -65,6 +65,8 @@ TypedRowGroupStatistics<DType>::TypedRowGroupStatistics(
     : pool_(pool),
       min_buffer_(AllocateBuffer(pool_, 0)),
       max_buffer_(AllocateBuffer(pool_, 0)) {
+  using T = typename DType::c_type;
+
   IncrementNumValues(num_values);
   IncrementNullCount(null_count);
   IncrementDistinctCount(distinct_count);
@@ -73,9 +75,13 @@ TypedRowGroupStatistics<DType>::TypedRowGroupStatistics(
 
   if (!encoded_min.empty()) {
     PlainDecode(encoded_min, &min_);
+  } else {
+    min_ = T();
   }
   if (!encoded_max.empty()) {
     PlainDecode(encoded_max, &max_);
+  } else {
+    max_ = T();
   }
   has_min_max_ = has_min_max;
 }
```

Thanks,

On Tue, 7 Aug 2018 at 12:13, Wes McKinney  wrote:

> hi Alex,
>
> here's one thread I remember about this
>
> https://github.com/dask/fastparquet/issues/314#issuecomment-371629605
>
> and a relevant unresolved JIRA
>
> https://issues.apache.org/jira/browse/PARQUET-1241
>
> The first step to resolving this issue is to reconcile what mode of
> LZ4 the Parquet format is supposed to be using
>
> - Wes
>
>
> On Tue, Aug 7, 2018 at 2:10 PM, ALeX Wang  wrote:
> > Hi Wes,
> >
> > Just to share my understanding,
> >
> > In Arrow, my understanding is that it downloads the lz4 from
> > https://github.com/lz4/lz4 (via export
> > LZ4_STATIC_LIB=$ARROW_EP/lz4_ep-prefix/src/lz4_ep/lib/liblz4.a).  So it
> is
> > using the LZ4_FRAMED codec.  But hadoop is not using framed lz4.  So i'll
> > see if I could implement a CodecFactory handle for LZ4_FRAMED in
> parquet-mr,
> >
> > Thanks,
> >
> >
> > On Tue, 7 Aug 2018 at 08:50, Wes McKinney  wrote:
> >
> >> hi Alex,
> >>
> >> No, if you look at the implementation in
> >>
> >>
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/compression_lz4.cc#L32
> >> it is not using the same LZ4 compression style that Hadoop is using;
> >> realistically we need to add a bunch of options to Lz4Codec to be able
> >> to select what we want (or add LZ4_FRAMED codec). I'll have to dig in
> >> my e-mail to find the prior thread
> >>
> >> - Wes
> >>
> >> On Tue, Aug 7, 2018 at 11:45 AM, ALeX Wang  wrote:
> >> > Hi Wes,
> >> >
> >> > Are you talking about this ?
> >> >
> >>
> http://mail-archives.apache.org/mod_mbox/arrow-issues/201805.mbox/%3cjira.13158639.1526013615000.61149.1526228880...@atlassian.jira%3E
> >> >
> >> > I tried to compile with the latest arrow which contain this fix and
> still
> >> > encountered the corruption error.
> >> >
> >> > Also, we tried to read the file using pyparquet, and spark, did not
> work
> >> > either,
> >> >
> >> > Thanks,
> >> > Alex Wang,
> >> >
> >> >
> >> > On Tue, 7 Aug 2018 at 08:37, Wes McKinney 
> wrote:
> >> >
> >> >> hi Alex,
> >> >>
> >> >> I think there was an e-mail thread or JIRA about this, would have to
> >> >> dig it up. LZ4 compression was originally underspecified (has that
> >> >> been fixed) and we aren't using the correct compressor/decompressor
> >> >> options in parquet-cpp at the moment. If you have time to dig in and
> >> >> fix it, it would be much appreciated. Note that the LZ4 code lives in
> >> >> Apache Arrow
> >> >>
> >> >> - Wes
> >> >>
> >> >> On Tue, Aug 7, 2018 at 11:10 AM, ALeX Wang 
> 

Re: hadoop LZ4 incompatible with open source LZ4

2018-08-07 Thread ALeX Wang
Hi Wes,

Just to share my understanding,

In Arrow, my understanding is that it downloads the lz4 from
https://github.com/lz4/lz4 (via export
LZ4_STATIC_LIB=$ARROW_EP/lz4_ep-prefix/src/lz4_ep/lib/liblz4.a).  So it is
using the LZ4_FRAMED codec.  But Hadoop is not using framed LZ4, so I'll
see if I can implement a CodecFactory handler for LZ4_FRAMED in parquet-mr.

Thanks,


On Tue, 7 Aug 2018 at 08:50, Wes McKinney  wrote:

> hi Alex,
>
> No, if you look at the implementation in
>
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/compression_lz4.cc#L32
> it is not using the same LZ4 compression style that Hadoop is using;
> realistically we need to add a bunch of options to Lz4Codec to be able
> to select what we want (or add LZ4_FRAMED codec). I'll have to dig in
> my e-mail to find the prior thread
>
> - Wes
>
> On Tue, Aug 7, 2018 at 11:45 AM, ALeX Wang  wrote:
> > Hi Wes,
> >
> > Are you talking about this ?
> >
> http://mail-archives.apache.org/mod_mbox/arrow-issues/201805.mbox/%3cjira.13158639.1526013615000.61149.1526228880...@atlassian.jira%3E
> >
> > I tried to compile with the latest arrow which contain this fix and still
> > encountered the corruption error.
> >
> > Also, we tried to read the file using pyparquet, and spark, did not work
> > either,
> >
> > Thanks,
> > Alex Wang,
> >
> >
> > On Tue, 7 Aug 2018 at 08:37, Wes McKinney  wrote:
> >
> >> hi Alex,
> >>
> >> I think there was an e-mail thread or JIRA about this, would have to
> >> dig it up. LZ4 compression was originally underspecified (has that
> >> been fixed) and we aren't using the correct compressor/decompressor
> >> options in parquet-cpp at the moment. If you have time to dig in and
> >> fix it, it would be much appreciated. Note that the LZ4 code lives in
> >> Apache Arrow
> >>
> >> - Wes
> >>
> >> On Tue, Aug 7, 2018 at 11:10 AM, ALeX Wang  wrote:
> >> > Hi,
> >> >
> >> > Would like to kindly confirm my observation,
> >> >
> >> > We use parquet-mr (java) to generate parquet file with LZ4
> compression.
> >> To
> >> > do this we have to compile/install hadoop native library with provides
> >> LZ4
> >> > codec.
> >> >
> >> > However, the generated parquet file, is not recognizable by
> >> parquet-cpp.  I
> >> > encountered following error when using the `tools/parquet_reader`
> binary,
> >> >
> >> > ```
> >> > Parquet error: Arrow error: IOError: Corrupt Lz4 compressed data.
> >> > ```
> >> >
> >> > Further search online get me to this JIRA ticket:
> >> > https://issues.apache.org/jira/browse/HADOOP-12990
> >> >
> >> > So, since hadoop LZ4 is incompatible with open source, parquet-mr lz4
> is
> >> > not compatible with parquet-cpp?
> >> >
> >> > Thanks,
> >> > --
> >> > Alex Wang,
> >> > Open vSwitch developer
> >>
> >
> >
> > --
> > Alex Wang,
> > Open vSwitch developer
>


-- 
Alex Wang,
Open vSwitch developer


Re: hadoop LZ4 incompatible with open source LZ4

2018-08-07 Thread ALeX Wang
Hi Wes,

Are you talking about this ?
http://mail-archives.apache.org/mod_mbox/arrow-issues/201805.mbox/%3cjira.13158639.1526013615000.61149.1526228880...@atlassian.jira%3E

I tried to compile with the latest Arrow, which contains this fix, and still
encountered the corruption error.

Also, we tried to read the file using pyparquet and Spark; neither worked.

Thanks,
Alex Wang,


On Tue, 7 Aug 2018 at 08:37, Wes McKinney  wrote:

> hi Alex,
>
> I think there was an e-mail thread or JIRA about this, would have to
> dig it up. LZ4 compression was originally underspecified (has that
> been fixed) and we aren't using the correct compressor/decompressor
> options in parquet-cpp at the moment. If you have time to dig in and
> fix it, it would be much appreciated. Note that the LZ4 code lives in
> Apache Arrow
>
> - Wes
>
> On Tue, Aug 7, 2018 at 11:10 AM, ALeX Wang  wrote:
> > Hi,
> >
> > Would like to kindly confirm my observation,
> >
> > We use parquet-mr (java) to generate parquet file with LZ4 compression.
> To
> > do this we have to compile/install hadoop native library with provides
> LZ4
> > codec.
> >
> > However, the generated parquet file, is not recognizable by
> parquet-cpp.  I
> > encountered following error when using the `tools/parquet_reader` binary,
> >
> > ```
> > Parquet error: Arrow error: IOError: Corrupt Lz4 compressed data.
> > ```
> >
> > Further search online get me to this JIRA ticket:
> > https://issues.apache.org/jira/browse/HADOOP-12990
> >
> > So, since hadoop LZ4 is incompatible with open source, parquet-mr lz4 is
> > not compatible with parquet-cpp?
> >
> > Thanks,
> > --
> > Alex Wang,
> > Open vSwitch developer
>


-- 
Alex Wang,
Open vSwitch developer


hadoop LZ4 incompatible with open source LZ4

2018-08-07 Thread ALeX Wang
Hi,

I would like to kindly confirm my observation:

We use parquet-mr (Java) to generate parquet files with LZ4 compression.  To
do this we have to compile/install the Hadoop native library, which provides the
LZ4 codec.

However, the generated parquet file is not recognizable by parquet-cpp.  I
encountered the following error when using the `tools/parquet_reader` binary:

```
Parquet error: Arrow error: IOError: Corrupt Lz4 compressed data.
```

Further searching online got me to this JIRA ticket:
https://issues.apache.org/jira/browse/HADOOP-12990

So, since Hadoop LZ4 is incompatible with the open-source LZ4, parquet-mr LZ4
output is not compatible with parquet-cpp?

Thanks,
-- 
Alex Wang,
Open vSwitch developer


Re: Small malloc at file open and metadata parsing

2018-07-30 Thread ALeX Wang
Thanks for the quick reply @Wes,

Too bad this is causing a lot of delays (due to page fault handling) for
light queries (ones that query only a few rows/columns).

I will try jemalloc and see.

One more question: when I upgrade to 1.4.0 or later code, and use the same
cmake options and environment, OpenFile results in a segfault:

```
awake@ev003:/tmp$ cat tmpfile
(gdb) where
#0  0x7fc542eebc3c in free () from /lib64/libc.so.6
#1  0x00f13cb1 in arrow::DefaultMemoryPool::Free (this=0x16e71e0
, buffer=0x7fc52f425040
, size=616512)
at
/opt/parquet-cpp/arrow_ep-prefix/src/arrow_ep/cpp/src/arrow/memory_pool.cc:147
#2  0x00f117b6 in arrow::PoolBuffer::~PoolBuffer (this=0x34b5fb8,
__in_chrg=) at
/opt/parquet-cpp/arrow_ep-prefix/src/arrow_ep/cpp/src/arrow/buffer.cc:70
#3  0x00e364b7 in
__gnu_cxx::new_allocator::destroy
(this=0x34b5fb0, __p=0x34b5fb8) at
/usr/include/c++/4.8.2/ext/new_allocator.h:124
#4  0x00e35e10 in
std::allocator_traits
>::_S_destroy (__a=..., __p=0x34b5fb8) at
/usr/include/c++/4.8.2/bits/alloc_traits.h:281
#5  0x00e34ea3 in
std::allocator_traits
>::destroy (__a=..., __p=0x34b5fb8) at
/usr/include/c++/4.8.2/bits/alloc_traits.h:405
#6  0x00e33f01 in std::_Sp_counted_ptr_inplace, (__gnu_cxx::_Lock_policy)2>::_M_dispose
(this=0x34b5fa0) at /usr/include/c++/4.8.2/bits/shared_ptr_base.h:407
#7  0x00e27748 in
std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release
(this=0x34b5fa0) at /usr/include/c++/4.8.2/bits/shared_ptr_base.h:144
#8  0x00e255bb in
std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count
(this=0x7ffea5fffc88, __in_chrg=) at
/usr/include/c++/4.8.2/bits/shared_ptr_base.h:546
#9  0x00e23eae in std::__shared_ptr::~__shared_ptr (this=0x7ffea5fffc80,
__in_chrg=) at
/usr/include/c++/4.8.2/bits/shared_ptr_base.h:781
#10 0x00e23ec8 in std::shared_ptr::~shared_ptr
(this=0x7ffea5fffc80, __in_chrg=) at
/usr/include/c++/4.8.2/bits/shared_ptr.h:93
#11 0x00e875a4 in parquet::SerializedFile::ParseMetaData
(this=0x34b5f60) at /opt/parquet-cpp/src/parquet/file_reader.cc:213
#12 0x00e858d4 in parquet::ParquetFileReader::Contents::Open
(source=std::unique_ptr containing 0x0,
props=..., metadata=std::shared_ptr (empty) 0x0) at
/opt/parquet-cpp/src/parquet/file_reader.cc:247
---Type  to continue, or q  to quit---
#13 0x00e85a6f in parquet::ParquetFileReader::Open
(source=std::unique_ptr containing 0x0,
props=..., metadata=std::shared_ptr (empty) 0x0) at
/opt/parquet-cpp/src/parquet/file_reader.cc:265
#14 0x00e859ba in parquet::ParquetFileReader::Open
(source=std::shared_ptr (count 2, weak 0) 0x34b5e50, props=...,
metadata=std::shared_ptr (empty) 0x0) at
/opt/parquet-cpp/src/parquet/file_reader.cc:259
#15 0x00e85df4 in parquet::ParquetFileReader::OpenFile
(path="/data-slow/data0/test_parquet_file/seg0/0_1530129731023-1530136801030",
memory_map=false, props=..., metadata=std::shared_ptr (empty) 0x0) at
/opt/parquet-cpp/src/parquet/file_reader.cc:287
```

Is this a known issue?

Thanks,
Alex Wang,



On Mon, Jul 30, 2018, 11:22 AM Wes McKinney  wrote:

> hi Alex,
>
> It looks like the mallocs are coming from Thrift
> (parquet/parquet_types.cpp is generated by Thrift). I'm not sure if we
> can do much about this. I'm curious if it's possible to pass a custom
> STL allocator to Thrift so we could use a different allocation
> strategy than the default STL allocator
>
> - Wes
>
> On Mon, Jul 30, 2018 at 1:54 PM, ALeX Wang  wrote:
> > Hi,
> >
> > I'm reading parquet file (generated by Java parquet library).  Our schema
> > has 400 columns (including non-array elements, 1-dimensional array
> > elements).
> >
> > I'm using release 1.3.1, gcc 4.8.5, boost static library 1.53,
> >
> > I compile parquet-cpp with following cmake options,
> > ```
> > cmake3-DCMAKE_BUILD_TYPE=Debug -DPARQUET_BUILD_EXAMPLES=OFF
> >  -DPARQUET_BUILD_TESTS=OFF -DPARQUET_ARROW_LINKAGE="static"
> >  -DPARQUET_BUILD_SHARED=OFF -DPARQUET_BOOST_USE_SHARED=OFF .
> > ```
> >
> > One thing we noticed is that the cpp library conducts a lot of small
> > mallocs during the open file and the reading metadata phases...  shown
> > below:
> >
> > ```
> > (gdb) where
> > #0  0x7fdf40594801 in malloc () from /lib64/libc.so.6
> > #1  0x7fdf40e52ecd in operator new(unsigned long) () from
> > /lib64/libstdc++.so.6
> > #2  0x00ea16c0 in __gnu_cxx::new_allocator::allocate
> > (this=0x33e6930, __n=3) at /usr/include/c++/4.8.2/ext/new_allocator.h:104
> > #3  0x00e9eabb in std::_Vector_base > std::allocator >::_M_allocate (this=0x33e6930, __n=3) at
> > /usr/include/c++/4.8.2/bits/stl_vector.h:168
> > #4  0x00ecf512 in s

Small malloc at file open and metadata parsing

2018-07-30 Thread ALeX Wang
Hi,

I'm reading a parquet file (generated by the Java parquet library).  Our schema
has 400 columns (including non-array elements and 1-dimensional array
elements).

I'm using release 1.3.1, gcc 4.8.5, boost static library 1.53,

I compile parquet-cpp with the following cmake options:
```
cmake3 -DCMAKE_BUILD_TYPE=Debug -DPARQUET_BUILD_EXAMPLES=OFF \
  -DPARQUET_BUILD_TESTS=OFF -DPARQUET_ARROW_LINKAGE="static" \
  -DPARQUET_BUILD_SHARED=OFF -DPARQUET_BOOST_USE_SHARED=OFF .
```

One thing we noticed is that the cpp library performs a lot of small
mallocs during the file-open and metadata-reading phases, as shown
below:

```
(gdb) where
#0  0x7fdf40594801 in malloc () from /lib64/libc.so.6
#1  0x7fdf40e52ecd in operator new(unsigned long) () from
/lib64/libstdc++.so.6
#2  0x00ea16c0 in __gnu_cxx::new_allocator::allocate
(this=0x33e6930, __n=3) at /usr/include/c++/4.8.2/ext/new_allocator.h:104
#3  0x00e9eabb in std::_Vector_base >::_M_allocate (this=0x33e6930, __n=3) at
/usr/include/c++/4.8.2/bits/stl_vector.h:168
#4  0x00ecf512 in std::vector >::_M_default_append (this=0x33e6930, __n=3) at
/usr/include/c++/4.8.2/bits/vector.tcc:549
#5  0x00eca887 in std::vector >::resize (this=0x33e6930, __new_size=3) at
/usr/include/c++/4.8.2/bits/stl_vector.h:667
#6  0x00ebd589 in parquet::format::ColumnMetaData::read
(this=0x33e6908, iprot=0x3337300) at
/opt/parquet-cpp/src/parquet/parquet_types.cpp:3845
#7  0x00ebf9ed in parquet::format::ColumnChunk::read
(this=0x33e68f0, iprot=0x3337300) at
/opt/parquet-cpp/src/parquet/parquet_types.cpp:4246
#8  0x00ec0cd2 in parquet::format::RowGroup::read (this=0x33cf7c0,
iprot=0x3337300) at /opt/parquet-cpp/src/parquet/parquet_types.cpp:4451
#9  0x00ec4e22 in parquet::format::FileMetaData::read
(this=0x3337270, iprot=0x3337300) at
/opt/parquet-cpp/src/parquet/parquet_types.cpp:5385
#10 0x00e9364d in
parquet::DeserializeThriftMsg
(buf=0x7fdf2cace040 "\025\002\031\374\313\004H\bsessions\025\374\005",
len=0x7ffc8c96ff34, deserialized_msg=0x3337270) at
/opt/parquet-cpp/src/parquet/thrift.h:119
#11 0x00e8fda5 in
parquet::FileMetaData::FileMetaDataImpl::FileMetaDataImpl (this=0x3302fb0,
metadata=0x7fdf2cace040 "\025\002\031\374\313\004H\bsessions\025\374\005",
metadata_len=0x7ffc8c96ff34) at /opt/parquet-cpp/src/parquet/metadata.cc:303
#12 0x00e8bf4f in parquet::FileMetaData::FileMetaData
(this=0x31a4ca0, metadata=0x7fdf2cace040
"\025\002\031\374\313\004H\bsessions\025\374\005",
metadata_len=0x7ffc8c96ff34) at /opt/parquet-cpp/src/parquet/metadata.cc:403
#13 0x00e8bee3 in parquet::FileMetaData::Make
(metadata=0x7fdf2cace040 "\025\002\031\374\313\004H\bsessions\025\374\005",
metadata_len=0x7ffc8c96ff34) at /opt/parquet-cpp/src/parquet/metadata.cc:398
#14 0x00e87572 in parquet::SerializedFile::ParseMetaData
(this=0x3241450) at /opt/parquet-cpp/src/parquet/file_reader.cc:213
#15 0x00e858d4 in parquet::ParquetFileReader::Contents::Open
(source=std::unique_ptr containing 0x0,
props=..., metadata=std::shared_ptr (empty) 0x0) at
/opt/parquet-cpp/src/parquet/file_reader.cc:247
#16 0x00e85a6f in parquet::ParquetFileReader::Open
(source=std::unique_ptr containing 0x0,
props=..., metadata=std::shared_ptr (empty) 0x0) at
/opt/parquet-cpp/src/parquet/file_reader.cc:265
#17 0x00e859ba in parquet::ParquetFileReader::Open
(source=std::shared_ptr (count 2, weak 0) 0x32e2e80, props=...,
metadata=std::shared_ptr (empty) 0x0) at
/opt/parquet-cpp/src/parquet/file_reader.cc:259
#18 0x00e85df4 in parquet::ParquetFileReader::OpenFile
(path="/data-slow/data0/test_parquet_file/seg0/0_1530129731023-1530136801030",
memory_map=false, props=..., metadata=std::shared_ptr (empty) 0x0) at
/opt/parquet-cpp/src/parquet/file_reader.cc:287

(gdb) info br
Num     Type           Disp Enb Address            What
1   breakpoint keep y   
breakpoint already hit 2679 times
ignore next 2321 hits
```

I set the breakpoint on `malloc`, as shown above.

This seems to be the case regardless of the mmap option.

Would really appreciate some pointers on how to avoid this.
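For what it's worth, here is a minimal sketch of the reader path that triggers
these allocations (assuming the parquet-cpp 1.3.x API; the file path is a
placeholder). Opening the file is what deserializes the Thrift footer into
`FileMetaData`, and the `Open`/`OpenFile` overloads visible in the backtrace
also accept an already-parsed `FileMetaData`, which might let the footer be
parsed once and reused:

```
// Sketch only: open a parquet file and touch its metadata. The small mallocs
// in the backtrace happen inside ParseMetaData/DeserializeThriftMsg, i.e.
// while this OpenFile call is deserializing the Thrift footer.
#include <iostream>
#include <memory>
#include <string>
#include "parquet/api/reader.h"

int main() {
  const std::string path = "example.parquet";  // placeholder path

  std::unique_ptr<parquet::ParquetFileReader> reader =
      parquet::ParquetFileReader::OpenFile(path, /*memory_map=*/false);

  std::shared_ptr<parquet::FileMetaData> md = reader->metadata();
  std::cout << "row groups: " << md->num_row_groups()
            << ", columns: " << md->num_columns() << std::endl;

  // If the same file is opened repeatedly, `md` could be passed back into
  // ParquetFileReader::OpenFile/Open (the last parameter in the backtrace)
  // to avoid re-parsing the footer each time.
  return 0;
}
```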

Thanks,
Alex Wang,

-- 
Alex Wang,
Open vSwitch developer


Re: Question about my use case.

2018-03-14 Thread ALeX Wang
Hi Ryan,

Thanks for the reply,

We are using Samza for streaming.

Regarding Parquet Java, I must not have used the APIs correctly: last time
we tried, 7 Hadoop processes were spawned for writing to a single file, and
it was much slower than our parquet-cpp alternative.

Thanks,


On 14 March 2018 at 09:06, Ryan Blue <rb...@netflix.com.invalid> wrote:

> Hi Alex,
>
> I don't think what you're trying to do makes sense. If you're using Scala,
> then your data is already in the JVM and it is probably much easier to
> write it to Parquet using the Java library. While that library depends on
> Hadoop, you don't have to use it with HDFS. The Hadoop FileSystem interface
> can be used to write directly to local disk or a number of other stores,
> like S3. Using the Java library would allow you to write the data directly,
> instead of translating to Arrow first.
>
> Since you want to use Scala, then the easiest way to get this support is
> probably to write using Spark, which has most of what you need ready to go.
> If you're using a different streaming system you might not want both. What
> are you using?
>
> rb
>
> On Tue, Mar 13, 2018 at 6:11 PM, ALeX Wang <ee07b...@gmail.com> wrote:
>
> > Also could i get a pointer to example that write parquet file from arrow
> > memory buffer directly?
> >
> > The part i'm currently missing is how to derive the repetition level and
> > definition level@@
> >
> > Thanks,
> >
> > On 13 March 2018 at 17:52, ALeX Wang <ee07b...@gmail.com> wrote:
> >
> > > hi,
> > >
> > > i know it is may not be the best place to ask but would like to try
> > > anyways, as it is quite hard for me to find good example of this
> online.
> > >
> > > My usecase:
> > >
> > > i'd like to generate from streaming data (using Scala) into arrow
> format
> > > in memory mapped file and then have my parquet-cpp program writing it
> as
> > > parquet file to disk.
> > >
> > > my understanding is that java parquet only implements HDFS writer,
> which
> > > is not my use case (not using hadoop) and parquet-cpp is much more
> > > succinct.
> > >
> > > My question:
> > >
> > > does my usecase make sense? or if there is better way?
> > >
> > > Thanks,
> > > --
> > > Alex Wang,
> > > Open vSwitch developer
> > >
> >
> >
> >
> > --
> > Alex Wang,
> > Open vSwitch developer
> >
>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>



-- 
Alex Wang,
Open vSwitch developer


Re: Question about my use case.

2018-03-13 Thread ALeX Wang
Also, could I get a pointer to an example that writes a parquet file
directly from an arrow memory buffer?

The part I'm currently missing is how to derive the repetition levels and
definition levels.

Thanks,

On 13 March 2018 at 17:52, ALeX Wang <ee07b...@gmail.com> wrote:

> hi,
>
> i know it is may not be the best place to ask but would like to try
> anyways, as it is quite hard for me to find good example of this online.
>
> My usecase:
>
> i'd like to generate from streaming data (using Scala) into arrow format
> in memory mapped file and then have my parquet-cpp program writing it as
> parquet file to disk.
>
> my understanding is that java parquet only implements HDFS writer, which
> is not my use case (not using hadoop) and parquet-cpp is much more
> succinct.
>
> My question:
>
> does my usecase make sense? or if there is better way?
>
> Thanks,
> --
> Alex Wang,
> Open vSwitch developer
>



-- 
Alex Wang,
Open vSwitch developer


Question about my use case.

2018-03-13 Thread ALeX Wang
Hi,

I know this may not be the best place to ask, but I would like to try
anyway, as it is quite hard for me to find a good example of this online.

My use case:

I'd like to turn streaming data (using Scala) into Arrow format in a
memory-mapped file, and then have my parquet-cpp program write it to disk
as a parquet file.

My understanding is that Java parquet only implements an HDFS writer, which
does not fit my use case (not using Hadoop), and parquet-cpp is much more
succinct.

My question:

Does my use case make sense, or is there a better way?
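For the parquet-cpp side, here is a rough sketch of the Arrow-based writer
path (hedged: it assumes a `std::shared_ptr<arrow::Table>` named `table`
already built from the memory-mapped Arrow data, and the exact signatures
shifted between releases). The nice part is that `parquet::arrow::WriteTable`
derives the repetition/definition levels from the Arrow schema, so they do
not have to be computed by hand:

```
// Sketch under the assumptions above: write an existing Arrow table to a
// parquet file through parquet-cpp's Arrow layer.
#include <memory>
#include <string>
#include "arrow/io/file.h"
#include "arrow/memory_pool.h"
#include "arrow/status.h"
#include "arrow/table.h"
#include "parquet/arrow/writer.h"

arrow::Status WriteToParquet(const std::shared_ptr<arrow::Table>& table,
                             const std::string& path) {
  std::shared_ptr<arrow::io::FileOutputStream> out;
  arrow::Status st = arrow::io::FileOutputStream::Open(path, &out);
  if (!st.ok()) {
    return st;
  }
  // chunk_size = number of rows per row group written to the file.
  const int64_t chunk_size = 1 << 20;
  return parquet::arrow::WriteTable(*table, arrow::default_memory_pool(), out,
                                    chunk_size);
}
```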

Thanks,
-- 
Alex Wang,
Open vSwitch developer


Recommended rowgroup size, and number of row groups for large table

2018-01-12 Thread ALeX Wang
Hi,

I'm using parquet to store a big table (400+ columns), and most of the
columns will be null.

Is there a recommended row group size and number of row groups per parquet
file for my use case?  Or is there any reference/paper that I could read
myself?
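Not an authoritative answer, but two knobs worth noting: with parquet-cpp's
Arrow writer, the number of rows per row group is the `chunk_size` argument
to `WriteTable`, and page size/compression come from `WriterProperties`. The
parquet docs have generally suggested fairly large row groups for scan-heavy
workloads, but for a 400-column, mostly-null table it is probably worth
benchmarking a few settings. A hedged sketch (the numbers are placeholders,
not recommendations, and `table`/`out` are assumed to exist as in the writer
sketch earlier in this archive):

```
// Illustration only: how row-group size and writer properties are expressed
// in parquet-cpp's Arrow writer. Values are placeholders to experiment with.
#include <memory>
#include "arrow/io/interfaces.h"
#include "arrow/memory_pool.h"
#include "arrow/status.h"
#include "arrow/table.h"
#include "parquet/arrow/writer.h"
#include "parquet/properties.h"

arrow::Status WriteWithRowGroups(
    const std::shared_ptr<arrow::Table>& table,
    const std::shared_ptr<arrow::io::OutputStream>& out) {
  parquet::WriterProperties::Builder builder;
  builder.compression(parquet::Compression::SNAPPY)  // default for all columns
      ->data_pagesize(1024 * 1024);                   // ~1 MB data pages
  std::shared_ptr<parquet::WriterProperties> props = builder.build();

  // Each row group holds at most this many rows; fewer, larger row groups
  // generally favour scans, while smaller ones favour selective reads.
  const int64_t rows_per_row_group = 1000 * 1000;
  return parquet::arrow::WriteTable(*table, arrow::default_memory_pool(), out,
                                    rows_per_row_group, props);
}
```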


Thanks,
-- 
Alex Wang,
Open vSwitch developer


Re: Cannot build static library, `undefined reference to `boost::filesystem::path::codecvt()'`

2018-01-10 Thread ALeX Wang
Oh, I see, I was using `apache-parquet-cpp-1.3.0`, which by default gets
arrow 97f9029ce835dfc2655ca91b9820a2e6aed89107.

I'm all good now; I will switch to 1.3.1 or later.

Thanks for pointing this out,



On 10 January 2018 at 07:01, Deepak Majeti <majeti.dee...@gmail.com> wrote:

> Looks like this dependency requirement is already fixed on arrow master
>
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/file.cc#L125
>
>
> On Tue, Jan 9, 2018 at 4:04 PM, ALeX Wang <ee07b...@gmail.com> wrote:
> > So, I ran `c++ -E` and confirmed that the `` is
> > indeed included here in arrow:
> >
> > https://github.com/apache/arrow/blob/97f9029ce835dfc2655ca91b9820a2
> e6aed89107/cpp/src/arrow/io/file.cc#L110-L111
> >
> > And it subsequently include `` which refers to
> > the `codecvt`... ... ...
> >
> > SO, seems to me that there is a missing dependency to the
> > `libboost_filesystem.a`,
> >
> > Thanks,
> >
> > On 7 January 2018 at 20:58, Deepak Majeti <majeti.dee...@gmail.com>
> wrote:
> >
> >> For static linking, certain boost libraries are transitively needed by
> >> parquet-cpp due to its dependency on Arrow.
> >>
> >> Below is a link to the arrow version your parquet-cpp branch is linking
> >> against
> >> https://github.com/apache/arrow/blob/97f9029ce835dfc2655ca91b9820a2
> >> e6aed89107/cpp/src/arrow/io/file.cc
> >>
> >> I see that "codecvt" header and symbols are included only if "_MSC_VER"
> is
> >> defined.
> >>
> >> You should not be seeing the "undefined reference to
> >> `boost::filesystem::path::codecvt()" on Centos7 with GNU compiler.
> >>
> >> On Sun, Jan 7, 2018 at 4:26 PM, ALeX Wang <ee07b...@gmail.com> wrote:
> >>
> >> > My environment is Centos7,
> >> >
> >> > I do not follow the question, since when i run `cmake` it says the
> >> version
> >> > arrow to use,,, e.g.
> >> >
> >> > ```
> >> > -- Building Apache Arrow from commit:
> >> > 97f9029ce835dfc2655ca91b9820a2e6aed89107
> >> > -- Build Type: RELEASE
> >> > -- Compiler id: GNU
> >> > Selected compiler gcc 4.8.5
> >> > -- Found cpplint executable at /opt/parquet-cpp/build-
> support/cpplint.py
> >> > ```
> >> >
> >> > Again, I was referring to the environment variables setting in the
> >> > `ci/travis_script_static.sh`, and set my env vars like below:
> >> >
> >> > ```
> >> >   190  export ARROW_EP=/opt/parquet-cpp/arrow_ep-prefix/src/arrow_ep-
> >> build
> >> >   191  export BROTLI_EP=$ARROW_EP/brotli_ep/src/brotli_ep-install/lib
> >> >   193  export
> >> > SNAPPY_STATIC_LIB=$ARROW_EP/snappy_ep/src/snappy_ep-
> >> > install/lib/libsnappy.a
> >> >   194  export BROTLI_STATIC_LIB_ENC=$BROTLI_EP/libbrotlienc.a
> >> >   195  export BROTLI_STATIC_LIB_DEC=$BROTLI_EP/libbrotlidec.a
> >> >   196  export BROTLI_STATIC_LIB_COMMON=$BROTLI_EP/libbrotlicommon.a
> >> >   197  export
> >> > ZLIB_STATIC_LIB=$ARROW_EP/zlib_ep/src/zlib_ep-install/lib/libz.a
> >> > ```
> >> >
> >> > And then i ran `cmake`...
> >> >
> >> >
> >> > Thanks,
> >> >
> >> > On 7 January 2018 at 03:56, Deepak Majeti <majeti.dee...@gmail.com>
> >> wrote:
> >> >
> >> > > What is the Arrow version of your setup? Can you verify if you see
> the
> >> > same
> >> > > issue on parquet-cpp master?
> >> > > It is possible we are missing the boost filesystem library
> dependency
> >> for
> >> > > the static build with MSVC.
> >> > >
> >> > > On Sat, Jan 6, 2018 at 10:33 AM, ALeX Wang <ee07b...@gmail.com>
> wrote:
> >> > >
> >> > > > Okay, my following change makes it working,,, could anyone please
> >> help
> >> > > > confirm,
> >> > > >
> >> > > > ```
> >> > > > diff --git a/CMakeLists.txt b/CMakeLists.txt
> >> > > > index fef0cab..25defd6 100644
> >> > > > --- a/CMakeLists.txt
> >> > > > +++ b/CMakeLists.txt
> >> > > > @@ -500,7 +500,8 @@ if (PARQUET_BOOST_USE_SHARED)
> >> > > >  else()
> >> > > >set(BOOST_LINK_LIBS
> >> > > >  bo

Re: Cannot build static library, `undefined reference to `boost::filesystem::path::codecvt()'`

2018-01-09 Thread ALeX Wang
So, I ran `c++ -E` and confirmed that the `` is indeed included here in
arrow:

https://github.com/apache/arrow/blob/97f9029ce835dfc2655ca91b9820a2e6aed89107/cpp/src/arrow/io/file.cc#L110-L111

And it subsequently includes ``, which refers to `codecvt`...

So it seems to me that there is a missing dependency on
`libboost_filesystem.a`.

Thanks,

On 7 January 2018 at 20:58, Deepak Majeti <majeti.dee...@gmail.com> wrote:

> For static linking, certain boost libraries are transitively needed by
> parquet-cpp due to its dependency on Arrow.
>
> Below is a link to the arrow version your parquet-cpp branch is linking
> against
> https://github.com/apache/arrow/blob/97f9029ce835dfc2655ca91b9820a2
> e6aed89107/cpp/src/arrow/io/file.cc
>
> I see that "codecvt" header and symbols are included only if "_MSC_VER" is
> defined.
>
> You should not be seeing the "undefined reference to
> `boost::filesystem::path::codecvt()" on Centos7 with GNU compiler.
>
> On Sun, Jan 7, 2018 at 4:26 PM, ALeX Wang <ee07b...@gmail.com> wrote:
>
> > My environment is Centos7,
> >
> > I do not follow the question, since when i run `cmake` it says the
> version
> > arrow to use,,, e.g.
> >
> > ```
> > -- Building Apache Arrow from commit:
> > 97f9029ce835dfc2655ca91b9820a2e6aed89107
> > -- Build Type: RELEASE
> > -- Compiler id: GNU
> > Selected compiler gcc 4.8.5
> > -- Found cpplint executable at /opt/parquet-cpp/build-support/cpplint.py
> > ```
> >
> > Again, I was referring to the environment variables setting in the
> > `ci/travis_script_static.sh`, and set my env vars like below:
> >
> > ```
> >   190  export ARROW_EP=/opt/parquet-cpp/arrow_ep-prefix/src/arrow_ep-
> build
> >   191  export BROTLI_EP=$ARROW_EP/brotli_ep/src/brotli_ep-install/lib
> >   193  export
> > SNAPPY_STATIC_LIB=$ARROW_EP/snappy_ep/src/snappy_ep-
> > install/lib/libsnappy.a
> >   194  export BROTLI_STATIC_LIB_ENC=$BROTLI_EP/libbrotlienc.a
> >   195  export BROTLI_STATIC_LIB_DEC=$BROTLI_EP/libbrotlidec.a
> >   196  export BROTLI_STATIC_LIB_COMMON=$BROTLI_EP/libbrotlicommon.a
> >   197  export
> > ZLIB_STATIC_LIB=$ARROW_EP/zlib_ep/src/zlib_ep-install/lib/libz.a
> > ```
> >
> > And then i ran `cmake`...
> >
> >
> > Thanks,
> >
> > On 7 January 2018 at 03:56, Deepak Majeti <majeti.dee...@gmail.com>
> wrote:
> >
> > > What is the Arrow version of your setup? Can you verify if you see the
> > same
> > > issue on parquet-cpp master?
> > > It is possible we are missing the boost filesystem library dependency
> for
> > > the static build with MSVC.
> > >
> > > On Sat, Jan 6, 2018 at 10:33 AM, ALeX Wang <ee07b...@gmail.com> wrote:
> > >
> > > > Okay, my following change makes it working,,, could anyone please
> help
> > > > confirm,
> > > >
> > > > ```
> > > > diff --git a/CMakeLists.txt b/CMakeLists.txt
> > > > index fef0cab..25defd6 100644
> > > > --- a/CMakeLists.txt
> > > > +++ b/CMakeLists.txt
> > > > @@ -500,7 +500,8 @@ if (PARQUET_BOOST_USE_SHARED)
> > > >  else()
> > > >set(BOOST_LINK_LIBS
> > > >  boost_static_regex
> > > > -boost_static_system)
> > > > +boost_static_system
> > > > +boost_static_filesystem)
> > > >  endif()
> > > >
> > > >  #
> > > > diff --git a/cmake_modules/ThirdpartyToolchain.cmake
> > > > b/cmake_modules/ThirdpartyToolchain.cmake
> > > > index 1221765..d19c4eb 100644
> > > > --- a/cmake_modules/ThirdpartyToolchain.cmake
> > > > +++ b/cmake_modules/ThirdpartyToolchain.cmake
> > > > @@ -81,13 +81,15 @@ if (PARQUET_BOOST_USE_SHARED)
> > > >  else()
> > > ># Find static Boost libraries.
> > > >set(Boost_USE_STATIC_LIBS ON)
> > > > -  find_package(Boost COMPONENTS regex system REQUIRED)
> > > > +  find_package(Boost COMPONENTS regex system filesystem REQUIRED)
> > > >if ("${UPPERCASE_BUILD_TYPE}" STREQUAL "DEBUG")
> > > >  set(BOOST_STATIC_REGEX_LIBRARY ${Boost_REGEX_LIBRARY_DEBUG})
> > > >  set(BOOST_STATIC_SYSTEM_LIBRARY ${Boost_SYSTEM_LIBRARY_DEBUG})
> > > > +set(BOOST_STATIC_FILESYSTEM_LIBRARY ${Boost_FILESYSTEM_LIBRARY_
> > > > DEBUG})
> > > >else()
> > > >  set(BOOST_STATIC

Re: Cannot build static library, `undefined reference to `boost::filesystem::path::codecvt()'`

2018-01-07 Thread ALeX Wang
My environment is CentOS 7.

I do not follow the question, since when I run `cmake` it prints the Arrow
version to use, e.g.

```
-- Building Apache Arrow from commit:
97f9029ce835dfc2655ca91b9820a2e6aed89107
-- Build Type: RELEASE
-- Compiler id: GNU
Selected compiler gcc 4.8.5
-- Found cpplint executable at /opt/parquet-cpp/build-support/cpplint.py
```

Again, I was referring to the environment variable settings in
`ci/travis_script_static.sh`, and set my env vars as below:

```
  190  export ARROW_EP=/opt/parquet-cpp/arrow_ep-prefix/src/arrow_ep-build
  191  export BROTLI_EP=$ARROW_EP/brotli_ep/src/brotli_ep-install/lib
  193  export
SNAPPY_STATIC_LIB=$ARROW_EP/snappy_ep/src/snappy_ep-install/lib/libsnappy.a
  194  export BROTLI_STATIC_LIB_ENC=$BROTLI_EP/libbrotlienc.a
  195  export BROTLI_STATIC_LIB_DEC=$BROTLI_EP/libbrotlidec.a
  196  export BROTLI_STATIC_LIB_COMMON=$BROTLI_EP/libbrotlicommon.a
  197  export
ZLIB_STATIC_LIB=$ARROW_EP/zlib_ep/src/zlib_ep-install/lib/libz.a
```

And then I ran `cmake`...


Thanks,

On 7 January 2018 at 03:56, Deepak Majeti <majeti.dee...@gmail.com> wrote:

> What is the Arrow version of your setup? Can you verify if you see the same
> issue on parquet-cpp master?
> It is possible we are missing the boost filesystem library dependency for
> the static build with MSVC.
>
> On Sat, Jan 6, 2018 at 10:33 AM, ALeX Wang <ee07b...@gmail.com> wrote:
>
> > Okay, my following change makes it working,,, could anyone please help
> > confirm,
> >
> > ```
> > diff --git a/CMakeLists.txt b/CMakeLists.txt
> > index fef0cab..25defd6 100644
> > --- a/CMakeLists.txt
> > +++ b/CMakeLists.txt
> > @@ -500,7 +500,8 @@ if (PARQUET_BOOST_USE_SHARED)
> >  else()
> >set(BOOST_LINK_LIBS
> >  boost_static_regex
> > -boost_static_system)
> > +boost_static_system
> > +boost_static_filesystem)
> >  endif()
> >
> >  #
> > diff --git a/cmake_modules/ThirdpartyToolchain.cmake
> > b/cmake_modules/ThirdpartyToolchain.cmake
> > index 1221765..d19c4eb 100644
> > --- a/cmake_modules/ThirdpartyToolchain.cmake
> > +++ b/cmake_modules/ThirdpartyToolchain.cmake
> > @@ -81,13 +81,15 @@ if (PARQUET_BOOST_USE_SHARED)
> >  else()
> ># Find static Boost libraries.
> >set(Boost_USE_STATIC_LIBS ON)
> > -  find_package(Boost COMPONENTS regex system REQUIRED)
> > +  find_package(Boost COMPONENTS regex system filesystem REQUIRED)
> >if ("${UPPERCASE_BUILD_TYPE}" STREQUAL "DEBUG")
> >  set(BOOST_STATIC_REGEX_LIBRARY ${Boost_REGEX_LIBRARY_DEBUG})
> >  set(BOOST_STATIC_SYSTEM_LIBRARY ${Boost_SYSTEM_LIBRARY_DEBUG})
> > +set(BOOST_STATIC_FILESYSTEM_LIBRARY ${Boost_FILESYSTEM_LIBRARY_
> > DEBUG})
> >else()
> >  set(BOOST_STATIC_REGEX_LIBRARY ${Boost_REGEX_LIBRARY_RELEASE})
> >  set(BOOST_STATIC_SYSTEM_LIBRARY ${Boost_SYSTEM_LIBRARY_RELEASE})
> > +set(BOOST_STATIC_FILESYSTEM_LIBRARY
> > ${Boost_FILESYSTEM_LIBRARY_RELEASE})
> >endif()
> >  endif()
> >
> > @@ -115,6 +117,8 @@ else()
> >set_target_properties(boost_static_regex PROPERTIES IMPORTED_LOCATION
> > ${BOOST_STATIC_REGEX_LIBRARY})
> >add_library(boost_static_system STATIC IMPORTED)
> >set_target_properties(boost_static_system PROPERTIES
> IMPORTED_LOCATION
> > ${BOOST_STATIC_SYSTEM_LIBRARY})
> > +  add_library(boost_static_filesystem STATIC IMPORTED)
> > +  set_target_properties(boost_static_filesystem PROPERTIES
> > IMPORTED_LOCATION ${BOOST_STATIC_FILESYSTEM_LIBRARY})
> >  endif()
> >
> >  include_directories(SYSTEM ${Boost_INCLUDE_DIRS})
> > ```
> >
> > On 5 January 2018 at 20:29, ALeX Wang <ee07b...@gmail.com> wrote:
> >
> > > Also, i'm building from
> > >
> > > ```
> > > commit 18ca3922e688a3a730d693ff8f2cfbfd65da8c46
> > > Author: Uwe L. Korn <u...@apache.org>
> > > Date:   Sun Sep 17 13:55:43 2017 -0400
> > >
> > > PARQUET-1094: Add benchmark for boolean Arrow column I/O
> > >
> > > Author: Uwe L. Korn <u...@apache.org>
> > >
> > > Closes #391 from xhochy/PARQUET-1094 and squashes the following
> > > commits:
> > >
> > > 089bb3c [Uwe L. Korn] PARQUET-1094: Add benchmark for boolean Arrow
> > > column I/O
> > > ```
> > >
> > > On 5 January 2018 at 20:27, ALeX Wang <ee07b...@gmail.com> wrote:
> > >
> > >> Hi,
> > >>
> > >> I'm referring to the https://github.com/apache/
>

Re: Cannot build static library, `undefined reference to `boost::filesystem::path::codecvt()'`

2018-01-05 Thread ALeX Wang
Okay, the following change makes it work; could anyone please help
confirm?

```
diff --git a/CMakeLists.txt b/CMakeLists.txt
index fef0cab..25defd6 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -500,7 +500,8 @@ if (PARQUET_BOOST_USE_SHARED)
 else()
   set(BOOST_LINK_LIBS
 boost_static_regex
-boost_static_system)
+boost_static_system
+boost_static_filesystem)
 endif()

 #
diff --git a/cmake_modules/ThirdpartyToolchain.cmake
b/cmake_modules/ThirdpartyToolchain.cmake
index 1221765..d19c4eb 100644
--- a/cmake_modules/ThirdpartyToolchain.cmake
+++ b/cmake_modules/ThirdpartyToolchain.cmake
@@ -81,13 +81,15 @@ if (PARQUET_BOOST_USE_SHARED)
 else()
   # Find static Boost libraries.
   set(Boost_USE_STATIC_LIBS ON)
-  find_package(Boost COMPONENTS regex system REQUIRED)
+  find_package(Boost COMPONENTS regex system filesystem REQUIRED)
   if ("${UPPERCASE_BUILD_TYPE}" STREQUAL "DEBUG")
 set(BOOST_STATIC_REGEX_LIBRARY ${Boost_REGEX_LIBRARY_DEBUG})
 set(BOOST_STATIC_SYSTEM_LIBRARY ${Boost_SYSTEM_LIBRARY_DEBUG})
+set(BOOST_STATIC_FILESYSTEM_LIBRARY ${Boost_FILESYSTEM_LIBRARY_DEBUG})
   else()
 set(BOOST_STATIC_REGEX_LIBRARY ${Boost_REGEX_LIBRARY_RELEASE})
 set(BOOST_STATIC_SYSTEM_LIBRARY ${Boost_SYSTEM_LIBRARY_RELEASE})
+set(BOOST_STATIC_FILESYSTEM_LIBRARY
${Boost_FILESYSTEM_LIBRARY_RELEASE})
   endif()
 endif()

@@ -115,6 +117,8 @@ else()
   set_target_properties(boost_static_regex PROPERTIES IMPORTED_LOCATION
${BOOST_STATIC_REGEX_LIBRARY})
   add_library(boost_static_system STATIC IMPORTED)
   set_target_properties(boost_static_system PROPERTIES IMPORTED_LOCATION
${BOOST_STATIC_SYSTEM_LIBRARY})
+  add_library(boost_static_filesystem STATIC IMPORTED)
+  set_target_properties(boost_static_filesystem PROPERTIES
IMPORTED_LOCATION ${BOOST_STATIC_FILESYSTEM_LIBRARY})
 endif()

 include_directories(SYSTEM ${Boost_INCLUDE_DIRS})
```

On 5 January 2018 at 20:29, ALeX Wang <ee07b...@gmail.com> wrote:

> Also, i'm building from
>
> ```
> commit 18ca3922e688a3a730d693ff8f2cfbfd65da8c46
> Author: Uwe L. Korn <u...@apache.org>
> Date:   Sun Sep 17 13:55:43 2017 -0400
>
> PARQUET-1094: Add benchmark for boolean Arrow column I/O
>
> Author: Uwe L. Korn <u...@apache.org>
>
> Closes #391 from xhochy/PARQUET-1094 and squashes the following
> commits:
>
> 089bb3c [Uwe L. Korn] PARQUET-1094: Add benchmark for boolean Arrow
> column I/O
> ```
>
> On 5 January 2018 at 20:27, ALeX Wang <ee07b...@gmail.com> wrote:
>
>> Hi,
>>
>> I'm referring to the https://github.com/apache/
>> parquet-cpp/blob/master/ci/travis_script_static.sh
>> and try to build static library,
>>
>> My cmake cmdline looks like:
>> ```
>> cmake3 -DCMAKE_BUILD_TYPE=Release -DPARQUET_BUILD_EXAMPLES=OFF
>> -DPARQUET_BUILD_TESTS=OFF  -DPARQUET_ARROW_LINKAGE="static"
>>  -DPARQUET_BUILD_SHARED=OFF -DPARQUET_BOOST_USE_SHARED=OFF  .
>> ```
>>
>> However, the compilation failed with:
>> ```
>> [ 90%] Linking CXX executable ../build/release/parquet-scan
>> ../build/release/libarrow.a(file.cc.o): In function
>> `arrow::io::FileOutputStream::Open(std::string const&, bool,
>> std::shared_ptr*)':
>> file.cc:(.text+0x2f1f): undefined reference to
>> `boost::filesystem::path::codecvt()'
>> ../build/release/libarrow.a(file.cc.o): In function
>> `arrow::io::ReadableFile::Open(std::string const&, arrow::MemoryPool*,
>> std::shared_ptr*)':
>> file.cc:(.text+0x33d0): undefined reference to
>> `boost::filesystem::path::codecvt()'
>> ../build/release/libarrow.a(file.cc.o): In function
>> `arrow::io::ReadableFile::Open(std::string const&,
>> std::shared_ptr*)':
>> file.cc:(.text+0x3865): undefined reference to
>> `boost::filesystem::path::codecvt()'
>> ../build/release/libarrow.a(file.cc.o): In function
>> `arrow::io::MemoryMappedFile::MemoryMap::Open(std::string const&,
>> arrow::io::FileMode::type)':
>> file.cc:(.text._ZN5arrow2io16MemoryMappedFile9MemoryMap4Open
>> ERKSsNS0_8FileMode4typeE[_ZN5arrow2io16MemoryMappedFile9Memo
>> ryMap4OpenERKSsNS0_8FileMode4typeE]+0xba): undefined reference to
>> `boost::filesystem::path::codecvt()'
>> file.cc:(.text._ZN5arrow2io16MemoryMappedFile9MemoryMap4Open
>> ERKSsNS0_8FileMode4typeE[_ZN5arrow2io16MemoryMappedFile9Memo
>> ryMap4OpenERKSsNS0_8FileMode4typeE]+0x186): undefined reference to
>> `boost::filesystem::path::codecvt()'
>> collect2: error: ld returned 1 exit status
>> ```
>>
>> Any idea where I did wrong?
>>
>> Thanks,
>> --
>> Alex Wang,
>> Open vSwitch developer
>>
>
>
>
> --
> Alex Wang,
> Open vSwitch developer
>



-- 
Alex Wang,
Open vSwitch developer


Re: Cannot build static library, `undefined reference to `boost::filesystem::path::codecvt()'`

2018-01-05 Thread ALeX Wang
Also, i'm building from

```
commit 18ca3922e688a3a730d693ff8f2cfbfd65da8c46
Author: Uwe L. Korn <u...@apache.org>
Date:   Sun Sep 17 13:55:43 2017 -0400

PARQUET-1094: Add benchmark for boolean Arrow column I/O

Author: Uwe L. Korn <u...@apache.org>

Closes #391 from xhochy/PARQUET-1094 and squashes the following commits:

089bb3c [Uwe L. Korn] PARQUET-1094: Add benchmark for boolean Arrow
column I/O
```

On 5 January 2018 at 20:27, ALeX Wang <ee07b...@gmail.com> wrote:

> Hi,
>
> I'm referring to the https://github.com/apache/parquet-cpp/blob/master/ci/
> travis_script_static.sh
> and try to build static library,
>
> My cmake cmdline looks like:
> ```
> cmake3 -DCMAKE_BUILD_TYPE=Release -DPARQUET_BUILD_EXAMPLES=OFF
> -DPARQUET_BUILD_TESTS=OFF  -DPARQUET_ARROW_LINKAGE="static"
>  -DPARQUET_BUILD_SHARED=OFF -DPARQUET_BOOST_USE_SHARED=OFF  .
> ```
>
> However, the compilation failed with:
> ```
> [ 90%] Linking CXX executable ../build/release/parquet-scan
> ../build/release/libarrow.a(file.cc.o): In function
> `arrow::io::FileOutputStream::Open(std::string const&, bool,
> std::shared_ptr*)':
> file.cc:(.text+0x2f1f): undefined reference to `boost::filesystem::path::
> codecvt()'
> ../build/release/libarrow.a(file.cc.o): In function
> `arrow::io::ReadableFile::Open(std::string const&, arrow::MemoryPool*,
> std::shared_ptr*)':
> file.cc:(.text+0x33d0): undefined reference to `boost::filesystem::path::
> codecvt()'
> ../build/release/libarrow.a(file.cc.o): In function
> `arrow::io::ReadableFile::Open(std::string const&,
> std::shared_ptr*)':
> file.cc:(.text+0x3865): undefined reference to `boost::filesystem::path::
> codecvt()'
> ../build/release/libarrow.a(file.cc.o): In function
> `arrow::io::MemoryMappedFile::MemoryMap::Open(std::string const&,
> arrow::io::FileMode::type)':
> file.cc:(.text._ZN5arrow2io16MemoryMappedFile9MemoryMap4OpenERKSsNS0_
> 8FileMode4typeE[_ZN5arrow2io16MemoryMappedFile9MemoryMap4OpenERKSsNS0_8FileMode4typeE]+0xba):
> undefined reference to `boost::filesystem::path::codecvt()'
> file.cc:(.text._ZN5arrow2io16MemoryMappedFile9MemoryMap4OpenERKSsNS0_
> 8FileMode4typeE[_ZN5arrow2io16MemoryMappedFile9MemoryMap4OpenERKSsNS0_8FileMode4typeE]+0x186):
> undefined reference to `boost::filesystem::path::codecvt()'
> collect2: error: ld returned 1 exit status
> ```
>
> Any idea where I did wrong?
>
> Thanks,
> --
> Alex Wang,
> Open vSwitch developer
>



-- 
Alex Wang,
Open vSwitch developer


Cannot build static library, `undefined reference to `boost::filesystem::path::codecvt()'`

2018-01-05 Thread ALeX Wang
Hi,

I'm referring to
https://github.com/apache/parquet-cpp/blob/master/ci/travis_script_static.sh
and trying to build a static library.

My cmake cmdline looks like:
```
cmake3 -DCMAKE_BUILD_TYPE=Release -DPARQUET_BUILD_EXAMPLES=OFF
-DPARQUET_BUILD_TESTS=OFF  -DPARQUET_ARROW_LINKAGE="static"
 -DPARQUET_BUILD_SHARED=OFF -DPARQUET_BOOST_USE_SHARED=OFF  .
```

However, the compilation failed with:
```
[ 90%] Linking CXX executable ../build/release/parquet-scan
../build/release/libarrow.a(file.cc.o): In function
`arrow::io::FileOutputStream::Open(std::string const&, bool,
std::shared_ptr*)':
file.cc:(.text+0x2f1f): undefined reference to
`boost::filesystem::path::codecvt()'
../build/release/libarrow.a(file.cc.o): In function
`arrow::io::ReadableFile::Open(std::string const&, arrow::MemoryPool*,
std::shared_ptr*)':
file.cc:(.text+0x33d0): undefined reference to
`boost::filesystem::path::codecvt()'
../build/release/libarrow.a(file.cc.o): In function
`arrow::io::ReadableFile::Open(std::string const&,
std::shared_ptr*)':
file.cc:(.text+0x3865): undefined reference to
`boost::filesystem::path::codecvt()'
../build/release/libarrow.a(file.cc.o): In function
`arrow::io::MemoryMappedFile::MemoryMap::Open(std::string const&,
arrow::io::FileMode::type)':
file.cc:(.text._ZN5arrow2io16MemoryMappedFile9MemoryMap4OpenERKSsNS0_8FileMode4typeE[_ZN5arrow2io16MemoryMappedFile9MemoryMap4OpenERKSsNS0_8FileMode4typeE]+0xba):
undefined reference to `boost::filesystem::path::codecvt()'
file.cc:(.text._ZN5arrow2io16MemoryMappedFile9MemoryMap4OpenERKSsNS0_8FileMode4typeE[_ZN5arrow2io16MemoryMappedFile9MemoryMap4OpenERKSsNS0_8FileMode4typeE]+0x186):
undefined reference to `boost::filesystem::path::codecvt()'
collect2: error: ld returned 1 exit status
```

Any idea where I went wrong?

Thanks,
-- 
Alex Wang,
Open vSwitch developer


Re: What is the correct way to read 1-Dimension bytearray array

2017-12-29 Thread ALeX Wang
Thx for making my day !~ ;D



On 29 December 2017 at 14:47, Wes McKinney <wesmck...@gmail.com> wrote:

> Also, I think you can use the `Scanner` / `TypedScanner` APIs to do
> precisely this:
>
> https://github.com/apache/parquet-cpp/blob/master/src/
> parquet/column_scanner.h#L88
>
> These do what the API I described in pseudocode does -- didn't
> remember it soon enough for my e-mail.
>
> On Fri, Dec 29, 2017 at 4:11 PM, ALeX Wang <ee07b...@gmail.com> wrote:
> > Hi Wes,
> >
> > Thanks a lot for your reply, I'll try something as you suggested,
> >
> > Thanks,
> > Alex Wang,
> >
> > On 29 December 2017 at 11:40, Wes McKinney <wesmck...@gmail.com> wrote:
> >
> >> hi Alex,
> >>
> >> I would suggest that you handle batch buffering on the application
> >> side, _not_ calling ReadBatch(1, ...) which will be much slower -- the
> >> parquet-cpp APIs are intended to be used for batch read and writes, so
> >> if you need to read a table row by row, you could create some C++
> >> classes with a particular batch size that manage an internal buffer of
> >> values that have been read from the column.
> >>
> >> As an example, suppose you wish to buffer 1000 values from the column
> >> at a time. Then you could create an API that looks like:
> >>
> >> BufferedColumnReader buffered_reader(batch_reader);
> >> buffered_reader.set_batch_size(1000);
> >>
> >> const ByteArray* val;
> >> while (val = buffered_reader.Next()) {
> >>   // Do something with val
> >> }
> >>
> >> The ByteArray values do not own their data, so if you wish to persist
> >> the memory between (internal) calls to ReadBatch, you will have to
> >> copy the memory someplace else. We do not perform this copy for you in
> >> the low level ReadBatch API because it would hurt performance for
> >> users who wish to put the memory someplace else (like in an Arrow
> >> columnar array buffer)
> >>
> >> I recommend looking at the Apache Arrow-based reader API which does
> >> all this for you including memory management.
> >>
> >> Thanks
> >> Wes
> >>
> >> On Thu, Dec 28, 2017 at 12:05 PM, ALeX Wang <ee07b...@gmail.com> wrote:
> >> > Hi,
> >> >
> >> > Assume the column type is of 1-Dimension ByteArray array, (definition
> >> level
> >> > - 1, and repetition - repeated).
> >> >
> >> >
> >> > If I want to read the column values one row at a time, I have to keep
> >> read
> >> > (i.e.
> >> > calling ReadBatch(1,...)) until getting a value of 'rep_level=0'. At
> that
> >> > point, I can
> >> > construct previously read ByteArrays and return it as for the row.
> >> >
> >> > However, since 'ByteArray->ptr' points to the column page memory which
> >> > (based
> >> > on my understanding)  will be gone when calling 'HasNext()' and move
> to
> >> the
> >> > next
> >> > page.  So that means i have to maintain a copy of the 'ByteArray->ptr'
> >> for
> >> > all the
> >> > previously read values.
> >> >
> >> > This really seems to me to be too complicated..
> >> > Would like to ask if there is a better way of doing:
> >> >1. Reading 1D array in row-by-row fashion.
> >> >2. Zero-copy 'ByteArray->ptr'
> >> >
> >> > Thanks a lot,
> >> > --
> >> > Alex Wang,
> >> > Open vSwitch developer
> >>
> >
> >
> >
> > --
> > Alex Wang,
> > Open vSwitch developer
>
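To make the buffering idea above a bit more concrete, here is a hedged sketch
of such a wrapper (the class name and details are invented for illustration;
the `Scanner`/`TypedScanner` helpers Wes links to already provide this kind
of API). It copies each value out of the page buffer, which is exactly the
copy the low-level `ReadBatch` deliberately does not do for you:

```
// Hedged sketch of an application-side buffered reader over ReadBatch.
// Null/def-level bookkeeping and error handling are omitted for brevity.
#include <cstdint>
#include <memory>
#include <string>
#include <vector>
#include "parquet/api/reader.h"

class BufferedByteArrayReader {
 public:
  BufferedByteArrayReader(std::shared_ptr<parquet::ByteArrayReader> reader,
                          int64_t batch_size)
      : reader_(std::move(reader)), batch_size_(batch_size) {}

  // Returns false when the column is exhausted.
  bool Next(std::string* out) {
    if (pos_ == buffer_.size() && !Refill()) return false;
    *out = buffer_[pos_++];
    return true;
  }

 private:
  bool Refill() {
    buffer_.clear();
    pos_ = 0;
    std::vector<parquet::ByteArray> values(batch_size_);
    std::vector<int16_t> def_levels(batch_size_), rep_levels(batch_size_);
    while (buffer_.empty() && reader_->HasNext()) {
      int64_t values_read = 0;
      reader_->ReadBatch(batch_size_, def_levels.data(), rep_levels.data(),
                         values.data(), &values_read);
      for (int64_t i = 0; i < values_read; ++i) {
        // Copy out of the page buffer so the bytes stay valid afterwards.
        buffer_.emplace_back(reinterpret_cast<const char*>(values[i].ptr),
                             values[i].len);
      }
    }
    return !buffer_.empty();
  }

  std::shared_ptr<parquet::ByteArrayReader> reader_;
  int64_t batch_size_;
  std::vector<std::string> buffer_;
  size_t pos_ = 0;
};
```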



-- 
Alex Wang,
Open vSwitch developer


What is the correct way to read 1-Dimension bytearray array

2017-12-28 Thread ALeX Wang
Hi,

Assume the column is a 1-dimensional ByteArray array (definition level 1,
repetition type repeated).


If I want to read the column values one row at a time, I have to keep
reading (i.e. calling ReadBatch(1, ...)) until I get a value with
'rep_level=0'. At that point, I can assemble the previously read ByteArrays
and return them as the row.

However, 'ByteArray->ptr' points into the column page memory, which (based
on my understanding) will be gone once 'HasNext()' moves to the next page.
So that means I have to keep a copy of the data behind 'ByteArray->ptr' for
all the previously read values.

This really seems too complicated to me.
Would like to ask if there is a better way of doing:
   1. Reading a 1-D array in row-by-row fashion.
   2. Zero-copy access to 'ByteArray->ptr'.
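For reference, here is a rough sketch of the row-by-row loop described above
(hedged: it assumes `col` is the `ByteArrayReader` obtained from
`RowGroupReader::Column(i)`, and it simply copies every value into a
`std::string`, since `ByteArray::ptr` only stays valid while the current
page is loaded):

```
// Illustration only: group values of a repeated ByteArray column into rows
// by watching for rep_level == 0, copying each payload out of the page.
#include <cstdint>
#include <string>
#include <vector>
#include "parquet/api/reader.h"

void ReadRowByRow(parquet::ByteArrayReader* col) {
  std::vector<std::string> current_row;
  while (col->HasNext()) {
    parquet::ByteArray value;
    int16_t def_level = 0;
    int16_t rep_level = 0;
    int64_t values_read = 0;
    col->ReadBatch(1, &def_level, &rep_level, &value, &values_read);

    if (rep_level == 0 && !current_row.empty()) {
      // rep_level == 0 starts a new row: hand the finished row off here.
      current_row.clear();
    }
    if (values_read > 0) {  // values_read == 0 means a null at this level
      current_row.emplace_back(reinterpret_cast<const char*>(value.ptr),
                               value.len);
    }
  }
  // The last accumulated row (if any) would be handed off here as well.
}
```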

Thanks a lot,
-- 
Alex Wang,
Open vSwitch developer