[ 
https://issues.apache.org/jira/browse/PARQUET-1241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16585285#comment-16585285
 ] 

Alex Wang edited comment on PARQUET-1241 at 8/19/18 11:26 PM:
--------------------------------------------------------------

[~wesmckinn] sorry for the delayed reply,

 

-I'd like to add an lz4-hadoop (framed) format to Arrow, which aligns with my 
work interests.-  For the official LZ4 frame format, I'd like to help with that 
as well, but that depends on my work schedule.

 

Sorry, on second thought, I meant to add an LZ4 compressor (one that uses the 
open source lz4-java library on GitHub) to parquet-mr, like SnappyCompressor.java: 
[https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/codec/SnappyCompressor.java]

 

The reason is that even if I added a new LZ4 Hadoop codec to arrow/cpp, 
parquet-mr would still write the Hadoop LZ4 format and set the compression type 
to LZ4 in the file's metadata.
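
To make the ambiguity concrete, here is a rough Python sketch (standard library 
only) of how a reader might guess which LZ4 layout a buffer uses when the 
metadata only says LZ4. It assumes the Hadoop layout is a 4-byte big-endian 
uncompressed length, then a 4-byte big-endian compressed length, then the raw 
LZ4 block, and it only handles a buffer holding a single such chunk. The 
function names are hypothetical and the check is a heuristic, not what 
parquet-mr or arrow/cpp actually does.

```python
import struct

# LZ4 frame magic number 0x184D2204, stored little-endian on disk.
LZ4_FRAME_MAGIC = b"\x04\x22\x4d\x18"


def hadoop_wrap(compressed_block: bytes, uncompressed_len: int) -> bytes:
    """Prefix one already-compressed LZ4 block with Hadoop-style lengths.

    Assumed layout: big-endian uncompressed length, big-endian compressed
    length, then the block bytes. No compression happens here.
    """
    header = struct.pack(">II", uncompressed_len, len(compressed_block))
    return header + compressed_block


def guess_lz4_layout(buf: bytes) -> str:
    """Heuristically classify a buffer as 'frame', 'hadoop', or 'raw-block'."""
    if buf[:4] == LZ4_FRAME_MAGIC:
        return "frame"
    if len(buf) >= 8:
        uncompressed_len, compressed_len = struct.unpack(">II", buf[:8])
        # Single-chunk assumption: the second length must cover exactly the
        # bytes that follow the 8-byte prefix.
        if uncompressed_len > 0 and compressed_len == len(buf) - 8:
            return "hadoop"
    # Anything else is treated as a bare LZ4 block with no framing at all.
    return "raw-block"
```

A raw block can, by coincidence, start with bytes that look like valid lengths, 
so a guess like this is only a fallback. An explicit compression type in the 
metadata would avoid the guessing.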

 


was (Author: ee07b291):
[~wesmckinn] sorry for the delayed reply,

 

I'd like to add an lz4-hadoop (framed) format to Arrow, which aligns with my 
work interests.  For the official LZ4 frame format, I'd like to help with that 
as well, but that depends on my work schedule.

> Use LZ4 frame format
> --------------------
>
>                 Key: PARQUET-1241
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1241
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-cpp, parquet-format
>            Reporter: Lawrence Chan
>            Priority: Major
>
> The parquet-format spec doesn't currently specify whether lz4-compressed data 
> should be framed or not. We should choose one and make it explicit in the 
> spec, as they are not inter-operable. After some discussions with others [1], 
> we think it would be beneficial to use the framed format, which adds a small 
> header in exchange for more self-contained decompression as well as a richer 
> feature set (checksums, parallel decompression, etc).
> The current arrow implementation compresses using the lz4 block format, and 
> this would need to be updated when we add the spec clarification.
> If backwards compatibility is a concern, I would suggest adding an additional 
> LZ4_FRAMED compression type, but that may be more noise than anything.
> [1] https://github.com/dask/fastparquet/issues/314



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
