[ https://issues.apache.org/jira/browse/PARQUET-1241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16585288#comment-16585288 ]

Jonathan Underwood commented on PARQUET-1241:
---------------------------------------------

I think there's a danger here of misunderstanding, as a few different things 
are being discussed. There are three possible implementations:
 # The current Arrow implementation, which uses the LZ4 +block+ storage format 
for column data, with the size of the _+uncompressed+_ data stored as external 
column metadata.
 # The current Hadoop LZ4 implementation, which uses the LZ4 +block+ storage 
format with the sizes of the _+compressed+_ and the _+uncompressed+_ data 
prepended to the column data as 8 bytes of extra data.
 # A proposed new storage format, which would use the LZ4 +frame+ format to 
store the compressed column data. The frame format allows the size of the 
uncompressed data to be stored within the compressed frame as frame metadata, 
but does not require it - a decision would be needed on whether the LZ4 
frame-compressed column data should include the size of the uncompressed 
data, a checksum of the data, etc. Those extra pieces of data are already 
stored as column metadata, so this would duplicate information, but 
harmlessly.
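For concreteness, here is a minimal Python sketch (not taken from any Parquet 
or Hadoop codebase) of the byte layouts being contrasted. It models only the 
size-prefix and magic-number logic, with a placeholder payload standing in 
for real LZ4-compressed bytes; the ordering of the two Hadoop size fields 
(uncompressed first, then compressed) is my reading of Hadoop's 
BlockCompressorStream and should be double-checked:

```python
import struct

# The LZ4 frame format begins with this 4-byte magic, stored little-endian.
LZ4_FRAME_MAGIC = 0x184D2204


def hadoop_wrap(compressed: bytes, uncompressed_len: int) -> bytes:
    """Prepend the 8 bytes of Hadoop-style metadata: big-endian
    uncompressed size, then big-endian compressed size (field order
    is an assumption, see lead-in)."""
    return struct.pack(">II", uncompressed_len, len(compressed)) + compressed


def hadoop_unwrap(data: bytes) -> tuple:
    """Recover (uncompressed_len, compressed_payload) from the
    Hadoop-style layout."""
    uncompressed_len, compressed_len = struct.unpack_from(">II", data, 0)
    return uncompressed_len, data[8:8 + compressed_len]


def looks_like_lz4_frame(data: bytes) -> bool:
    """A frame-format stream is self-describing: it starts with the
    magic number, so a reader can distinguish it from a raw block or
    a Hadoop-prefixed block."""
    return len(data) >= 4 and \
        struct.unpack_from("<I", data, 0)[0] == LZ4_FRAME_MAGIC
```

The point of the sketch is that options 1 and 2 need out-of-band knowledge 
(external metadata, or the 8-byte convention) to decompress, whereas the 
frame format carries enough in-band structure to be decoded on its own.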

> Use LZ4 frame format
> --------------------
>
>                 Key: PARQUET-1241
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1241
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-cpp, parquet-format
>            Reporter: Lawrence Chan
>            Priority: Major
>
> The parquet-format spec doesn't currently specify whether lz4-compressed data 
> should be framed or not. We should choose one and make it explicit in the 
> spec, as they are not inter-operable. After some discussions with others [1], 
> we think it would be beneficial to use the framed format, which adds a small 
> header in exchange for more self-contained decompression as well as a richer 
> feature set (checksums, parallel decompression, etc).
>
> The current arrow implementation compresses using the lz4 block format, and 
> this would need to be updated when we add the spec clarification.
>
> If backwards compatibility is a concern, I would suggest adding an additional 
> LZ4_FRAMED compression type, but that may be more noise than anything.
>
> [1] https://github.com/dask/fastparquet/issues/314



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
