[ 
https://issues.apache.org/jira/browse/PARQUET-1241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16394208#comment-16394208
 ] 

Jonathan Underwood commented on PARQUET-1241:
---------------------------------------------

This seems like a good idea from an interoperability perspective - the goal of 
the Frame format was to ensure interoperability across different bindings and 
implementations of LZ4.

However, it's worth noting that the Frame format allows the following to be 
stored in the frame:
 * The uncompressed data size (in the frame header)
 * A checksum of the uncompressed data (in the frame footer)

Neither of these is required (so both can be omitted). I mention this because 
storing those two things would duplicate information already stored in the 
column metadata, externally to the compressed data. So, it may be preferable to 
specify that these two things shouldn't be stored in the lz4frame compressed 
chunks to avoid redundancy. On the other hand, storing these two things in the 
frame header/footer does no harm, and storing the first item allows for optimal 
buffer sizing during decompression.
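
To illustrate the two optional fields, here is a rough sketch that inspects the 
start of an LZ4 frame and reports whether the content size and content checksum 
flags are set. It follows the published LZ4 Frame format layout (magic number, 
FLG byte, BD byte, optional 8-byte content size). The function name 
parse_lz4_frame_flags is hypothetical and is not part of any LZ4 or Parquet 
library; the sketch does not verify the header checksum byte or decode blocks.

```python
import struct

LZ4F_MAGIC = 0x184D2204  # LZ4 frame magic number, stored little-endian


def parse_lz4_frame_flags(buf: bytes) -> dict:
    """Hypothetical helper: read the optional-field flags of an LZ4 frame.

    Only the frame descriptor is inspected. The header checksum byte is
    not verified and no data blocks are decoded.
    """
    if len(buf) < 6:
        raise ValueError("buffer too short for an LZ4 frame header")
    (magic,) = struct.unpack_from("<I", buf, 0)
    if magic != LZ4F_MAGIC:
        raise ValueError("not an LZ4 frame (magic number mismatch)")

    flg = buf[4]                      # FLG byte follows the magic number
    if (flg >> 6) & 0x3 != 1:         # bits 7-6 hold the format version (01)
        raise ValueError("unsupported LZ4 frame version")

    has_content_size = bool(flg & 0x08)      # bit 3: content size present
    has_content_checksum = bool(flg & 0x04)  # bit 2: content checksum present

    content_size = None
    if has_content_size:
        if len(buf) < 14:
            raise ValueError("buffer too short for the content size field")
        # The 8-byte little-endian size sits right after the BD byte.
        (content_size,) = struct.unpack_from("<Q", buf, 6)

    return {
        "has_content_size": has_content_size,
        "has_content_checksum": has_content_checksum,
        "content_size": content_size,
    }
```

A reader could use the reported content size, when present, to allocate the 
output buffer up front, and otherwise fall back to the size recorded in the 
column metadata.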

> Use LZ4 frame format
> --------------------
>
>                 Key: PARQUET-1241
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1241
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-cpp, parquet-format
>            Reporter: Lawrence Chan
>            Priority: Major
>
> The parquet-format spec doesn't currently specify whether lz4-compressed data 
> should be framed or not. We should choose one and make it explicit in the 
> spec, as they are not inter-operable. After some discussions with others [1], 
> we think it would be beneficial to use the framed format, which adds a small 
> header in exchange for more self-contained decompression as well as a richer 
> feature set (checksums, parallel decompression, etc).
> The current arrow implementation compresses using the lz4 block format, and 
> this would need to be updated when we add the spec clarification.
> If backwards compatibility is a concern, I would suggest adding an additional 
> LZ4_FRAMED compression type, but that may be more noise than anything.
> [1] https://github.com/dask/fastparquet/issues/314



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
