[ https://issues.apache.org/jira/browse/PARQUET-1241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16643666#comment-16643666 ]
Antoine Pitrou commented on PARQUET-1241:
-----------------------------------------
For the record, in https://github.com/apache/arrow/pull/2696 I'm adding
streaming compression and decompression APIs to Arrow. For LZ4, I had to use
the framed format as it's the only simple way to stream-compress LZ4. Right
now, this means the streaming and one-shot APIs are incompatible for LZ4.
Making one-shot compression also use the framed format would be nicer :)
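
To make the incompatibility concrete: an LZ4 frame starts with the 4-byte
magic number 0x184D2204 (stored little-endian), whereas a raw LZ4 block has no
header at all. Below is a minimal sketch of how a reader could tell the two
apart. The name `looks_like_lz4_frame` is hypothetical (it is not an Arrow or
Parquet API), and the check is only a heuristic, since a raw block could
happen to begin with the same four bytes.

```python
import struct

# Magic number of the LZ4 frame format; it is stored little-endian on the wire.
LZ4_FRAME_MAGIC = 0x184D2204


def looks_like_lz4_frame(buf: bytes) -> bool:
    """Heuristic: return True if buf starts with the LZ4 frame magic number.

    Hypothetical helper for illustration only. Raw LZ4 blocks carry no
    header, so False does not prove the data is a valid block, and True
    could in principle be a coincidence.
    """
    if len(buf) < 4:
        return False
    (magic,) = struct.unpack_from("<I", buf, 0)
    return magic == LZ4_FRAME_MAGIC
```

A decoder that has to accept both layouts could branch on this check, but an
explicit choice in the spec (or a separate compression type) avoids guessing.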
> Use LZ4 frame format
> --------------------
>
> Key: PARQUET-1241
> URL: https://issues.apache.org/jira/browse/PARQUET-1241
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-cpp, parquet-format
> Reporter: Lawrence Chan
> Priority: Major
>
> The parquet-format spec doesn't currently specify whether LZ4-compressed data
> should use the frame format or the raw block format. We should choose one and
> make it explicit in the spec, as the two are not interoperable. After some
> discussions with others [1], we think it would be beneficial to use the frame
> format, which adds a small header in exchange for more self-contained
> decompression as well as a richer feature set (checksums, parallel
> decompression, etc.).
> The current arrow implementation compresses using the lz4 block format, and
> this would need to be updated when we add the spec clarification.
> If backwards compatibility is a concern, I would suggest adding an additional
> LZ4_FRAMED compression type, but that may be more noise than anything.
> [1] https://github.com/dask/fastparquet/issues/314
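
The "small header" and "richer feature set" mentioned in the description can
be illustrated by decoding the first bytes of a frame. The sketch below
follows my reading of the LZ4 frame format: a 4-byte little-endian magic
number, then a FLG byte and a BD byte. The names `FrameFlags` and
`parse_frame_flags` are hypothetical, and the sketch does not validate the
header checksum or read the optional content-size and dictionary-ID fields.

```python
import struct
from typing import NamedTuple

LZ4_FRAME_MAGIC = 0x184D2204

# BD byte, bits 6-4: maximum block size code -> size in bytes.
_BLOCK_MAX_SIZES = {4: 64 * 1024, 5: 256 * 1024, 6: 1024 * 1024, 7: 4 * 1024 * 1024}


class FrameFlags(NamedTuple):
    version: int
    block_independence: bool  # independent blocks can be decoded in parallel
    block_checksum: bool
    content_size: bool
    content_checksum: bool
    dict_id: bool
    block_max_size: int  # in bytes


def parse_frame_flags(buf: bytes) -> FrameFlags:
    """Decode the FLG and BD bytes at the start of an LZ4 frame (sketch only)."""
    if len(buf) < 6:
        raise ValueError("need at least 6 bytes: magic number + FLG + BD")
    magic, flg, bd = struct.unpack_from("<IBB", buf, 0)
    if magic != LZ4_FRAME_MAGIC:
        raise ValueError("not an LZ4 frame (magic number mismatch)")
    size_code = (bd >> 4) & 0x7
    if size_code not in _BLOCK_MAX_SIZES:
        raise ValueError(f"unsupported block max size code: {size_code}")
    return FrameFlags(
        version=(flg >> 6) & 0x3,
        block_independence=bool(flg & 0x20),
        block_checksum=bool(flg & 0x10),
        content_size=bool(flg & 0x08),
        content_checksum=bool(flg & 0x04),
        dict_id=bool(flg & 0x01),
        block_max_size=_BLOCK_MAX_SIZES[size_code],
    )
```

For example, a header beginning with the bytes 04 22 4D 18 64 40 decodes to
version 1, independent blocks, a content checksum, and a 64 KiB maximum block
size. None of this metadata exists in the raw block format, which is why a
block-format reader cannot consume framed data and vice versa.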