Lawrence Chan created PARQUET-1241:
--------------------------------------
Summary: Use LZ4 frame format
Key: PARQUET-1241
URL: https://issues.apache.org/jira/browse/PARQUET-1241
Project: Parquet
Issue Type: Improvement
Components: parquet-cpp, parquet-format
Reporter: Lawrence Chan
The parquet-format spec doesn't currently specify whether LZ4-compressed data
should use the LZ4 frame format or the raw block format. We should choose one
and make it explicit in the spec, as the two are not interoperable. After some
discussion with others [1], we think it would be beneficial to use the frame
format, which adds a small header in exchange for more self-contained
decompression as well as a richer feature set (checksums, parallel
decompression, etc.).
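
To make the difference concrete, here is a minimal sketch of how a reader could
recognize the frame header. The magic number and the FLG bit layout come from
the LZ4 frame format specification; the function name
describe_lz4_frame_header is hypothetical and is not part of parquet-cpp or
Arrow. It does not validate the header checksum byte, so treat it as a
heuristic only.

```python
import struct

# Every LZ4 frame starts with this magic number, stored little-endian
# (bytes 04 22 4D 18 on disk). Raw LZ4 blocks carry no such marker.
LZ4_FRAME_MAGIC = 0x184D2204

# Smallest possible frame header: magic (4) + FLG (1) + BD (1) + HC (1).
MIN_FRAME_HEADER_LEN = 7


def describe_lz4_frame_header(buf: bytes):
    """Hypothetical helper: return the frame flags, or None if buf does not
    start with something that looks like an LZ4 frame header."""
    if len(buf) < MIN_FRAME_HEADER_LEN:
        return None
    (magic,) = struct.unpack_from("<I", buf, 0)
    if magic != LZ4_FRAME_MAGIC:
        return None
    flg = buf[4]
    if (flg >> 6) != 0b01:  # the two version bits must be 01
        return None
    # Note: the header checksum byte (HC) is not verified in this sketch.
    return {
        "block_independence": bool(flg & 0x20),
        "block_checksum": bool(flg & 0x10),
        "content_size": bool(flg & 0x08),
        "content_checksum": bool(flg & 0x04),
        "dict_id": bool(flg & 0x01),
    }


if __name__ == "__main__":
    sample = bytes([0x04, 0x22, 0x4D, 0x18, 0x64, 0x40, 0xA7])
    print(describe_lz4_frame_header(sample))
```

A raw block has no header at all, so a reader that expects one format cannot
safely consume the other. A block could even begin with the same four bytes by
coincidence, which is why the spec needs to state the format rather than leave
readers to guess.
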
The current Arrow implementation compresses using the LZ4 block format, and
this would need to be updated when we add the spec clarification.
If backwards compatibility is a concern, I would suggest adding an additional
LZ4_FRAMED compression type, but that may be more noise than anything.
[1] https://github.com/dask/fastparquet/issues/314