[
https://issues.apache.org/jira/browse/PARQUET-1241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16585288#comment-16585288
]
Jonathan Underwood commented on PARQUET-1241:
---------------------------------------------
I think there's a danger of misunderstanding here, as a few different things
are being discussed. There are three possible implementations:
# The current arrow implementation which uses the LZ4 +block+ storage format
for column data, with the size of the +_uncompressed_+ data stored as external
column metadata.
# The current Hadoop LZ4 implementation which uses the LZ4 +block+ storage
format with the sizes of the +_compressed_+ and +_uncompressed_+ data
prepended to the column data as 8 bytes of extra data.
# A proposed new storage format which would use the LZ4 +frame+ format to
store the compressed column data. The frame format allows the size of the
uncompressed data to be stored inside the compressed frame as frame
metadata, but does not require it; a decision would be needed as to whether
the LZ4 frame compressed column data should include the size of the
uncompressed data, the checksum of the data, etc. Those extra pieces of data
are currently already stored as column metadata, so this would duplicate
information, though harmlessly.
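To make the difference concrete, here is a minimal stdlib-only sketch of the layout described in option 2: an 8-byte header (two 4-byte integers) prepended to the raw LZ4 block. The big-endian byte order and the field order (uncompressed size first, then compressed size) are assumptions for illustration, and the LZ4 payload is treated as opaque bytes; an LZ4 frame (option 3) would instead carry the optional content size and checksum inside the frame itself.

```python
import struct

def wrap_hadoop_block(compressed: bytes, uncompressed_size: int) -> bytes:
    """Prepend the assumed 8-byte Hadoop-style header to an LZ4 block:
    4-byte big-endian uncompressed size, then 4-byte big-endian
    compressed size, then the block bytes themselves."""
    return struct.pack(">II", uncompressed_size, len(compressed)) + compressed

def unwrap_hadoop_block(data: bytes):
    """Split the 8-byte header off and return (uncompressed_size, payload)."""
    uncompressed_size, compressed_size = struct.unpack(">II", data[:8])
    return uncompressed_size, data[8:8 + compressed_size]

# Stand-in bytes for an LZ4 block; a real block would come from an LZ4 encoder.
payload = b"\x01\x02\x03"
framed = wrap_hadoop_block(payload, 100)
assert unwrap_hadoop_block(framed) == (100, payload)
```

The point of the sketch is that a decoder for option 1 or option 3 cannot consume this byte stream as-is: it would misread the 8-byte prefix as compressed data, which is why the formats are not interoperable.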
> Use LZ4 frame format
> --------------------
>
> Key: PARQUET-1241
> URL: https://issues.apache.org/jira/browse/PARQUET-1241
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-cpp, parquet-format
> Reporter: Lawrence Chan
> Priority: Major
>
> The parquet-format spec doesn't currently specify whether lz4-compressed data
> should be framed or not. We should choose one and make it explicit in the
> spec, as they are not inter-operable. After some discussions with others [1],
> we think it would be beneficial to use the framed format, which adds a small
> header in exchange for more self-contained decompression as well as a richer
> feature set (checksums, parallel decompression, etc).
> The current arrow implementation compresses using the lz4 block format, and
> this would need to be updated when we add the spec clarification.
> If backwards compatibility is a concern, I would suggest adding an additional
> LZ4_FRAMED compression type, but that may be more noise than anything.
> [1] https://github.com/dask/fastparquet/issues/314
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)