[ 
https://issues.apache.org/jira/browse/PARQUET-1241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16965359#comment-16965359
 ] 

Martin Radev commented on PARQUET-1241:
---------------------------------------

> Can you spell out the reasoning in more detail?
 Yes.
 The magic number is: 0x184D2204U 
([https://android.googlesource.com/platform/external/lz4/+/HEAD/doc/lz4_Frame_format.md]
 , [https://github.com/lz4/lz4/blob/dev/lib/lz4frame.c#L210] )
 From the block format spec, I quote:
 "An LZ4 compressed block is composed of sequences."
 "Each sequence starts with a token ... one byte ... separated into two 4-bits 
fields"
 "The first field uses the 4 high-bits of the token. It provides the length of 
literals to follow.." (the spec goes on to explain how larger values are 
encoded, which is not relevant here; 0x0 means no literals, 0x1 means one 
literal)
 "Following token and optional length bytes, are the literals themselves. "
 "Following the literals is the match copy operation.", "It starts by the 
offset. This is a 2 bytes value,", "The offset represents the position of the 
match to be copied from. 1 means “current position - 1 byte”. "

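So, let's suppose a raw block starts with the magic number 0x184D2204U. The 
magic number is stored little-endian, so the bytes are 0x04 0x22 0x4D 0x18.
The first byte, 0x04, is the token. Its high nibble, which gives the length of 
literals, is 0, so there are no literals and the next two bytes are directly 
the offset.
The offset is also little-endian, so it is 0x4D22. Since we are still at 
position 0, we cannot copy from 0 - 0x4D22. Thus, this would be an invalid 
block encoding and the decompressor should catch the underflow.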
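
To make the argument concrete, here is a minimal Python sketch that decodes the first sequence of a would-be LZ4 block and checks whether its match offset can be valid. The function names (first_sequence, is_valid_first_sequence) are made up for illustration and are not part of any LZ4 or Parquet API. It assumes the sequence contains a match, which is not true of the last sequence in a real block.

```python
import struct

LZ4_FRAME_MAGIC = 0x184D2204  # value from the LZ4 frame format spec


def first_sequence(data: bytes):
    """Return (literal_length, match_offset) of the first sequence.

    Hypothetical helper: assumes the sequence is followed by a match.
    """
    token = data[0]
    literal_len = token >> 4  # high nibble of the token
    pos = 1
    if literal_len == 15:
        # 15 means more length bytes follow; add each until one is < 255.
        while True:
            extra = data[pos]
            pos += 1
            literal_len += extra
            if extra != 255:
                break
    pos += literal_len  # skip over the literals themselves
    (offset,) = struct.unpack_from("<H", data, pos)  # 2 bytes, little-endian
    return literal_len, offset


def is_valid_first_sequence(data: bytes) -> bool:
    """At this point the output holds only the literals just emitted,
    so the match may not reach back further than that (and 0 is invalid)."""
    literal_len, offset = first_sequence(data)
    return 0 < offset <= literal_len


# The frame magic number as it appears on disk (little-endian).
magic_bytes = struct.pack("<I", LZ4_FRAME_MAGIC)

print(magic_bytes.hex())                     # 04224d18
print(first_sequence(magic_bytes))           # (0, 19746), i.e. offset 0x4D22
print(is_valid_first_sequence(magic_bytes))  # False
```

A real decompressor does this check as part of its normal bounds checking. The sketch only shows why data starting with the frame magic number cannot begin a valid raw block.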

I hope I'm interpreting the spec correctly.

> [C++] Use LZ4 frame format
> --------------------------
>
>                 Key: PARQUET-1241
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1241
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-cpp, parquet-format
>            Reporter: Lawrence Chan
>            Priority: Major
>
> The parquet-format spec doesn't currently specify whether lz4-compressed data 
> should be framed or not. We should choose one and make it explicit in the 
> spec, as they are not inter-operable. After some discussions with others [1], 
> we think it would be beneficial to use the framed format, which adds a small 
> header in exchange for more self-contained decompression as well as a richer 
> feature set (checksums, parallel decompression, etc).
> The current arrow implementation compresses using the lz4 block format, and 
> this would need to be updated when we add the spec clarification.
> If backwards compatibility is a concern, I would suggest adding an additional 
> LZ4_FRAMED compression type, but that may be more noise than anything.
> [1] https://github.com/dask/fastparquet/issues/314



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
