marioloko commented on issue #2988:
URL: https://github.com/apache/arrow-rs/issues/2988#issuecomment-1297750539

   *Short answer*: Yes, it is possible for data to silently be decoded by the wrong codec, but it is unlikely. So another option would be to only allow `LZ4_HADOOP` and only allow the fallback when the `backward_compatible_lz4` feature is enabled.
   
   *Long answer*: After this change we will only generate `LZ4_HADOOP` for the LZ4 `CompressionCodec`, so the correct implementation is tried first and, after this commit, it will also be the most likely one.
   
   So the files, arranged by how likely this library is to encounter them, will be:
   1. LZ4_HADOOP
   2. LZ4_FRAME (if files generated by older version)
   3. LZ4_RAW (if interoperating with older parquet C++ version)
   
   Most of the time we therefore expect the files to be of type `LZ4_HADOOP`. These files have the following format:
   - 4 big-endian bytes representing the decompressed block size
   - 4 big-endian bytes representing the compressed block size
   - The compressed bytes
   
   A Rust implementation can be found in the pola-rs crate, in the [compression.rs](https://pola-rs.github.io/polars/src/parquet2/compression.rs.html#225) file.
   
   For this format we can easily detect a decoding error: if the data was not actually written with this layout, it is very unlikely that the length prefixes match the input and output lengths for every block.
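   For illustration, here is a minimal sketch, assuming nothing about the arrow-rs internals, of how the `LZ4_HADOOP` prefixes could be walked and sanity-checked before each block is handed to a raw LZ4 block decoder. The function name `split_hadoop_lz4_blocks` and the string error type are invented for this example; only the byte layout described above is taken from the format.
   
   ```rust
   /// Walk the Hadoop LZ4 framing, where each block is laid out as
   ///   [4-byte BE decompressed length][4-byte BE compressed length][compressed bytes]
   /// and return (decompressed_len, compressed_slice) pairs, or an error if the
   /// prefixes are inconsistent with the buffer. Hypothetical helper, not arrow-rs code.
   fn split_hadoop_lz4_blocks(mut input: &[u8]) -> Result<Vec<(usize, &[u8])>, String> {
       let mut blocks = Vec::new();
       while !input.is_empty() {
           if input.len() < 8 {
               return Err("truncated Hadoop LZ4 block header".to_string());
           }
           let decompressed_len = u32::from_be_bytes(input[0..4].try_into().unwrap()) as usize;
           let compressed_len = u32::from_be_bytes(input[4..8].try_into().unwrap()) as usize;
           input = &input[8..];
           if compressed_len > input.len() {
               // A length prefix that overruns the buffer is a cheap signal that the
               // data is not Hadoop-framed, so the caller can fall back early.
               return Err("compressed length overruns input".to_string());
           }
           blocks.push((decompressed_len, &input[..compressed_len]));
           input = &input[compressed_len..];
       }
       Ok(blocks)
   }
   ```
   
   Each compressed slice would then go through a raw LZ4 block decoder, and the result would be rejected if its length does not equal `decompressed_len`, which is exactly the integrity check mentioned above.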
   
   The second most likely is `LZ4_FRAME`, for files generated by previous versions of this library.
   These files have the following format:
   - A magic number at the beginning of the frame.
   - A frame descriptor.
   - Data blocks with checksums.
   - An end marker.
   
   Being decoded by this codec when the data is not actually `LZ4_FRAME` is really, really unlikely, as the data would have to match the magic number and the checksum of each block.
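   As a rough illustration of why a false positive is so unlikely here, the frame codec can already reject foreign data from the fixed magic number at the start of the frame (`0x184D2204`, stored little-endian), before it even reaches the frame descriptor and the per-block checksums. A minimal pre-check, with an invented helper name, could look like this:
   
   ```rust
   /// LZ4 frame magic number, stored little-endian at the start of every frame.
   const LZ4_FRAME_MAGIC: u32 = 0x184D2204;
   
   /// Cheap pre-check before attempting a full frame decode.
   /// Hypothetical helper for illustration, not arrow-rs code.
   fn looks_like_lz4_frame(input: &[u8]) -> bool {
       input.len() >= 4 && u32::from_le_bytes(input[0..4].try_into().unwrap()) == LZ4_FRAME_MAGIC
   }
   ```
   
   Data produced by `LZ4_HADOOP` or `LZ4_RAW` would have to start with those exact four bytes, and then also pass the descriptor and checksum validation, to be accepted by mistake.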
   
   That leaves the least likely option, `LZ4_RAW`, which is probably the most problematic.
   To ensure that the data is not corrupted, the decoder performs the following checks:
   1. The last sequence contains only literals; the block ends after the literals.
   2. The last five bytes of input are only literals, so the last sequence contains at least five bytes.
   3. The last match must start at least 12 bytes before the end of the block.
   
   Even though foreign data is unlikely to decompress successfully with this algorithm, it is the one with the most relaxed rules. However, we know the uncompressed size, so the output must fill exactly that length, which gives us an additional integrity check.
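   Putting the three together, the fallback order could be expressed roughly as below. This is only a sketch: `try_decompress_hadoop`, `try_decompress_frame` and `try_decompress_raw` are stub placeholders for the three codec attempts (so the example compiles), and none of these names come from arrow-rs.
   
   ```rust
   /// Hypothetical error type for the sketch.
   type DecompressError = String;
   
   // Placeholder decoders: in real code these would call actual LZ4 implementations.
   // They are stubs here so that the sketch is self-contained.
   fn try_decompress_hadoop(_input: &[u8], _size: usize) -> Result<Vec<u8>, DecompressError> {
       Err("hadoop decode not implemented in this sketch".to_string())
   }
   fn try_decompress_frame(_input: &[u8], _size: usize) -> Result<Vec<u8>, DecompressError> {
       Err("frame decode not implemented in this sketch".to_string())
   }
   fn try_decompress_raw(_input: &[u8], _size: usize) -> Result<Vec<u8>, DecompressError> {
       Err("raw block decode not implemented in this sketch".to_string())
   }
   
   /// Try the codecs in order of likelihood, using the known uncompressed size
   /// as the final integrity check for the raw-block fallback.
   fn decompress_lz4_page(
       input: &[u8],
       uncompressed_size: usize,
   ) -> Result<Vec<u8>, DecompressError> {
       try_decompress_hadoop(input, uncompressed_size)
           .or_else(|_| try_decompress_frame(input, uncompressed_size))
           .or_else(|_| {
               let out = try_decompress_raw(input, uncompressed_size)?;
               // LZ4_RAW has the loosest framing, so insist that the output
               // fills the expected size exactly before accepting it.
               if out.len() == uncompressed_size {
                   Ok(out)
               } else {
                   Err("raw LZ4 output length mismatch".to_string())
               }
           })
   }
   ```
   
   Under the option mentioned in the short answer, disabling `backward_compatible_lz4` would simply stop the chain after the first attempt and surface the error.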
   
   Parquet C++ has this fallback to `LZ4_RAW`, and there is [this discussion](https://github.com/apache/arrow/pull/7789) about it. In that discussion they mention that `LZ4_RAW` tends to fail quickly, as soon as the first sequence read fails.
   
   [This 
comment](https://github.com/apache/arrow/pull/7789#issuecomment-660341488) has 
the same concern you have.

