marioloko commented on issue #2988: URL: https://github.com/apache/arrow-rs/issues/2988#issuecomment-1297750539
*Short answer*: Yes, it is probably possible for data to silently decode under the wrong codec, but it is unlikely. So another option would be to only allow `LZ4_HADOOP`, and only allow the fallback when a `backward_compatible_lz4` feature is enabled.

*Long answer*: After this change we will only generate `LZ4_HADOOP` for the LZ4 `CompressionCodec`, so the correct implementation is tried first and, after this commit, will also be the most likely one. Arranged by likelihood of being encountered by this library, the files will be:

1. `LZ4_HADOOP`
2. `LZ4_FRAME` (if the files were generated by an older version of this library)
3. `LZ4_RAW` (if interoperating with an older parquet C++ version)

So most of the time we expect files of type `LZ4_HADOOP`. These files have the following format:

- 4 big-endian bytes representing the decompressed block size
- 4 big-endian bytes representing the compressed block size
- The compressed bytes

A Rust implementation can be found in the pola-rs `parquet2` crate, in the [compression.rs](https://pola-rs.github.io/polars/src/parquet2/compression.rs.html#225) file. For this format we can easily detect a decoding error, because for data that is not actually `LZ4_HADOOP` it is unlikely that the consumed input and produced output lengths match the size prefixes for every block.

Second in likelihood is `LZ4_FRAME`, for files generated by previous versions of this library. These files have the following format:

- A magic number at the beginning of the frame
- A frame descriptor
- Data blocks with checksums
- An end marker

For data that is not in `LZ4_FRAME` format to be accepted by this codec is really, really unlikely, as it would have to match both the magic number and the checksum of each block.

That leaves the least likely format, `LZ4_RAW`, which is probably the most problematic one. To ensure that the data is not corrupted, the decoder performs the following checks:

1. The last sequence contains only literals; the block ends after those literals.
2. The last five bytes of input are only literals, so the last sequence contains at least five bytes.
3. The last match must start at least 12 bytes before the end of the block.

Even though arbitrary data is unlikely to decompress successfully under this algorithm, it is the one with the most relaxed rules. However, we know the uncompressed size, so the output must fill that exact length, which gives us an additional integrity check.

Parquet C++ has this same fallback to `LZ4_RAW`; see [this discussion](https://github.com/apache/arrow/pull/7789). There they mention that wrong-format data decoded as `LZ4_RAW` tends to fail quickly, usually on the first sequence read. [This comment](https://github.com/apache/arrow/pull/7789#issuecomment-660341488) raises the same concern you have.
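To make the fallback order concrete, here is a minimal sketch of this try-in-order decoding. It is only an illustration, not the arrow-rs implementation: it assumes the pure-Rust `lz4_flex` crate (with its `frame` feature enabled), and the function names and `Option`-based error handling are mine.

```rust
use std::io::Read;

use lz4_flex::block::decompress_into;
use lz4_flex::frame::FrameDecoder;

/// LZ4_HADOOP: a sequence of blocks, each prefixed by two big-endian u32s
/// (decompressed size, compressed size). The prefixes double as an integrity
/// check: every block must decompress to exactly the advertised length and
/// the input must be consumed exactly.
fn try_hadoop(mut input: &[u8], output: &mut [u8]) -> Option<usize> {
    const PREFIX_LEN: usize = 8;
    let mut written = 0;
    while input.len() >= PREFIX_LEN {
        let decompressed = u32::from_be_bytes(input[0..4].try_into().ok()?) as usize;
        let compressed = u32::from_be_bytes(input[4..8].try_into().ok()?) as usize;
        input = &input[PREFIX_LEN..];
        if compressed > input.len() || written + decompressed > output.len() {
            return None;
        }
        let out = &mut output[written..written + decompressed];
        // A size mismatch here means this is almost certainly not LZ4_HADOOP.
        if decompress_into(&input[..compressed], out).ok()? != decompressed {
            return None;
        }
        written += decompressed;
        input = &input[compressed..];
    }
    // All input must be consumed and the output filled exactly.
    (input.is_empty() && written == output.len()).then_some(written)
}

/// LZ4_FRAME: the 0x184D2204 magic number and the per-block checksums make
/// a false positive on non-frame data extremely unlikely.
fn try_frame(input: &[u8], output: &mut [u8]) -> Option<usize> {
    FrameDecoder::new(input).read_exact(output).ok()?;
    Some(output.len())
}

/// LZ4_RAW: a bare block with no header. The only integrity check left is
/// that the output must fill the known uncompressed size exactly.
fn try_raw(input: &[u8], output: &mut [u8]) -> Option<usize> {
    match decompress_into(input, output) {
        Ok(n) if n == output.len() => Some(n),
        _ => None,
    }
}

/// Try the codecs in order of likelihood for files seen by this library.
pub fn decompress_lz4(input: &[u8], output: &mut [u8]) -> Option<usize> {
    try_hadoop(input, output)
        .or_else(|| try_frame(input, output))
        .or_else(|| try_raw(input, output))
}
```

The design point is that each step carries its own cheap integrity check: the size prefixes for `LZ4_HADOOP`, the magic number and block checksums for `LZ4_FRAME`, and the known exact uncompressed length for `LZ4_RAW`.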
