chairmank commented on issue #3491:
URL: https://github.com/apache/arrow/issues/3491#issuecomment-646015745


   I believe that 
[PARQUET-1241](https://issues.apache.org/jira/browse/PARQUET-1241) ("[C++] Use 
LZ4 frame format") does not directly address the issue that was reported here, 
although there is relevant discussion in the comments (like 
[this](https://issues.apache.org/jira/browse/PARQUET-1241?focusedCommentId=16574328#comment-16574328)
 and 
[this](https://issues.apache.org/jira/browse/PARQUET-1241?focusedCommentId=16585288#comment-16585288)).
   
   The stack trace in the bug report shows an exception thrown by the 
[Spark](https://github.com/apache/spark) class 
`org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader`, 
which uses the [parquet-mr](https://github.com/apache/parquet-mr) class 
`org.apache.parquet.hadoop.ParquetFileReader`, which uses the 
[Hadoop](https://github.com/apache/hadoop) 
`org.apache.hadoop.io.compress.Lz4Codec` class.
   
   As discussed in 
[HADOOP-12990](https://issues.apache.org/jira/browse/HADOOP-12990), the Hadoop 
`Lz4Codec` uses the lz4 block format, and it prepends 8 extra bytes (two 4-byte 
big-endian integers: the uncompressed length followed by the compressed length) 
before the compressed data. I believe that the lz4 implementation used by 
`pyarrow.parquet` also uses the lz4 block format, but it does not prepend these 
8 extra bytes. 
Reconciling this incompatibility does not require implementing the framed 
format.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
