chairmank commented on issue #3491:
URL: https://github.com/apache/arrow/issues/3491#issuecomment-646015745


   I believe that 
[PARQUET-1241](https://issues.apache.org/jira/browse/PARQUET-1241) ("[C++] Use 
LZ4 frame format") does not directly address the issue that was reported here, 
although there is relevant discussion in the comments (like 
[this](https://issues.apache.org/jira/browse/PARQUET-1241?focusedCommentId=16574328#comment-16574328)
 and 
[this](https://issues.apache.org/jira/browse/PARQUET-1241?focusedCommentId=16585288#comment-16585288)).
   
   The stack trace in the bug report shows an exception thrown by the 
[Spark](https://github.com/apache/spark) class 
`org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader`, 
which uses the [parquet-mr](https://github.com/apache/parquet-mr) class 
`org.apache.parquet.hadoop.ParquetFileReader`, which uses the 
[Hadoop](https://github.com/apache/hadoop) 
`org.apache.hadoop.io.compress.Lz4Codec` class.
   
   As discussed in 
[HADOOP-12990](https://issues.apache.org/jira/browse/HADOOP-12990), the Hadoop 
`Lz4Codec` uses the lz4 block format, and it prepends 8 extra bytes (two 4-byte 
big-endian integers: the uncompressed length followed by the compressed length) 
before the compressed data. I believe that the lz4 implementation used by 
`pyarrow.parquet` also uses the lz4 block format, but it does not prepend these 
8 extra bytes. 
Reconciling this incompatibility does not require implementing the framed 
format.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
