[
https://issues.apache.org/jira/browse/ARROW-6057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Wes McKinney closed ARROW-6057.
-------------------------------
Resolution: Cannot Reproduce
Can't reproduce. If you can provide instructions to reproduce, someone can take a look.
> [Python] Parquet files v2.0 created by spark can't be read by pyarrow
> ---------------------------------------------------------------------
>
> Key: ARROW-6057
> URL: https://issues.apache.org/jira/browse/ARROW-6057
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++
> Affects Versions: 0.14.1
> Reporter: Vladyslav Shamaida
> Priority: Major
> Labels: parquet
>
> PyArrow uses the footer metadata to determine the format version of a Parquet
> file, while the parquet-mr library (used by Spark) determines the version at
> the page level from the page header type. Moreover, in ParquetFileWriter,
> parquet-mr hardcodes the footer version to '1'. See:
> [https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter.java#L913]
> Thus, Spark can read the files it writes, and pyarrow can read the files it
> writes, but when pyarrow tries to read a version 2.0 file written by Spark,
> it throws an error about a malformed file (because it assumes the format
> version is 1.0).
> Depending on the compression method, the error is one of:
> - _Corrupt snappy compressed data_
> - _GZipCodec failed: incorrect header check_
> - _ArrowIOError: Unknown encoding type_
--
This message was sent by Atlassian Jira
(v8.3.4#803005)