[
https://issues.apache.org/jira/browse/ARROW-6057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Wes McKinney closed ARROW-6057.
-------------------------------
Resolution: Cannot Reproduce
Can't reproduce. If you can provide instructions to reproduce, someone can take a look.
> [Python] Parquet files v2.0 created by spark can't be read by pyarrow
> ---------------------------------------------------------------------
>
> Key: ARROW-6057
> URL: https://issues.apache.org/jira/browse/ARROW-6057
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++
> Affects Versions: 0.14.1
> Reporter: Vladyslav Shamaida
> Priority: Major
> Labels: parquet
>
> PyArrow uses the footer metadata to determine the format version of a Parquet
> file, while the parquet-mr library (used by Spark) determines the version at
> the page level from the page header type. Moreover, in ParquetFileWriter,
> parquet-mr hardcodes the footer version to '1'. See:
> [https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter.java#L913]
> Thus, Spark can read the files it writes, and pyarrow can read the files it
> writes, but when pyarrow tries to read a version 2.0 file written by Spark,
> it throws an error about a malformed file (because it assumes the format
> version is 1.0).
> Depending on the compression method, the error is one of:
> - _Corrupt snappy compressed data_
> - _GZipCodec failed: incorrect header check_
> - _ArrowIOError: Unknown encoding type_
--
This message was sent by Atlassian Jira
(v8.3.4#803005)