Kevin created ARROW-14422:
-----------------------------

             Summary: parquet saved by pyarrow, cannot be read in Hive
                 Key: ARROW-14422
                 URL: https://issues.apache.org/jira/browse/ARROW-14422
             Project: Apache Arrow
          Issue Type: Bug
            Reporter: Kevin


I have a couple of files (CSV, ...) and am using pandas and pyarrow.Table (0.17)
to save them as Parquet on disk (parquet version 1.4).

Columns:
 id : string
 val : string

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.Table.from_pandas(df)
pq.write_table(table, "df.parquet", version='1.0', flavor='spark',
               write_statistics=True)

However, Hive and Spark do not recognize the parquet version:

{{org.apache.parquet.VersionParser$VersionParseException: Could not parse created_by: parquet-cpp version 1.5.1-SNAPSHOT using format: (.+) version ((.*) )?\(build ?(.*)\)}}
{{ at org.apache.parquet.VersionParser.parse(VersionParser.java:112)}}
{{ at org.apache.parquet.CorruptStatistics.shouldIgnoreStatistics(CorruptStatistics.java:60)}}
{{ at org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics(ParquetMetadataConverter.java:263)}}
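
The pattern quoted in the exception message can be tried directly against the created_by string. This is a minimal sketch in Python, assuming the whole string has to match the pattern (approximated here with re.fullmatch). The helper name parse_created_by and the second sample string are illustrative and are not taken from parquet-mr.

```python
import re

# Pattern copied from the exception message above.
CREATED_BY_FORMAT = r"(.+) version ((.*) )?\(build ?(.*)\)"


def parse_created_by(created_by):
    """Return (application, version, build), or None if the string does not fit."""
    m = re.fullmatch(CREATED_BY_FORMAT, created_by)
    if m is None:
        return None
    return m.group(1), m.group(3), m.group(4)


# The string from the error has no "(build ...)" suffix, so it is rejected.
print(parse_created_by("parquet-cpp version 1.5.1-SNAPSHOT"))

# A made-up string in the expected shape, for comparison only.
print(parse_created_by("parquet-mr version 1.8.1 (build abc123)"))
```

The first call prints None, which corresponds to the VersionParseException in the trace. The second call returns the three captured parts.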

 

It seems related to this issue:

 

It appears you've encountered PARQUET-349, which was fixed in 2015, before Arrow
was even started. The underlying C++ code does allow this {{created_by}} field
to be customized
[source|https://github.com/apache/arrow/blob/4591d76fce2846a29dac33bf01e9ba0337b118e9/cpp/src/parquet/properties.h#L249]
 but the Python wrapper does not expose this
[source|https://github.com/apache/arrow/blob/4591d76fce2846a29dac33bf01e9ba0337b118e9/python/pyarrow/_parquet.pxd#L360].
 

 

It would be nice if pyarrow exposed this option.

 

Stack Overflow question:
[https://stackoverflow.com/questions/69658140/how-to-save-a-parquet-with-pandas-using-same-header-than-hadoop-spark-parquet?noredirect=1#comment123131862_69658140]

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
