[
https://issues.apache.org/jira/browse/ARROW-14422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Kevin updated ARROW-14422:
--------------------------
Description:
I have a couple of files (CSV, ...) and am using pandas and pyarrow (0.17)
to save them as Parquet on disk (parquet version 1.4).
Columns:
id : string
val : string

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# df is a pandas DataFrame with the columns above
table = pa.Table.from_pandas(df)
pq.write_table(table, "df.parquet", version='1.0', flavor='spark',
               write_statistics=True)
However, Hive and Spark do not recognize the Parquet writer version:
{{org.apache.parquet.VersionParser$VersionParseException: Could not parse
created_by: parquet-cpp version 1.5.1-SNAPSHOT using format: (.+) version ((.*)
)?\(build ?(.*)\)}}
{{ at org.apache.parquet.VersionParser.parse(VersionParser.java:112)}}
{{ at
org.apache.parquet.CorruptStatistics.shouldIgnoreStatistics(CorruptStatistics.java:60)}}
{{ at
org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics(ParquetMetadataConverter.java:263)}}
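The failure is easy to reproduce outside the JVM: the pattern printed in the exception expects a {{(build ...)}} suffix that the parquet-cpp writer string does not carry. A small Python sketch, using the regex exactly as it appears in the stack trace (the parquet-mr style string below is illustrative, not taken from a real file):

```python
import re

# Pattern printed by parquet-mr's VersionParser in the exception above.
# Java's Matcher.matches() anchors the whole string, so fullmatch() mirrors it.
pattern = re.compile(r"(.+) version ((.*) )?\(build ?(.*)\)")

# The created_by string written by parquet-cpp has no "(build ...)" part,
# so the parser cannot match it ...
cpp_created_by = "parquet-cpp version 1.5.1-SNAPSHOT"
print(pattern.fullmatch(cpp_created_by))  # None

# ... while a parquet-mr style writer string (illustrative) matches fine.
mr_created_by = "parquet-mr version 1.8.1 (build 4aba4da)"
print(pattern.fullmatch(mr_created_by) is not None)  # True
```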
+*It seems related to this issue:*+
It appears you've encountered PARQUET-349, which was fixed in 2015, before Arrow
was even started. The underlying C++ code does allow this {{created_by}} field
to be customized
[source|https://github.com/apache/arrow/blob/4591d76fce2846a29dac33bf01e9ba0337b118e9/cpp/src/parquet/properties.h#L249],
but the Python wrapper does not expose this
[source|https://github.com/apache/arrow/blob/4591d76fce2846a29dac33bf01e9ba0337b118e9/python/pyarrow/_parquet.pxd#L360].
*+EDIT: Additional info from SO+*
The current Python wrapper does NOT expose the {{created_by}} builder (when
writing Parquet to disk):
[https://github.com/apache/arrow/blob/master/python/pyarrow/_parquet.pxd#L361]
But this is available in the C++ version:
[https://github.com/apache/arrow/blob/4591d76fce2846a29dac33bf01e9ba0337b118e9/cpp/src/parquet/properties.h#L249]
[https://github.com/apache/arrow/blob/master/python/pyarrow/_parquet.pxd#L320]
This creates an issue when the Hadoop Parquet reader reads this pyarrow Parquet
file.
+*SO Question here:*+
[https://stackoverflow.com/questions/69658140/how-to-save-a-parquet-with-pandas-using-same-header-than-hadoop-spark-parquet?noredirect=1#comment123131862_69658140]
> [Python] Allow parquet::WriterProperties::created_by to be set via
> pyarrow.ParquetWriter for compatibility with older parquet-mr
> --------------------------------------------------------------------------------------------------------------------------------
>
> Key: ARROW-14422
> URL: https://issues.apache.org/jira/browse/ARROW-14422
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Reporter: Kevin
> Priority: Major
--
This message was sent by Atlassian Jira
(v8.3.4#803005)