[
https://issues.apache.org/jira/browse/ARROW-14422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17432973#comment-17432973
]
Joris Van den Bossche commented on ARROW-14422:
-----------------------------------------------
In principle, there should be no need to expose this in Python, since you can't
actually influence who is creating the file. Of course, if that value causes
problems in other software, that could be a reason. But then we should perhaps
consider changing that value in C++ instead. In fact, we already did this
recently, in the 4.0 release (ARROW-7830), so updating your pyarrow version
might also fix the issue.
> [Python] Allow parquet::WriterProperties::created_by to be set via
> pyarrow.ParquetWriter for compatibility with older parquet-mr
> --------------------------------------------------------------------------------------------------------------------------------
>
> Key: ARROW-14422
> URL: https://issues.apache.org/jira/browse/ARROW-14422
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Reporter: Kevin
> Priority: Major
>
> I have a couple of files and am using pyarrow (0.17)
> to save them as Parquet on disk (parquet version 1.4).
> Columns:
> id : string
> val : string
> {{table = pa.Table.from_pandas(df)}}
> {{pq.write_table(table, "df.parquet", version='1.0', flavor='spark',
> write_statistics=True)}}
> However, Hive and Spark do not recognize the Parquet version:
> {{org.apache.parquet.VersionParser$VersionParseException: Could not parse
> created_by: parquet-cpp version 1.5.1-SNAPSHOT using format: (.+) version
> ((.*) )?(build ?(.*))}}
> {{    at org.apache.parquet.VersionParser.parse(VersionParser.java:112)}}
> {{    at org.apache.parquet.CorruptStatistics.shouldIgnoreStatistics(CorruptStatistics.java:60)}}
> {{    at org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics(ParquetMetadataConverter.java:263)}}
>
> +*It seems related to this issue:*+
> It appears you've encountered PARQUET-349, which was fixed in 2015, before
> Arrow was even started. The underlying C++ code does allow this
> {{created_by}} field to be customized
> [source|https://github.com/apache/arrow/blob/4591d76fce2846a29dac33bf01e9ba0337b118e9/cpp/src/parquet/properties.h#L249]
> but the Python wrapper does not expose it
> [source|https://github.com/apache/arrow/blob/4591d76fce2846a29dac33bf01e9ba0337b118e9/python/pyarrow/_parquet.pxd#L360].
>
>
> *+EDIT: additional info+*
> The current Python wrapper does NOT expose the {{created_by}} builder (when
> writing Parquet to disk):
> [https://github.com/apache/arrow/blob/master/python/pyarrow/_parquet.pxd#L361]
>
> But this is available in the C++ version:
> [https://github.com/apache/arrow/blob/4591d76fce2846a29dac33bf01e9ba0337b118e9/cpp/src/parquet/properties.h#L249]
> [https://github.com/apache/arrow/blob/master/python/pyarrow/_parquet.pxd#L320]
>
> This creates an issue when the Hadoop Parquet reader reads such a pyarrow
> Parquet file.
>
> +*Stack Overflow question here:*+
>
> [https://stackoverflow.com/questions/69658140/how-to-save-a-parquet-with-pandas-using-same-header-than-hadoop-spark-parquet?noredirect=1#comment123131862_69658140]
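>

The parse failure in the stack trace above can be reproduced outside the JVM. A minimal sketch: the regex is reconstructed from the exception message (the parquet-mr source escapes the "(build ...)" parentheses, so they are literal characters in the created_by string), and the second string below is an invented example of a value that format does accept:

```python
import re

# Format parquet-mr's VersionParser expects, per the exception message above;
# the "(build ...)" parentheses are literal in the created_by string.
pattern = re.compile(r"(.+) version ((.*) )?\(build ?(.*)\)")

old_cpp = "parquet-cpp version 1.5.1-SNAPSHOT"       # what older pyarrow wrote
mr_like = "parquet-mr version 1.8.0 (build abcdef)"  # invented, parser-accepted shape

print(bool(pattern.fullmatch(old_cpp)))  # False -> VersionParseException
print(bool(pattern.fullmatch(mr_like)))  # True
```

The old parquet-cpp string has no "(build ...)" suffix, which is why VersionParser throws when checking for corrupt statistics.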
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)