[
https://issues.apache.org/jira/browse/ARROW-14422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17433398#comment-17433398
]
Micah Kornfield commented on ARROW-14422:
-----------------------------------------
{quote}Maintaining some regression tests
between pyarrow export and “parquet-mr” may be useful
{quote}
Agreed, there were some proposals but it appears no one has had time to devote
to this. I'm not sure it would help in this case since, as Weston pointed out,
the broken parquet version is known and we would likely only test a few versions.
{quote}I agree that adding the word "build" to the C++ created_by string would
be another way to solve this issue. We could change "parquet-cpp-arrow version
6.0.0-SNAPSHOT" to "parquet-cpp-arrow build 6.0.0-SNAPSHOT" but I don't know
how I feel about that either.
{quote}
I'd be more in favor of adding a build string to C++ than exposing the flag
in Python (or, if we do expose the flag in Python, we would at least need to
validate it to see whether it is parseable). In general, I think this is fairly
low level, so I'd be hesitant to expose it in more places. Using the BUILD
field to hold the git SHA hash could be interesting.
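For context, a minimal Python sketch of the parse failure, using the pattern quoted in the stack trace below (the backslashes escaping the literal parentheses around "build" appear to have been stripped by Jira markup, so they are restored here; the {{is_parseable}} helper is illustrative, not an existing API):

```python
import re

# Pattern from org.apache.parquet.VersionParser as quoted in the stack
# trace; the backslashes escaping the literal "(build ...)" parentheses
# were presumably eaten by Jira markup, so they are restored here.
CREATED_BY = re.compile(r"(.+) version ((.*) )?\(build ?(.*)\)")

def is_parseable(created_by: str) -> bool:
    """Check whether a created_by string satisfies the old parquet-mr format."""
    return CREATED_BY.fullmatch(created_by) is not None

# The string written by old parquet-cpp has no "(build ...)" suffix,
# so old parquet-mr raises VersionParseException:
print(is_parseable("parquet-cpp version 1.5.1-SNAPSHOT"))        # False

# A parquet-mr style created_by parses fine:
print(is_parseable("parquet-mr version 1.8.1 (build 4aba4da)"))  # True
```

If the flag is ever exposed in pyarrow, a check along these lines could serve as the validation mentioned above.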
> [Python] Allow parquet::WriterProperties::created_by to be set via
> pyarrow.ParquetWriter for compatibility with older parquet-mr
> --------------------------------------------------------------------------------------------------------------------------------
>
> Key: ARROW-14422
> URL: https://issues.apache.org/jira/browse/ARROW-14422
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Reporter: Kevin
> Priority: Major
>
> I have a couple of files and am using pyarrow.Table (0.17)
> to save them as Parquet on disk (Parquet version 1.4).
> Columns:
> id : string
> val : string
> *table = pa.Table.from_pandas(df)*
> *pq.write_table(table, "df.parquet", version='1.0', flavor='spark',
> write_statistics=True, )*
> However, Hive and Spark do not recognize the Parquet version:
> {{org.apache.parquet.VersionParser$VersionParseException: Could not parse
> created_by: parquet-cpp version 1.5.1-SNAPSHOT using format: (.+) version
> ((.*) )?(build ?(.*))}}
> {{ at org.apache.parquet.VersionParser.parse(VersionParser.java:112)}}
> {{ at
> org.apache.parquet.CorruptStatistics.shouldIgnoreStatistics(CorruptStatistics.java:60)}}
> {{ at
> org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics(ParquetMetadataConverter.java:263)}}
>
> +*It seems related to this issue:*+
> It appears you've encountered PARQUET-349 which was fixed in 2015 before
> Arrow was even started. The underlying C++ code does allow this
> {{created_by}} field to be customized
> [source|https://github.com/apache/arrow/blob/4591d76fce2846a29dac33bf01e9ba0337b118e9/cpp/src/parquet/properties.h#L249]
> but the Python wrapper does not expose it
> [source|https://github.com/apache/arrow/blob/4591d76fce2846a29dac33bf01e9ba0337b118e9/python/pyarrow/_parquet.pxd#L360].
>
>
> *+EDIT: additional info+*
> The current Python wrapper does NOT expose the {{created_by}} builder (when
> writing Parquet to disk):
> [https://github.com/apache/arrow/blob/master/python/pyarrow/_parquet.pxd#L361]
>
> But this is available in the C++ version:
> [https://github.com/apache/arrow/blob/4591d76fce2846a29dac33bf01e9ba0337b118e9/cpp/src/parquet/properties.h#L249]
> [https://github.com/apache/arrow/blob/master/python/pyarrow/_parquet.pxd#L320]
>
> This creates an issue when the Hadoop Parquet reader reads this pyarrow
> Parquet file:
>
>
> +*SO Question here:*+
>
> [https://stackoverflow.com/questions/69658140/how-to-save-a-parquet-with-pandas-using-same-header-than-hadoop-spark-parquet?noredirect=1#comment123131862_69658140]
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)