[ https://issues.apache.org/jira/browse/ARROW-14422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17433398#comment-17433398 ]

Micah Kornfield commented on ARROW-14422:
-----------------------------------------

{quote}Maintaining some regression test between pyarrow export and "parquet-mr" may be useful
{quote}
Agreed, there were some proposals but it appears no one has had time to devote 
to this.  I'm also not sure it would help in this case since, as Weston pointed 
out, the broken parquet-mr version is old and we would likely only test a few 
versions.

 
{quote}I agree that adding the word "build" to the C++ created_by string would 
be another way to solve this issue. We could change "parquet-cpp-arrow version 
6.0.0-SNAPSHOT" to "parquet-cpp-arrow build 6.0.0-SNAPSHOT" but I don't know 
how I feel about that either.
{quote}
I'd be more in favor of adding a build string to C++ than exposing the flag in 
python (or, if we did expose the flag in python, we would at least need to 
validate it to see if it is parseable).  In general, I think this is fairly low 
level, so I'd be hesitant to expose it in more places.  Using the BUILD field 
to hold the git SHA could be interesting.
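For reference, a quick sketch of what "validating the flag to see if it is parseable" could look like, using the regex from the exception quoted below (the helper name here is mine, not an existing API; in the Java source the parentheses around "build" are escaped, which the error message does not show):

```python
import re

# The created_by format that old parquet-mr's VersionParser expects,
# per the VersionParseException in the report below.
PARQUET_MR_FORMAT = re.compile(r"(.+) version ((.*) )?\(build ?(.*)\)")

def is_parseable(created_by: str) -> bool:
    """Rough check that old parquet-mr could parse this created_by string."""
    return PARQUET_MR_FORMAT.match(created_by) is not None

# The string pyarrow writes has no "(build ...)" clause, so old
# parquet-mr raises VersionParseException on it:
print(is_parseable("parquet-cpp version 1.5.1-SNAPSHOT"))       # False

# A parquet-mr-style string with a build clause parses fine:
print(is_parseable("parquet-mr version 1.8.0 (build abc123)"))  # True
```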

> [Python] Allow parquet::WriterProperties::created_by to be set via 
> pyarrow.ParquetWriter for compatibility with older parquet-mr
> --------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-14422
>                 URL: https://issues.apache.org/jira/browse/ARROW-14422
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>            Reporter: Kevin
>            Priority: Major
>
> I have a couple of files and am using pyarrow (0.17) to save them as parquet 
> on disk (parquet version 1.4).
> Columns:
>  id : string
>  val : string
> *table = pa.Table.from_pandas(df)*
>  *pq.write_table(table, "df.parquet", version='1.0', flavor='spark', 
> write_statistics=True)*
> However, Hive and Spark do not recognize the parquet version:
> {{org.apache.parquet.VersionParser$VersionParseException: Could not parse 
> created_by: parquet-cpp version 1.5.1-SNAPSHOT using format: (.+) version 
> ((.*) )?(build ?(.*))}}
>  {{at org.apache.parquet.VersionParser.parse(VersionParser.java:112)}}
>  {{at 
> org.apache.parquet.CorruptStatistics.shouldIgnoreStatistics(CorruptStatistics.java:60)}}
>  {{at 
> org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics(ParquetMetadataConverter.java:263)}}
>  
> +*It seems related to this issue:*+
> It appears you've encountered PARQUET-349 which was fixed in 2015 before 
> Arrow was even started. The underlying C++ code does allow this 
> {{created_by}} field to be customized 
> [source|https://github.com/apache/arrow/blob/4591d76fce2846a29dac33bf01e9ba0337b118e9/cpp/src/parquet/properties.h#L249]
>  but the python wrapper does not expose this 
> [source|https://github.com/apache/arrow/blob/4591d76fce2846a29dac33bf01e9ba0337b118e9/python/pyarrow/_parquet.pxd#L360].
> 
> *+EDIT Add infos+*
> The current python wrapper does NOT expose the {{created_by}} builder (when 
> writing parquet to disk):
> [https://github.com/apache/arrow/blob/master/python/pyarrow/_parquet.pxd#L361]
>  
> But this is available in the C++ version:
> [https://github.com/apache/arrow/blob/4591d76fce2846a29dac33bf01e9ba0337b118e9/cpp/src/parquet/properties.h#L249]
> [https://github.com/apache/arrow/blob/master/python/pyarrow/_parquet.pxd#L320]
>  
> This creates an issue when the Hadoop parquet reader reads this pyarrow 
> parquet file.
>  
> +*SO Question here:*+
>  
> [https://stackoverflow.com/questions/69658140/how-to-save-a-parquet-with-pandas-using-same-header-than-hadoop-spark-parquet?noredirect=1#comment123131862_69658140]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)