[
https://issues.apache.org/jira/browse/ARROW-14422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17432973#comment-17432973
]
Joris Van den Bossche commented on ARROW-14422:
-----------------------------------------------
In principle, there should be no need to expose this in Python, since you can't
actually influence who is creating the file. Of course, if that value causes
problems in other software, that could be a reason. But then we should perhaps
consider changing that value in C++ instead. In fact, we already did this
recently, in the 4.0 release (ARROW-7830), so updating your pyarrow version
might also fix the issue.
> [Python] Allow parquet::WriterProperties::created_by to be set via
> pyarrow.ParquetWriter for compatibility with older parquet-mr
> --------------------------------------------------------------------------------------------------------------------------------
>
> Key: ARROW-14422
> URL: https://issues.apache.org/jira/browse/ARROW-14422
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Reporter: Kevin
> Priority: Major
>
> I have a couple of files and am using pyarrow (0.17)
> to save them as Parquet on disk (parquet version 1.4).
> Columns:
> id : string
> val : string
> {{table = pa.Table.from_pandas(df)}}
> {{pq.write_table(table, "df.parquet", version='1.0', flavor='spark',
> write_statistics=True)}}
> However, Hive and Spark do not recognize the Parquet version:
> {{org.apache.parquet.VersionParser$VersionParseException: Could not parse
> created_by: parquet-cpp version 1.5.1-SNAPSHOT using format: (.+) version
> ((.*) )?(build ?(.*))}}
> {{    at org.apache.parquet.VersionParser.parse(VersionParser.java:112)}}
> {{    at org.apache.parquet.CorruptStatistics.shouldIgnoreStatistics(CorruptStatistics.java:60)}}
> {{    at org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics(ParquetMetadataConverter.java:263)}}
>
> +*It seems related to this issue:*+
> It appears you've encountered PARQUET-349, which was fixed in 2015, before
> Arrow was even started. The underlying C++ code does allow this
> {{created_by}} field to be customized
> [source|https://github.com/apache/arrow/blob/4591d76fce2846a29dac33bf01e9ba0337b118e9/cpp/src/parquet/properties.h#L249]
> but the Python wrapper does not expose it
> [source|https://github.com/apache/arrow/blob/4591d76fce2846a29dac33bf01e9ba0337b118e9/python/pyarrow/_parquet.pxd#L360].
>
>
> *+EDIT: additional info+*
> The current Python wrapper does NOT expose the {{created_by}} builder (when
> writing Parquet to disk):
> [https://github.com/apache/arrow/blob/master/python/pyarrow/_parquet.pxd#L361]
>
> But this is available in the C++ version:
> [https://github.com/apache/arrow/blob/4591d76fce2846a29dac33bf01e9ba0337b118e9/cpp/src/parquet/properties.h#L249]
> [https://github.com/apache/arrow/blob/master/python/pyarrow/_parquet.pxd#L320]
>
> This creates an issue when the Hadoop Parquet reader reads such a pyarrow
> Parquet file.
>
> +*Stack Overflow question here:*+
>
> [https://stackoverflow.com/questions/69658140/how-to-save-a-parquet-with-pandas-using-same-header-than-hadoop-spark-parquet?noredirect=1#comment123131862_69658140]
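>

The parse failure in the stack trace above can be reproduced outside the JVM. A minimal sketch: the regex is reconstructed from the exception message (the parquet-mr source escapes the "(build ...)" parentheses, so they are literal characters in the created_by string), and the second string below is an invented example of a value that format does accept:

```python
import re

# Format parquet-mr's VersionParser expects, per the exception message above;
# the "(build ...)" parentheses are literal in the created_by string.
pattern = re.compile(r"(.+) version ((.*) )?\(build ?(.*)\)")

old_cpp = "parquet-cpp version 1.5.1-SNAPSHOT"       # what older pyarrow wrote
mr_like = "parquet-mr version 1.8.0 (build abcdef)"  # invented, parser-accepted shape

print(bool(pattern.fullmatch(old_cpp)))  # False -> VersionParseException
print(bool(pattern.fullmatch(mr_like)))  # True
```

The old parquet-cpp string has no "(build ...)" suffix, which is why VersionParser throws when checking for corrupt statistics.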
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)