[ https://issues.apache.org/jira/browse/ARROW-14422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kevin updated ARROW-14422:
--------------------------
    Description: 
I have a couple of files (CSV, ...) and am using pandas and pyarrow (0.17) to save them as Parquet on disk (Parquet version 1.4).

Columns:
 id: string
 val: string

{code:python}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({"id": ["1"], "val": ["a"]})  # sample frame matching the columns above
table = pa.Table.from_pandas(df)
pq.write_table(table, "df.parquet", version='1.0', flavor='spark', write_statistics=True)
{code}

However, Hive and Spark do not recognize the Parquet writer version (the {{created_by}} footer string) and fail with:

{code}
org.apache.parquet.VersionParser$VersionParseException: Could not parse created_by: parquet-cpp version 1.5.1-SNAPSHOT using format: (.+) version ((.*) )?(build ?(.*))
    at org.apache.parquet.VersionParser.parse(VersionParser.java:112)
    at org.apache.parquet.CorruptStatistics.shouldIgnoreStatistics(CorruptStatistics.java:60)
    at org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics(ParquetMetadataConverter.java:263)
{code}
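The parse failure can be reproduced outside the JVM. Below is a minimal sketch in Python; the escaped parentheses are an assumption based on how parquet-mr's {{VersionParser}} defines the pattern in its Java source (the exception message prints them unescaped):

{code:python}
import re

# Format string from the VersionParseException above; the "(build ...)" parens
# are backslash-escaped in parquet-mr's Java source.
VERSION_FORMAT = re.compile(r"(.+) version ((.*) )?\(build ?(.*)\)")

# The created_by written by pyarrow 0.17 has no "(build ...)" suffix:
print(VERSION_FORMAT.fullmatch("parquet-cpp version 1.5.1-SNAPSHOT"))
# -> None, which is why VersionParser throws VersionParseException

# A parquet-mr style created_by does match:
print(VERSION_FORMAT.fullmatch("parquet-mr version 1.8.1 (build abc123)"))
# -> <re.Match object ...>
{code}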

 

+*It seems related to this issue:*+

It appears you've encountered PARQUET-349, which was fixed in 2015, before Arrow was even started. The underlying C++ code does allow this {{created_by}} field to be customized ([source|https://github.com/apache/arrow/blob/4591d76fce2846a29dac33bf01e9ba0337b118e9/cpp/src/parquet/properties.h#L249]), but the Python wrapper does not expose it ([source|https://github.com/apache/arrow/blob/4591d76fce2846a29dac33bf01e9ba0337b118e9/python/pyarrow/_parquet.pxd#L360]).
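For illustration, if pyarrow exposed the builder option, a write call might look like the sketch below. This is hypothetical: the {{created_by}} keyword does not exist in any released pyarrow; it only shows how {{parquet::WriterProperties::Builder::created_by()}} could be surfaced.

{code:python}
import pyarrow.parquet as pq

# HYPOTHETICAL sketch -- "created_by" is NOT an existing pyarrow parameter.
pq.write_table(
    table, "df.parquet",          # "table" as built in the snippet above
    version='1.0',
    flavor='spark',
    write_statistics=True,
    created_by="parquet-mr version 1.8.1 (build n/a)",  # parseable by VersionParser
)
{code}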

*+EDIT: additional info+*

The current Python wrapper does NOT expose the {{created_by}} builder setting when writing Parquet to disk:

[https://github.com/apache/arrow/blob/master/python/pyarrow/_parquet.pxd#L361]

 

But this is available in the C++ version:

[https://github.com/apache/arrow/blob/4591d76fce2846a29dac33bf01e9ba0337b118e9/cpp/src/parquet/properties.h#L249]

[https://github.com/apache/arrow/blob/master/python/pyarrow/_parquet.pxd#L320]

 

This creates an issue whenever the Hadoop Parquet reader (parquet-mr) reads such a pyarrow-written Parquet file.
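The offending footer field can be checked directly from Python; {{FileMetaData.created_by}} is part of pyarrow's public API, and with pyarrow 0.17 it is expected to show the unparseable string:

{code:python}
import pyarrow.parquet as pq

# Read back the footer metadata of the file written above.
meta = pq.ParquetFile("df.parquet").metadata
print(meta.created_by)
# e.g. "parquet-cpp version 1.5.1-SNAPSHOT" -- the value parquet-mr fails to parse
{code}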

+*SO Question here:*+
 
[https://stackoverflow.com/questions/69658140/how-to-save-a-parquet-with-pandas-using-same-header-than-hadoop-spark-parquet?noredirect=1#comment123131862_69658140]

 

> [Python] Allow parquet::WriterProperties::created_by to be set via 
> pyarrow.ParquetWriter for compatibility with older parquet-mr
> --------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-14422
>                 URL: https://issues.apache.org/jira/browse/ARROW-14422
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>            Reporter: Kevin
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
