Uwe L. Korn created ARROW-2659:
----------------------------------

             Summary: [Python] More graceful reading of empty String columns in 
ParquetDataset
                 Key: ARROW-2659
                 URL: https://issues.apache.org/jira/browse/ARROW-2659
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.9.0
            Reporter: Uwe L. Korn
             Fix For: 0.11.0


When we currently save a {{ParquetDataset}} from Pandas, we don't get 
consistent schemas, even if the source was a single DataFrame. This happens 
because object columns such as strings can be empty in some partitions, so the 
Arrow schema inferred for those partitions differs. In the central metadata, we 
store this column as {{pa.string}}, whereas in the partition file with the 
empty column it is stored as {{pa.null}}.

The two schemas are still a valid match in terms of schema evolution, and we 
should respect that in 
https://github.com/apache/arrow/blob/79a22074e0b059a24c5cd45713f8d085e24f826a/python/pyarrow/parquet.py#L754
 Instead of calling {{pa.Schema.equals}} in 
https://github.com/apache/arrow/blob/79a22074e0b059a24c5cd45713f8d085e24f826a/python/pyarrow/parquet.py#L778
 we should introduce a new method {{pa.Schema.can_evolve_to}} that is more 
lenient and returns {{True}} if a dataset piece has a null column where the 
main metadata declares a nullable column of any type.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
