Uwe L. Korn created ARROW-2659:
----------------------------------
Summary: [Python] More graceful reading of empty String columns in
ParquetDataset
Key: ARROW-2659
URL: https://issues.apache.org/jira/browse/ARROW-2659
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 0.9.0
Reporter: Uwe L. Korn
Fix For: 0.11.0
Currently, when saving a {{ParquetDataset}} from Pandas, we don't get consistent
schemas, even if the source was a single DataFrame. This happens because object
columns such as strings can be empty in some partitions, in which case the
resulting Arrow schema differs. In the central metadata, we store this
column as {{pa.string}}, whereas in the partition file with the empty column,
it is stored as {{pa.null}}.
The two schemas are still a valid match in terms of schema evolution and we
should respect that in
https://github.com/apache/arrow/blob/79a22074e0b059a24c5cd45713f8d085e24f826a/python/pyarrow/parquet.py#L754
Instead of doing a {{pa.Schema.equals}} in
https://github.com/apache/arrow/blob/79a22074e0b059a24c5cd45713f8d085e24f826a/python/pyarrow/parquet.py#L778
we should introduce a new method {{pa.Schema.can_evolve_to}} that is more
graceful and returns {{True}} if a dataset piece has a null column where the
main metadata states a nullable column of any type.
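
The proposed rule could look roughly like the following library-free sketch. It is only an illustration of the intended semantics, not the pyarrow implementation: {{Field}}, {{Schema}} and the free function {{can_evolve_to}} are hypothetical stand-ins for {{pa.Field}}, {{pa.Schema}} and the suggested method, and types are modelled as plain strings.

```python
from typing import Dict, NamedTuple


class Field(NamedTuple):
    """Hypothetical stand-in for an Arrow field: a type name plus nullability."""
    type: str
    nullable: bool = True


# Hypothetical stand-in for pa.Schema: column name -> field, in column order.
Schema = Dict[str, Field]


def can_evolve_to(piece: Schema, target: Schema) -> bool:
    """Return True if the schema of a dataset piece is compatible with `target`.

    Fields must match exactly, except that a "null"-typed field in the piece
    is accepted wherever the target declares a nullable field of any type.
    """
    # Assumption: both schemas must list the same columns in the same order.
    if list(piece) != list(target):
        return False
    for name, field in piece.items():
        expected = target[name]
        if field == expected:
            continue  # identical field, as a strict equality check would accept
        if field.type == "null" and expected.nullable:
            continue  # empty column in this piece, nullable column in metadata
        return False
    return True


# Central metadata versus a partition whose string column was empty.
metadata = {"id": Field("int64"), "name": Field("string")}
empty_piece = {"id": Field("int64"), "name": Field("null")}
wrong_piece = {"id": Field("int64"), "name": Field("double")}

print(can_evolve_to(empty_piece, metadata))  # null column is tolerated
print(can_evolve_to(wrong_piece, metadata))  # a real type mismatch is not
```

Note that the check is deliberately one-directional: a piece with a null column can evolve to the metadata schema, but a piece with a typed column does not match metadata that declares the column as null.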