[jira] [Commented] (ARROW-2659) [Python] More graceful reading of empty String columns in ParquetDataset

Apache Arrow JIRA Bot (Jira) Thu, 13 Oct 2022 10:52:36 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-2659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17617239#comment-17617239
 ]


Apache Arrow JIRA Bot commented on ARROW-2659:
----------------------------------------------

This issue was last updated over 90 days ago, which may be an indication it is 
no longer being actively worked. To better reflect the current state, the issue 
is being unassigned per [project 
policy|https://arrow.apache.org/docs/dev/developers/bug_reports.html#issue-assignment].
 Please feel free to re-take assignment of the issue if it is being actively 
worked, or if you plan to start that work soon.

> [Python] More graceful reading of empty String columns in ParquetDataset
> ------------------------------------------------------------------------
>
>                 Key: ARROW-2659
>                 URL: https://issues.apache.org/jira/browse/ARROW-2659
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>    Affects Versions: 0.9.0
>            Reporter: Uwe Korn
>            Assignee: Joris Van den Bossche
>            Priority: Major
>              Labels: dataset, dataset-parquet-read, parquet
>         Attachments: read_parquet_dataset.error.read_table.novalidation.txt, 
> read_parquet_dataset.error.read_table.txt
>
>
> When currently saving a {{ParquetDataset}} from Pandas, we don't get 
> consistent schemas, even if the source was a single DataFrame. This is due to 
> the fact that in some partitions object columns like string can become empty. 
> Then the resulting Arrow schema will differ. In the central metadata, we will 
> store this column as {{pa.string}} whereas in the partition file with the 
> empty columns, this columns will be stored as {{pa.null}}.
> The two schemas are still a valid match in terms of schema evolution and we 
> should respect that in 
> https://github.com/apache/arrow/blob/79a22074e0b059a24c5cd45713f8d085e24f826a/python/pyarrow/parquet.py#L754
>  Instead of doing a {{pa.Schema.equals}} in 
> https://github.com/apache/arrow/blob/79a22074e0b059a24c5cd45713f8d085e24f826a/python/pyarrow/parquet.py#L778
>  we should introduce a new method {{pa.Schema.can_evolve_to}} that is more 
> graceful and returns {{True}} if a dataset piece has a null column where the 
> main metadata states a nullable column of any type.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (ARROW-2659) [Python] More graceful reading of empty String columns in ParquetDataset

Reply via email to