[jira] [Commented] (ARROW-2659) [Python] More graceful reading of empty String columns in ParquetDataset

Wes McKinney (JIRA) Wed, 09 Jan 2019 20:58:09 -0800


    [ https://issues.apache.org/jira/browse/ARROW-2659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16739008#comment-16739008 ]


Wes McKinney commented on ARROW-2659:
-------------------------------------

I'm moving this to 0.13 because, unfortunately, I don't think we have the time 
to do this properly for 0.12.

I suggest we implement a couple of different things to help us:

* "Schema-normalized concatenate tables" -- perform safe casts and determine 
the merged schema for a collection of smaller tables, or attempt to safely cast 
tables to a fixed schema. Since null safely casts to any type, this solves the 
problem one way.

* Additionally, implement partitioned writes natively against Arrow tables 
without going through pandas, to avoid the issues in ARROW-2860.
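
To make the first bullet concrete, here is a rough, library-free sketch of 
what "schema-normalized concatenate" could mean. It is only a model under 
stated assumptions: a schema is a plain dict of column name to type name, the 
type name "null" stands for a column with no values, and the helpers 
merge_schemas, normalize and concat_normalized are hypothetical names, not 
pyarrow functions.

```python
# Library-free model: a "schema" maps column name -> type name, and the
# type name "null" stands for a column that held no values in that piece.
# These helpers are hypothetical; they are not pyarrow functions.

NULL = "null"


def merge_schemas(schemas):
    """Return one schema for all pieces, promoting "null" to any concrete type."""
    merged = {}
    for schema in schemas:
        for name, typ in schema.items():
            seen = merged.get(name, NULL)
            if seen == NULL:
                merged[name] = typ  # first sighting, or null -> concrete type
            elif typ != NULL and typ != seen:
                raise TypeError(f"column {name!r}: cannot merge {seen} with {typ}")
    return merged


def normalize(columns, target):
    """Reshape one piece (name -> list of values) to the target schema."""
    n_rows = len(next(iter(columns.values()), []))
    # Columns missing from this piece are filled with nulls (None).
    return {name: list(columns.get(name, [None] * n_rows)) for name in target}


def concat_normalized(pieces):
    """pieces: list of (schema, columns) pairs; returns (schema, columns)."""
    target = merge_schemas(schema for schema, _ in pieces)
    out = {name: [] for name in target}
    for _, columns in pieces:
        for name, values in normalize(columns, target).items():
            out[name].extend(values)
    return target, out


# One piece has real strings; in the other, the string column is all-null.
full = ({"id": "int64", "name": "string"}, {"id": [1, 2], "name": ["a", "b"]})
empty = ({"id": "int64", "name": "null"}, {"id": [3], "name": [None]})
schema, data = concat_normalized([full, empty])
```

In this model the null column takes on the type seen in the other piece, while 
two different concrete types for the same column are rejected rather than 
cast, which is the "safe" part of the proposal.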

> [Python] More graceful reading of empty String columns in ParquetDataset
> ------------------------------------------------------------------------
>
>                 Key: ARROW-2659
>                 URL: https://issues.apache.org/jira/browse/ARROW-2659
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.9.0
>            Reporter: Uwe L. Korn
>            Priority: Major
>              Labels: parquet
>             Fix For: 0.13.0
>
>         Attachments: read_parquet_dataset.error.read_table.novalidation.txt, 
> read_parquet_dataset.error.read_table.txt
>
>
> Currently, when saving a {{ParquetDataset}} from pandas, we don't get 
> consistent schemas, even if the source was a single DataFrame. This is 
> because object columns such as strings can become empty in some partitions, 
> so the resulting Arrow schema differs. In the central metadata, we store 
> this column as {{pa.string}}, whereas in the partition file with the empty 
> column, this column is stored as {{pa.null}}.
> The two schemas are still a valid match in terms of schema evolution, and we 
> should respect that here: 
> https://github.com/apache/arrow/blob/79a22074e0b059a24c5cd45713f8d085e24f826a/python/pyarrow/parquet.py#L754
>  Instead of calling {{pa.Schema.equals}} here: 
> https://github.com/apache/arrow/blob/79a22074e0b059a24c5cd45713f8d085e24f826a/python/pyarrow/parquet.py#L778
>  we should introduce a new method {{pa.Schema.can_evolve_to}} that is more 
> graceful and returns {{True}} if a dataset piece has a null column where the 
> main metadata states a nullable column of any type.
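
For illustration, the rule described in the issue could be modelled roughly as 
below. This is a sketch under assumptions, not an existing pyarrow API: the 
name can_evolve_to comes from the issue text, but the free-function signature, 
the Field tuple and the dict-based schemas are invented here to keep the 
example self-contained.

```python
from typing import NamedTuple


class Field(NamedTuple):
    """Hypothetical stand-in for a schema field (not pyarrow.Field)."""
    type: str              # type name, e.g. "string"; "null" means no values seen
    nullable: bool = True


def can_evolve_to(piece, main):
    """Return True if every column of `piece` has the same type as in `main`,
    or is a null column where `main` declares a nullable column of any type."""
    if list(piece) != list(main):  # require the same column names, same order
        return False
    for name, field in piece.items():
        target = main[name]
        if field.type == target.type:
            continue
        if field.type == "null" and target.nullable:
            continue
        return False
    return True


# Central metadata versus a partition whose string column came out empty.
main = {"id": Field("int64", nullable=False), "name": Field("string")}
piece = {"id": Field("int64", nullable=False), "name": Field("null")}
compatible = can_evolve_to(piece, main)
```

A strict equality check would reject this piece, whereas the relaxed check 
accepts it because only a null column differs from a nullable target. A null 
column where the main schema is non-nullable is still rejected.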



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
