[jira] [Commented] (ARROW-539) [Python] Support reading Parquet datasets with standard partition directory schemes

Wes McKinney (JIRA) Wed, 15 Mar 2017 11:44:58 -0700

    [ https://issues.apache.org/jira/browse/ARROW-539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15926744#comment-15926744 ]


Wes McKinney commented on ARROW-539:
------------------------------------

Yes -- I think the most performant / robust option will be to generate 
{{DictionaryArray}} fields from the partition keys. 

For example, if we have 3 partitions with the keys "a", "b", and "c", then we 
will read a pyarrow.Table from each file and add DictionaryArray columns for the 
partition keys. We have to determine all the partition keys up front so that we 
can produce correct dictionary metadata, so the mapping might be

{code}
a -> 0
b -> 1
c -> 2
{code}

So in the first table for partition "a", the dictionary indices are all 0. 
Because every table shares the same dictionary, we can concatenate the tables 
and then convert to pandas.Categorical at the end.
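
A minimal sketch of the encoding idea described above, using plain Python lists as stand-ins for the DictionaryArray indices and dictionary. The helper names ({{build_key_dictionary}}, {{encode_partition}}) and the row counts are hypothetical and only for illustration; this is not the pyarrow implementation.

```python
# Hypothetical sketch: plain lists stand in for DictionaryArray indices/dictionary.

def build_key_dictionary(partition_keys):
    """Map each distinct partition key to a stable integer code."""
    return {key: code for code, key in enumerate(sorted(set(partition_keys)))}


def encode_partition(key, num_rows, key_to_code):
    """Indices for one partition: the same code repeated once per row."""
    return [key_to_code[key]] * num_rows


# Assumed layout: partition key -> number of rows read from that partition's file.
rows_per_partition = {"a": 2, "b": 3, "c": 1}

# All keys must be known up front so every partition shares one dictionary.
key_to_code = build_key_dictionary(rows_per_partition)
dictionary = sorted(key_to_code, key=key_to_code.get)

# "Concatenate" the per-partition index columns.
indices = []
for key, num_rows in rows_per_partition.items():
    indices.extend(encode_partition(key, num_rows, key_to_code))

# Decoding through the shared dictionary recovers the partition key per row.
decoded = [dictionary[i] for i in indices]
```

Since the codes are assigned before any file is read, the per-partition index columns can be concatenated without remapping, and the shared dictionary would supply the categories for the final pandas.Categorical.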

> [Python] Support reading Parquet datasets with standard partition directory 
> schemes
> -----------------------------------------------------------------------------------
>
>                 Key: ARROW-539
>                 URL: https://issues.apache.org/jira/browse/ARROW-539
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: Python
>            Reporter: Wes McKinney
>         Attachments: partitioned_parquet.tar.gz
>
>
> Currently, we only support multi-file directories with a flat structure 
> (non-partitioned). 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
