[ https://issues.apache.org/jira/browse/ARROW-539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15926744#comment-15926744 ]

Wes McKinney commented on ARROW-539:
------------------------------------

Yes -- I think the most performant / robust option will be to generate {{DictionaryArray}} fields from the partition keys. For example, if we have 3 partitions with the keys "a", "b", and "c", then we will read a pyarrow.Table from each file and add DictionaryArray columns for the partition keys. We have to determine all the partition keys up front so that we can produce correct dictionary metadata, so the mapping might be

{code}
a -> 0
b -> 1
c -> 2
{code}

So in the first table, for partition "a", the dictionary indices are all 0. We can then concatenate the tables and convert to pandas.Categorical at the end.

> [Python] Support reading Parquet datasets with standard partition directory
> schemes
> -----------------------------------------------------------------------------------
>
>                 Key: ARROW-539
>                 URL: https://issues.apache.org/jira/browse/ARROW-539
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: Python
>            Reporter: Wes McKinney
>         Attachments: partitioned_parquet.tar.gz
>
>
> Currently, we only support multi-file directories with a flat structure
> (non-partitioned).
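
The encoding scheme proposed in the comment above can be sketched in plain Python. This is only an illustration, not the pyarrow implementation: ordinary lists stand in for the dictionary and the index arrays of a DictionaryArray, the helper names `build_dictionary` and `encode_partition` are hypothetical, and the row counts per partition are made up.

```python
# Hypothetical helper names; plain lists stand in for Arrow arrays.

def build_dictionary(partition_keys):
    """Assign every partition key a stable integer code, determined up front."""
    ordered = sorted(set(partition_keys))
    return ordered, {key: code for code, key in enumerate(ordered)}


def encode_partition(key, num_rows, codes):
    """Indices for one partition: every row carries that partition's code."""
    return [codes[key]] * num_rows


# Row counts per partition are invented for illustration.
rows_per_partition = {"a": 2, "b": 3, "c": 1}

# All keys must be known before any partition is encoded,
# so that every chunk shares the same dictionary.
dictionary, codes = build_dictionary(rows_per_partition)

# Encode each partition separately, then concatenate the index chunks.
chunks = [encode_partition(k, n, codes) for k, n in rows_per_partition.items()]
indices = [i for chunk in chunks for i in chunk]

# Looking indices up in the shared dictionary recovers the key for each row.
decoded = [dictionary[i] for i in indices]
```

Because each chunk is encoded against the same dictionary, concatenation only has to join the integer indices. The string keys are stored once, which is the property the comment relies on when converting to a categorical column at the end.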