[jira] [Commented] (ARROW-539) [Python] Support reading Parquet datasets with standard partition directory schemes

Wes McKinney (JIRA) Tue, 14 Mar 2017 09:03:15 -0700

    [ 
https://issues.apache.org/jira/browse/ARROW-539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15924468#comment-15924468
 ]


Wes McKinney commented on ARROW-539:
------------------------------------

Making partitioned tables with Hive or Impala is fairly involved. Here is the 
code I used to make one:

{code:language=python}
import ibis
import pandas as pd

# Connect to HDFS and Impala on the local test cluster
hdfs = ibis.hdfs_connect('localhost', port=5070)
con = ibis.impala.connect('localhost', port=21050, hdfs_client=hdfs)

df = pd.DataFrame({'year': [2009, 2009, 2009, 2010, 2010, 2010],
                   'month': ['1', '2', '3', '1', '2', '3'],
                   'value': [1, 2, 3, 4, 5, 6]})
df = pd.concat([df] * 10, ignore_index=True)

con.create_database('temp_partition', path='/tmp/my_db')
con.create_table('unpartitioned', df, database='temp_partition')

db = con.database('temp_partition')
unpart_t = db.table('unpartitioned')
part_keys = ['year', 'month']
unique_keys = df[part_keys].drop_duplicates()

con.create_table('partitioned', schema=unpart_t.schema(), 
                 database='temp_partition', partition=part_keys)
part_t = db.table('partitioned')

# Insert each (year, month) slice into its own partition
for year, month in unique_keys.itertuples(index=False):
    select_stmt = unpart_t[(unpart_t.year == year) &
                           (unpart_t.month == month)]

    part = {'year': year, 'month': month}
    part_t.insert(select_stmt, partition=part)
{code}

Now we have

{code}
>>> hdfs.ls('/tmp/my_db/partitioned')
['_impala_insert_staging', 'year=2009', 'year=2010']

>>> hdfs.ls('/tmp/my_db/partitioned/year=2009')
['month=1', 'month=2', 'month=3']
{code}
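
The listing above follows the usual Hive-style convention: each partition key becomes one `key=value` directory level, nested in the order the keys were declared. As a rough sketch of that mapping (the helper name `partition_path` is hypothetical and not part of ibis or Arrow):

```python
import posixpath


def partition_path(base, part_keys, values):
    """Build the directory for one partition (hypothetical helper).

    part_keys fixes the nesting order; values maps each key to its value.
    """
    levels = ['{0}={1}'.format(key, values[key]) for key in part_keys]
    return posixpath.join(base, *levels)


path = partition_path('/tmp/my_db/partitioned', ['year', 'month'],
                      {'year': 2009, 'month': '1'})
print(path)  # /tmp/my_db/partitioned/year=2009/month=1
```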

Finally I ran

{code}
hdfs.get('/tmp/my_db/partitioned', 'partitioned_parquet')
{code}

to download the dataset from HDFS. See the attached tarball.
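
For anyone poking at the downloaded directory, here is a minimal sketch of how a reader might recover the partition values from the paths. It is not the Arrow implementation; `discover_partitions` is a hypothetical helper, it keeps values as strings, and it assumes that names starting with `_` or `.` (such as `_impala_insert_staging`) are not data.

```python
import os


def discover_partitions(root):
    """Return (file_path, partition_values) pairs found under root.

    Hypothetical helper: values stay as strings, and directories or
    files whose names start with '_' or '.' are skipped.
    """
    results = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune staging/hidden directories in place so os.walk skips them
        dirnames[:] = sorted(d for d in dirnames
                             if not d.startswith(('_', '.')))
        rel = os.path.relpath(dirpath, root)
        parts = [] if rel == os.curdir else rel.split(os.sep)
        # Ignore anything that is not nested purely in key=value levels
        if not all('=' in p for p in parts):
            continue
        values = dict(p.split('=', 1) for p in parts)
        for name in sorted(filenames):
            if not name.startswith(('_', '.')):
                results.append((os.path.join(dirpath, name), values))
    return results


# 'partitioned_parquet' is the local directory created by hdfs.get above
for path, values in discover_partitions('partitioned_parquet'):
    print(values, path)
```

A real reader would also need to infer a type for each key (here `year` looks numeric while `month` was written as a string), which this sketch leaves out.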

> [Python] Support reading Parquet datasets with standard partition directory 
> schemes
> -----------------------------------------------------------------------------------
>
>                 Key: ARROW-539
>                 URL: https://issues.apache.org/jira/browse/ARROW-539
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: Python
>            Reporter: Wes McKinney
>         Attachments: partitioned_parquet.tar.gz
>
>
> Currently, we only support multi-file directories with a flat structure 
> (non-partitioned). 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
