[ https://issues.apache.org/jira/browse/ARROW-539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15924468#comment-15924468 ]
Wes McKinney commented on ARROW-539:
------------------------------------
Making partitioned tables with Hive or Impala is pretty difficult. Here is the
code I used to make one:
{code:language=python}
import ibis
import pandas as pd
import hdfs

# Connect to HDFS and to Impala on a local setup
hdfs = ibis.hdfs_connect('localhost', port=5070)
con = ibis.impala.connect('localhost', port=21050, hdfs_client=hdfs)

# Small example frame: 6 distinct (year, month) pairs, repeated 10 times
df = pd.DataFrame({'year': [2009, 2009, 2009, 2010, 2010, 2010],
                   'month': ['1', '2', '3', '1', '2', '3'],
                   'value': [1, 2, 3, 4, 5, 6]})
df = pd.concat([df] * 10, ignore_index=True)

# Load the data into an unpartitioned table first
con.create_database('temp_partition', path='/tmp/my_db')
con.create_table('unpartitioned', df, database='temp_partition')
db = con.database('temp_partition')
unpart_t = db.table('unpartitioned')

# Create an empty table with the same schema, partitioned by year and month
part_keys = ['year', 'month']
unique_keys = df[part_keys].drop_duplicates()
con.create_table('partitioned', schema=unpart_t.schema(),
                 database='temp_partition', partition=part_keys)
part_t = db.table('partitioned')

# Insert the matching rows into one partition at a time
for year, month in unique_keys.itertuples(index=False):
    select_stmt = unpart_t[(unpart_t.year == year) &
                           (unpart_t.month == month)]
    part = {'year': year, 'month': month}
    part_t.insert(select_stmt, partition=part)
{code}
Now we have
{code}
>>> hdfs.ls('/tmp/my_db/partitioned')
['_impala_insert_staging', 'year=2009', 'year=2010']
>>> hdfs.ls('/tmp/my_db/partitioned/year=2009')
['month=1', 'month=2', 'month=3']
{code}
Finally I ran
{code}
hdfs.get('/tmp/my_db/partitioned', 'partitioned_parquet')
{code}
to download the table directory from HDFS. See the attached tarball
(partitioned_parquet.tar.gz).
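
For anyone poking at the downloaded directory, here is a rough sketch of how the partition keys could be recovered from the key=value directory names. It uses only the standard library and is not how Arrow does or will do it. The function names parse_partition_path and discover_partitions are made up for illustration, and skipping entries that start with '_' or '.' (such as _impala_insert_staging) is an assumption on my part.

```python
import os


def parse_partition_path(path):
    """Return the key=value directory components of `path` as (key, value) pairs."""
    pairs = []
    for part in path.replace(os.sep, '/').split('/'):
        if '=' in part:
            key, value = part.split('=', 1)
            pairs.append((key, value))
    return pairs


def discover_partitions(root):
    """Walk `root` and map each data file (relative path) to its partition pairs."""
    found = {}
    for dirpath, dirnames, filenames in os.walk(root):
        # Assumption: names starting with '_' or '.' are staging or metadata
        # entries, so they are pruned from the walk and not treated as data.
        dirnames[:] = [d for d in dirnames if not d.startswith(('_', '.'))]
        rel = os.path.relpath(dirpath, root)
        pairs = parse_partition_path(rel)
        for name in filenames:
            if name.startswith(('_', '.')):
                continue
            found[os.path.join(rel, name)] = pairs
    return found
```

Calling discover_partitions('partitioned_parquet') on the downloaded copy should give one entry per data file. Note that the values come back as strings (for example '2009'), so a real reader would still have to infer or be told the type of each partition column.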
> [Python] Support reading Parquet datasets with standard partition directory
> schemes
> -----------------------------------------------------------------------------------
>
> Key: ARROW-539
> URL: https://issues.apache.org/jira/browse/ARROW-539
> Project: Apache Arrow
> Issue Type: New Feature
> Components: Python
> Reporter: Wes McKinney
> Attachments: partitioned_parquet.tar.gz
>
>
> Currently, we only support multi-file directories with a flat structure
> (non-partitioned).