[jira] [Commented] (ARROW-3424) [Python] Improved workflow for loading an arbitrary collection of Parquet files

Joris Van den Bossche (JIRA) Mon, 13 May 2019 03:06:38 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-3424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16838442#comment-16838442
 ]


Joris Van den Bossche commented on ARROW-3424:
----------------------------------------------

A list of files is already supported in {{ParquetDataset}}, so something like 
this works (and would address the SO question, I think):
 
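{code:python}
import pyarrow.parquet as pq
# ParquetDataset accepts a list of file paths, not only a directory
dataset = pq.ParquetDataset(['part0.parquet', 'part1.parquet'])
df = dataset.read_pandas().to_pandas()
{code}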

Do we think that is enough support? If so, I think this issue can be closed. 
Or do we want to add this to {{pq.read_table}}? It already accepts a directory 
name, which is passed through to {{ParquetDataset}}, so we could do a similar 
pass-through for a list of paths.
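
To make the second option concrete, here is a rough sketch of what such a pass-through could look like from the user's side. It is not the actual {{pq.read_table}} implementation: {{is_path_collection}} and {{read_table_from}} are hypothetical names, the check for "a list of paths" is a guess at reasonable behaviour, and only the {{ParquetDataset}} and {{read_table}} calls mentioned above are assumed to exist.

```python
def is_path_collection(source):
    """Return True if `source` is a non-empty list/tuple of path strings.

    Hypothetical helper: the real dispatch rule would be up to pyarrow.
    """
    return (
        isinstance(source, (list, tuple))
        and len(source) > 0
        and all(isinstance(p, str) for p in source)
    )


def read_table_from(source, columns=None):
    """Read one file, a directory, or a list of files into an Arrow table.

    Hypothetical wrapper illustrating the proposed pass-through.
    """
    # Imported lazily so the dispatch helper above works without pyarrow.
    import pyarrow.parquet as pq

    if is_path_collection(source):
        # A list of paths goes to ParquetDataset, which already supports it.
        return pq.ParquetDataset(list(source)).read(columns=columns)
    # A single file or directory name is handled by read_table as today.
    return pq.read_table(source, columns=columns)
```

With this, {{read_table_from(['part0.parquet', 'part1.parquet']).to_pandas()}} would cover the SO use case in one call, while a single path would behave as it does now.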


> [Python] Improved workflow for loading an arbitrary collection of Parquet 
> files
> -------------------------------------------------------------------------------
>
>                 Key: ARROW-3424
>                 URL: https://issues.apache.org/jira/browse/ARROW-3424
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Wes McKinney
>            Priority: Major
>              Labels: parquet
>             Fix For: 0.14.0
>
>
> See SO question for use case: 
> https://stackoverflow.com/questions/52613682/load-multiple-parquet-files-into-dataframe-for-analysis



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
