[ https://issues.apache.org/jira/browse/ARROW-1956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16607546#comment-16607546 ]

Ying Wang commented on ARROW-1956:
----------------------------------

I don't know if this is helpful to anyone, but I found myself needing to ingest 
an entire Parquet dataset at once (I work at a database company), and I came up 
with this:

```python
import pyarrow.parquet as pq

dataset = pq.ParquetDataset('/path/to/dataset')

# A ParquetDataset is composed of a list of ParquetDatasetPiece objects.
dataset_pieces = dataset.pieces

for dataset_piece in dataset_pieces:
    # dataset.partitions is a ParquetPartitions object; passing it attaches
    # the partition keys to the resulting schema.
    df = dataset_piece.read(partitions=dataset.partitions).to_pandas()
    # do whatever with the dataframe
```

It'll be slow, but you can parallelize it however you want, and each dataframe 
will contain the full dataset schema (as opposed to reading an individual 
ParquetFile, which does not include the partition keys in its schema).
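
If you want to parallelize it, here is a minimal sketch of one way to do so 
with a thread pool (`handle_piece` and the `len(df)` stand-in are illustrative, 
not part of the snippet above):

```python
from concurrent.futures import ThreadPoolExecutor

import pyarrow.parquet as pq

dataset = pq.ParquetDataset('/path/to/dataset')

def handle_piece(dataset_piece):
    # Same per-piece read as above, with the partition keys attached.
    df = dataset_piece.read(partitions=dataset.partitions).to_pandas()
    return len(df)  # stand-in for whatever you actually do with the dataframe

# pyarrow releases the GIL during most of the Parquet decoding, so a thread
# pool is enough to overlap the per-piece reads.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(handle_piece, dataset.pieces))
```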

> [Python] Support reading specific partitions from a partitioned parquet 
> dataset
> -------------------------------------------------------------------------------
>
>                 Key: ARROW-1956
>                 URL: https://issues.apache.org/jira/browse/ARROW-1956
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Format
>    Affects Versions: 0.8.0
>         Environment: Kernel: 4.14.8-300.fc27.x86_64
> Python: 3.6.3
>            Reporter: Suvayu Ali
>            Priority: Minor
>              Labels: parquet
>             Fix For: 0.11.0
>
>         Attachments: so-example.py
>
>
> I want to read specific partitions from a partitioned parquet dataset.  This 
> is very useful in case of large datasets.  I have attached a small script 
> that creates a dataset and shows what is expected when reading (quoting 
> salient points below).
> # There is no way to read specific partitions in Pandas
> # In pyarrow I tried to achieve the goal by providing a list of 
> files/directories to ParquetDataset, but it didn't work.
> # In PySpark it works if I simply do:
> {code:none}
> spark.read.option('basePath', 'datadir').parquet(*list_of_partitions)
> {code}
> I also couldn't find a way to easily write partitioned parquet files.  In the 
> end I did it by hand by creating the directory hierarchies, and writing the 
> individual files myself (similar to the implementation in the attached 
> script).  Again, in PySpark I can do 
> {code:none}
> df.write.partitionBy(*list_of_partitions).parquet(output)
> {code}
> to achieve that.
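
For reference, the Fix For above is 0.11.0, where `ParquetDataset` gained a 
`filters` argument that covers the reading side of this request. A minimal 
sketch, assuming a hypothetical 'year' partition column:

```python
import pyarrow.parquet as pq

# Keep only the pieces whose partition values satisfy the predicates.
# 'year' is a hypothetical partition column; adjust to your dataset.
dataset = pq.ParquetDataset('/path/to/dataset',
                            filters=[('year', '=', '2017')])
df = dataset.read().to_pandas()
```

For the writing side, `pq.write_to_dataset(table, root_path, 
partition_cols=[...])` creates the partitioned directory hierarchy without 
doing it by hand.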


