[
https://issues.apache.org/jira/browse/ARROW-1956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16306537#comment-16306537
]
Suvayu Ali commented on ARROW-1956:
-----------------------------------
Hi Wes,
Inspired by the way PySpark does it, I propose the following.
* Writing partitioned datasets:
{code:none}
writer = PartitionedParquetWriter(basepath, partitions, schema, ...)
{code}
The rest of the arguments could be identical to ParquetWriter. For that
matter, we could also have:
{code:none}
writer = ParquetWriter(where, ..., compression='snappy', partitions=[])
{code}
For a single file, all constructor arguments behave as they do currently
and `partitions` is ignored; however, when `where` is a directory,
`partitions` must be a list of column names to partition on (a usage
sketch for both proposals follows this list).
* Reading partitioned datasets:
{code:none}
dst = ParquetDataset(path_or_paths, validate_schema=True, basepath=None)
{code}
When `basepath` is `None`, we keep the current behaviour; when `basepath`
is a path, directory hierarchies are detected in `path_or_paths`, and each
sub-directory is treated as a Parquet partition in the usual fashion.
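To make this concrete, here is a minimal usage sketch of both proposals.
The `partitions` and `basepath` keywords (and the partitioned-write
behaviour of ParquetWriter) are hypothetical, i.e. part of this proposal
and not existing pyarrow API; the paths and column names are made up.
{code:none}
import pyarrow as pa
import pyarrow.parquet as pq

# A small example table with a 'year' column to partition on.
table = pa.Table.from_arrays(
    [pa.array([2016, 2017, 2017]), pa.array([1.0, 2.0, 3.0])],
    names=['year', 'value'])

# Writing: 'where' is a directory, so 'partitions' names the partition
# columns (proposed keyword, ignored for single-file writes).
writer = pq.ParquetWriter('datadir', table.schema, compression='snappy',
                          partitions=['year'])
writer.write_table(table)
writer.close()

# Reading: 'basepath' tells ParquetDataset to treat sub-directories of
# 'datadir' found in 'path_or_paths' as partitions (proposed keyword).
dst = pq.ParquetDataset(['datadir/year=2017'], basepath='datadir')
df = dst.read().to_pandas()
{code}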
What do you think?
If someone can provide guidance, I can also work on the implementation.
I have lots of free time from the second week of January.
Thanks,
> Support reading specific partitions from a partitioned parquet dataset
> ----------------------------------------------------------------------
>
> Key: ARROW-1956
> URL: https://issues.apache.org/jira/browse/ARROW-1956
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Format
> Affects Versions: 0.8.0
> Environment: Kernel: 4.14.8-300.fc27.x86_64
> Python: 3.6.3
> Reporter: Suvayu Ali
> Priority: Minor
> Labels: parquet
> Fix For: 0.9.0
>
> Attachments: so-example.py
>
>
> I want to read specific partitions from a partitioned Parquet dataset. This
> is very useful for large datasets. I have attached a small script that
> creates a dataset and shows what is expected when reading (quoting salient
> points below).
> # There is no way to read specific partitions in Pandas
> # In pyarrow I tried to achieve the goal by providing a list of
> files/directories to ParquetDataset, but it didn't work (a sketch of the
> attempt follows the PySpark example below).
> # In PySpark it works if I simply do:
> {code:none}
> spark.read.option('basePath', 'datadir').parquet(*list_of_partitions)
> {code}
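> As an illustration of point 2, this is roughly the kind of call that was
> attempted (the paths are hypothetical; the actual attempt is in the attached
> script); the partition column is not reconstructed from an explicit list of
> partition directories the way it is with Spark's 'basePath' option:
> {code:none}
> import pyarrow.parquet as pq
> 
> # Hypothetical Hive-style layout: datadir/year=2016/..., datadir/year=2017/...
> # Reading a subset of partition directories loses the 'year' column.
> dataset = pq.ParquetDataset(['datadir/year=2016', 'datadir/year=2017'])
> table = dataset.read()
> {code}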
> I also couldn't find a way to easily write partitioned Parquet files. In the
> end I did it by hand, creating the directory hierarchies and writing the
> individual files myself (similar to the implementation in the attached
> script, and roughly sketched below). Again, in PySpark I can do
> {code:none}
> df.write.partitionBy(*list_of_partitions).parquet(output)
> {code}
> to achieve that.
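> To show what writing "by hand" looks like, here is a rough sketch (column
> names and paths are made up; this is not the attached script verbatim) of
> writing one file per partition value into a Hive-style directory layout:
> {code:none}
> import os
> 
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> 
> df = pd.DataFrame({'year': [2016, 2016, 2017], 'value': [1.0, 2.0, 3.0]})
> 
> # One sub-directory per partition value, one file per partition.
> for year, chunk in df.groupby('year'):
>     partition_dir = os.path.join('datadir', 'year={}'.format(year))
>     os.makedirs(partition_dir, exist_ok=True)
>     table = pa.Table.from_pandas(chunk.drop(columns=['year']))
>     pq.write_table(table, os.path.join(partition_dir, 'part-0.parquet'))
> {code}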