[ https://issues.apache.org/jira/browse/ARROW-8655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ben Kietzman resolved ARROW-8655.
---------------------------------
Resolution: Fixed
Issue resolved by pull request 10661
[https://github.com/apache/arrow/pull/10661]
> [C++][Dataset][Python][R] Preserve partitioning information for a discovered
> Dataset
> ------------------------------------------------------------------------------------
>
> Key: ARROW-8655
> URL: https://issues.apache.org/jira/browse/ARROW-8655
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Joris Van den Bossche
> Assignee: Joris Van den Bossche
> Priority: Major
> Labels: dataset, dataset-dask-integration, pull-request-available
> Fix For: 5.0.0
>
> Time Spent: 3h 20m
> Remaining Estimate: 0h
>
> Currently, we have the {{HivePartitioning}} and {{DirectoryPartitioning}}
> classes that describe a partitioning used in the discovery phase. But once a
> dataset object is created, it no longer retains that information; it only has
> partition expressions for the fragments. The partition keys are added to the
> schema, but you can't directly tell which columns of the schema originated
> from the partitions.
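> As a rough illustration of the current behaviour, a minimal sketch using the
> Python {{pyarrow.dataset}} API (the path and partition field names are
> hypothetical):
> {code:python}
> import pyarrow.dataset as ds
>
> # Discover a hive-partitioned directory; "year" and "month" are hypothetical
> # partition fields encoded in the directory names.
> dataset = ds.dataset("data/", format="parquet", partitioning="hive")
>
> # The partition keys show up as ordinary columns in the unified schema ...
> print(dataset.schema)
>
> # ... and each fragment carries a partition expression such as
> # ((year == 2020) and (month == 1)), but the Partitioning object used
> # during discovery is no longer accessible from the dataset itself.
> for fragment in dataset.get_fragments():
>     print(fragment.partition_expression)
> {code}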
> However, there are use cases where it would be useful for a dataset to still
> "know" what kind of partitioning it was created from:
> - The "read CSV, write back Parquet" use case, where the CSV dataset was
> already partitioned and you want to automatically preserve that partitioning
> for Parquet (roundtripping the partitioning on read/write); see the sketch
> after this list.
> - Converting the dataset to another representation, e.g. to pandas, where it
> can be useful to know which columns were partition columns (for pandas, those
> columns might be good candidates to set as the index of the pandas/dask
> DataFrame). I can imagine conversions to other systems could use similar
> information.
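> A minimal sketch of the roundtrip use case, assuming the discovered dataset
> exposes the partitioning it was created with as a {{partitioning}} attribute
> that can be passed back to {{write_dataset}} (paths are hypothetical):
> {code:python}
> import pyarrow.dataset as ds
>
> # Read a hive-partitioned CSV dataset ...
> csv_dataset = ds.dataset("csv_root/", format="csv", partitioning="hive")
>
> # ... and write it back as Parquet, reusing the partitioning that was
> # discovered on read (assumes the dataset preserves it as .partitioning).
> ds.write_dataset(
>     csv_dataset,
>     "parquet_root/",
>     format="parquet",
>     partitioning=csv_dataset.partitioning,
> )
> {code}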