[ https://issues.apache.org/jira/browse/ARROW-8655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ben Kietzman resolved ARROW-8655.
---------------------------------
Resolution: Fixed
Issue resolved by pull request 10661
[https://github.com/apache/arrow/pull/10661]
> [C++][Dataset][Python][R] Preserve partitioning information for a discovered
> Dataset
> ------------------------------------------------------------------------------------
>
> Key: ARROW-8655
> URL: https://issues.apache.org/jira/browse/ARROW-8655
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Joris Van den Bossche
> Assignee: Joris Van den Bossche
> Priority: Major
> Labels: dataset, dataset-dask-integration, pull-request-available
> Fix For: 5.0.0
>
> Time Spent: 3h 20m
> Remaining Estimate: 0h
>
> Currently, we have the {{HivePartitioning}} and {{DirectoryPartitioning}}
> classes that describe a partitioning used in the discovery phase. But once a
> dataset object is created, it no longer retains that information; it only has
> partition expressions for the fragments. The partition keys are added to the
> schema, but you can't directly tell which columns of the schema originated
> from the partitions.
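> As a rough illustration of the current behaviour, a minimal sketch using the
> Python {{pyarrow.dataset}} API (the path and partition field names are
> hypothetical):
> {code:python}
> import pyarrow.dataset as ds
>
> # Discover a hive-partitioned directory; "year" and "month" are hypothetical
> # partition fields encoded in the directory names.
> dataset = ds.dataset("data/", format="parquet", partitioning="hive")
>
> # The partition keys show up as ordinary columns in the unified schema ...
> print(dataset.schema)
>
> # ... and each fragment carries a partition expression such as
> # ((year == 2020) and (month == 1)), but the Partitioning object used
> # during discovery is no longer accessible from the dataset itself.
> for fragment in dataset.get_fragments():
>     print(fragment.partition_expression)
> {code}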
> However, there are use cases where it would be useful for a dataset to still
> "know" what kind of partitioning it was created from:
> - The "read CSV, write back Parquet" use case, where the CSV dataset was
> already partitioned and you want to automatically preserve that partitioning
> for Parquet (roundtripping the partitioning on read/write); see the sketch
> after this list.
> - Converting the dataset to another representation, e.g. to pandas, where it
> can be useful to know which columns were partition columns (for pandas, those
> columns might be good candidates to set as the index of the pandas/dask
> DataFrame). I can imagine conversions to other systems could use similar
> information.
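> A minimal sketch of the roundtrip use case, assuming the discovered dataset
> exposes the partitioning it was created with as a {{partitioning}} attribute
> that can be passed back to {{write_dataset}} (paths are hypothetical):
> {code:python}
> import pyarrow.dataset as ds
>
> # Read a hive-partitioned CSV dataset ...
> csv_dataset = ds.dataset("csv_root/", format="csv", partitioning="hive")
>
> # ... and write it back as Parquet, reusing the partitioning that was
> # discovered on read (assumes the dataset preserves it as .partitioning).
> ds.write_dataset(
>     csv_dataset,
>     "parquet_root/",
>     format="parquet",
>     partitioning=csv_dataset.partitioning,
> )
> {code}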