[
https://issues.apache.org/jira/browse/ARROW-8655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17375871#comment-17375871
]
Weston Pace commented on ARROW-8655:
------------------------------------
I think the only tricky part here is the unique values. In the general case a
dataset may not know all possible values. The "partition_expression" of a
fragment is not required to be an equality expression (or even a conjunction
of several equality expressions). Technically, nothing prevents creating a
union dataset, perhaps composed of a CSV dataset (where all data has
timestamp < 2020) and a Parquet dataset (where all data has timestamp > 2020),
because the company changed how it stored data at some point.

Scanning for all values currently in use happens in the factory->dataset step
(which I suppose is somewhat hidden in the current Python implementation).
Maybe we could add something to the dataset factory so that calling Finish (or
accessing a new property after calling Finish) returns dictionaries of
everything it discovered.
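
To make the "dictionaries of everything it discovered" idea concrete, here is a
minimal sketch in plain Python. It is not the Arrow API: the helper name
discover_partition_values and the example paths are hypothetical, and it
assumes hive-style key=value directory segments. It only covers the
equality-style case; a fragment whose partition expression is a range (such as
timestamp < 2020) has no single value to record, which is exactly the tricky
part described above.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def discover_partition_values(paths):
    """Collect the distinct values seen for each hive-style key=value segment.

    Hypothetical helper for illustration only; not part of Arrow.
    """
    seen = defaultdict(list)
    for path in paths:
        # Only directory segments can carry partition keys; skip the file name.
        for segment in PurePosixPath(path).parts[:-1]:
            key, sep, value = segment.partition("=")
            # Ignore segments that are not of the form key=value.
            if sep and key and value not in seen[key]:
                seen[key].append(value)
    return dict(seen)


# Made-up paths standing in for what a discovery pass would walk over.
paths = [
    "data/year=2019/month=12/part-0.parquet",
    "data/year=2020/month=1/part-0.parquet",
    "data/year=2020/month=2/part-0.parquet",
]
values = discover_partition_values(paths)
print(values)
```

The values are kept as strings in first-seen order; a real implementation would
presumably infer types and store them per partition field.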
> [C++][Dataset][Python][R] Preserve partitioning information for a discovered
> Dataset
> ------------------------------------------------------------------------------------
>
> Key: ARROW-8655
> URL: https://issues.apache.org/jira/browse/ARROW-8655
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Joris Van den Bossche
> Priority: Major
> Labels: dataset, dataset-dask-integration, pull-request-available
> Fix For: 5.0.0
>
> Time Spent: 1h 50m
> Remaining Estimate: 0h
>
> Currently, we have the {{HivePartitioning}} and {{DirectoryPartitioning}}
> classes that describe a partitioning used in the discovery phase. But once a
> dataset object is created, it no longer knows about this; it just has
> partition expressions for the fragments. The partition keys are added to
> the schema, but you can't directly tell which columns of the schema
> originated from the partitions.
> However, there are use cases where it would be useful for a dataset to still
> "know" what kind of partitioning it was created from:
> - The "read CSV, write back Parquet" use case, where the CSV was already
> partitioned and you want to automatically preserve that partitioning for
> Parquet (roundtripping the partitioning on read/write).
> - Converting the dataset to another representation, e.g. to pandas, where it
> can be useful to know which columns were partition columns (e.g. for pandas,
> those columns might be good candidates to be set as the index of the
> pandas/dask DataFrame). I can imagine conversions to other systems could use
> similar information.