[ 
https://issues.apache.org/jira/browse/ARROW-8655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17375352#comment-17375352
 ] 

Joris Van den Bossche commented on ARROW-8655:
----------------------------------------------

(This has come up again in https://github.com/dask/dask/pull/7557 (cc 
[~fjetter]), and I also wondered about it while deprecating similar 
functionality in ParquetDataset in ARROW-13074 / 
https://github.com/apache/arrow/pull/10549, so I am trying to revive this 
issue.)

Trying to think through what information could be useful to expose, I think 
there are two levels at which we could expose it: the dataset and the 
fragment.

For the full Dataset:

* The names of the partition fields (in correct order), and maybe full schema 
(i.e. including the type)
* All possible values a specific partition field can take?

For the individual fragments:

* The actual field values for each of the partition fields/keys, such as the 
mapping that is currently returned by 
{{ds._get_partition_keys(fragment.partition_expression)}} (assuming this would 
preserve order, is there anything more needed here? Or "just" a more official 
(public) method to get this information?)
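
To make the two levels concrete, here is a minimal pure-Python sketch that 
derives this information from hive-style paths ({{key=value}} directories). It 
does not use the Arrow API: the function names {{fragment_partition_keys}} and 
{{dataset_partition_info}} and the example paths are made up for illustration 
only.

```python
from collections import OrderedDict


def fragment_partition_keys(path):
    """Return the ordered key -> value mapping encoded in one hive-style path."""
    keys = OrderedDict()
    for segment in path.split("/")[:-1]:  # skip the file name itself
        if "=" in segment:
            key, value = segment.split("=", 1)
            keys[key] = value
    return keys


def dataset_partition_info(paths):
    """Return (field names in path order, values seen per field) for all paths."""
    names = []
    values = {}
    for path in paths:
        for key, value in fragment_partition_keys(path).items():
            if key not in values:
                names.append(key)
                values[key] = []
            if value not in values[key]:
                values[key].append(value)
    return names, values


# Hypothetical example paths, not taken from a real dataset.
paths = [
    "year=2020/month=1/part-0.parquet",
    "year=2020/month=2/part-0.parquet",
    "year=2021/month=1/part-0.parquet",
]
names, values = dataset_partition_info(paths)
print(names)                                   # dataset level: ordered field names
print(values)                                  # dataset level: values per field
print(dict(fragment_partition_keys(paths[1])))  # fragment level: key/value mapping
```

Note that the values stay strings here; a real implementation would also need 
the field types, which is why exposing the full schema of the partition fields 
is listed above.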

For the dataset level, we could maybe simply expose the "finished" Partitioning 
object that is created while creating the FileSystemDataset through the factory 
method. Currently, this Partitioning object is discarded, but we could pass it 
through to the FileSystemDataset to preserve the partitioning object from which 
it was created. 

> [C++][Dataset][Python][R] Preserve partitioning information for a discovered 
> Dataset
> ------------------------------------------------------------------------------------
>
>                 Key: ARROW-8655
>                 URL: https://issues.apache.org/jira/browse/ARROW-8655
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Joris Van den Bossche
>            Priority: Major
>              Labels: dataset, dataset-dask-integration, pull-request-available
>             Fix For: 6.0.0
>
>          Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Currently, we have the {{HivePartitioning}} and {{DirectoryPartitioning}} 
> classes that describe a partitioning used in the discovery phase. But once a 
> dataset object is created, it doesn't know any more about this, it just has 
> partition expressions for the fragments. And the partition keys are added to 
> the schema, but you can't directly know which columns of the schema 
> originated from the partitions.
> However, there can be use cases where it would be useful that a dataset still 
> "knows" from what kind of partitioning it was created:
> - The "read CSV write back Parquet" use case, where the CSV was already 
> partitioned and you want to automatically preserve that partitioning for 
> parquet (kind of roundtripping the partitioning on read/write)
> - To convert the dataset to other representation, eg conversion to pandas, it 
> can be useful to know what columns were partition columns (eg for pandas, 
> those columns might be good candidates to be set as the index of the 
> pandas/dask DataFrame). I can imagine conversions to other systems can use 
> similar information.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
