andrei-ionescu opened a new pull request #1500:
URL: https://github.com/apache/arrow-datafusion/pull/1500
# Which issue does this PR close?
Closes #1404.
# Rationale for this change
DataFusion lacks support for a partition-by operation.
The most common use case: given a dataset, we need to write it to
storage as a partitioned set of files (e.g. a Parquet dataset partitioned by
year/month/day). The output paths should look like this:
```
/dataset/day=2021-12-28/fuel=Gas/
/dataset/day=2021-12-28/fuel=Diesel/
/dataset/day=2021-12-27/fuel=Electric/
...
```
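As a rough illustration of the layout above, here is a minimal sketch of how such a directory name could be assembled from ordered column/value pairs. `partition_dir` is a hypothetical helper written for this example; it is not part of DataFusion or of this PR.

```rust
/// Builds a Hive-style partition directory (`column=value` segments)
/// from a base path and ordered (column, value) pairs.
/// Hypothetical helper for illustration only, not a DataFusion API.
fn partition_dir(base: &str, keys: &[(&str, &str)]) -> String {
    // Normalize the base so we never emit a double slash.
    let mut path = base.trim_end_matches('/').to_string();
    for (column, value) in keys {
        path.push('/');
        path.push_str(column);
        path.push('=');
        path.push_str(value);
    }
    // Trailing slash marks the result as a directory.
    path.push('/');
    path
}
```

Calling `partition_dir("/dataset", &[("day", "2021-12-28"), ("fuel", "Gas")])` yields the first path listed above. Values are used verbatim here; a real writer would likely need to escape characters that are not valid in paths.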
# What changes are included in this PR?
This PR adds a new partitioning method: `Partitioning::PartitionBy`.
It has two parameters:
- the expression by which the partitioning will take place
- an optional value specifying the number of partitions to output:
  - ideally this is the exact number of distinct values by which the data
    will be partitioned, which benefits performance
  - if it is larger than the exact number of distinct values, the number of
    partitions found in the data is returned
  - if it is smaller, the first n partitions are returned and the others are
    dropped
  - if not specified, it will start from `i16::MAX`
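
The partition-count behaviour described above can be sketched in plain Rust. This is not the implementation in this PR: `partition_by` and `DEFAULT_MAX_PARTITIONS` are hypothetical names, rows are modelled as simple `(key, value)` pairs, and the sketch assumes that "first n" means the first n distinct keys in order of appearance and that rows belonging to dropped partitions are discarded.

```rust
use std::collections::HashMap;
use std::hash::Hash;

/// Assumed default cap when no partition count is given (per the list above).
const DEFAULT_MAX_PARTITIONS: usize = i16::MAX as usize;

/// Groups rows by key into at most `max_partitions` output partitions.
/// Hypothetical sketch of the described semantics, not the PR's code.
fn partition_by<K, V>(rows: Vec<(K, V)>, max_partitions: Option<usize>) -> Vec<(K, Vec<V>)>
where
    K: Eq + Hash + Clone,
{
    let limit = max_partitions.unwrap_or(DEFAULT_MAX_PARTITIONS);
    // Maps each distinct key to its index in `out`.
    let mut index: HashMap<K, usize> = HashMap::new();
    let mut out: Vec<(K, Vec<V>)> = Vec::new();

    for (key, value) in rows {
        let existing = index.get(&key).copied();
        match existing {
            // Key already has a partition: append the row.
            Some(i) => out[i].1.push(value),
            // New key and room left: open a new partition.
            None if out.len() < limit => {
                index.insert(key.clone(), out.len());
                out.push((key, vec![value]));
            }
            // New key beyond the limit: the row is dropped (assumption).
            None => {}
        }
    }
    out
}
```

With keys `Gas, Diesel, Gas, Electric` and a limit of 2, the sketch returns the `Gas` and `Diesel` partitions and drops `Electric`; with a limit of 10 or no limit, it returns the three partitions actually present in the data.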
# Are there any user-facing changes?
There is a new partitioning option available as part of the API.