andrei-ionescu opened a new pull request #1500:
URL: https://github.com/apache/arrow-datafusion/pull/1500


   # Which issue does this PR close?
   
   Closes #1404.
   
   # Rationale for this change
   
   DataFusion lacks support for a partition-by operation.
   
   The most common example: given a dataset, we need to write it to storage 
as a partitioned set of files (e.g. a Parquet dataset partitioned by 
year/month/day). We need to write it to paths like this:
   
   ```
   /dataset/day=2021-12-28/fuel=Gas/
   /dataset/day=2021-12-28/fuel=Diesel/
   /dataset/day=2021-12-27/fuel=Electric/
   ...
   ```
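
To make the path layout above concrete, here is a minimal sketch of how such a directory name could be assembled from a root and a list of column/value pairs. `partition_dir` is a hypothetical helper written for illustration only; it is not part of DataFusion or of this PR.

```rust
/// Builds a Hive-style partition directory such as
/// `/dataset/day=2021-12-28/fuel=Gas/` from a root path and
/// (column, value) pairs, in the order given.
/// Hypothetical helper for illustration; not a DataFusion API.
fn partition_dir(root: &str, keys: &[(&str, &str)]) -> String {
    // Drop any trailing slash so the root and the first segment join cleanly.
    let mut path = String::from(root.trim_end_matches('/'));
    for (column, value) in keys {
        path.push('/');
        path.push_str(column);
        path.push('=');
        path.push_str(value);
    }
    // End with a slash, matching the directory listing above.
    path.push('/');
    path
}
```

Calling it with the root `/dataset` and the pairs `("day", "2021-12-28")` and `("fuel", "Gas")` yields the first path in the listing.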
   
   # What changes are included in this PR?
   
   This PR adds a new partitioning method: `Partitioning::PartitionBy`. 
   
   It has two parameters:
   - the expression by which the partitioning will take place
   - an optional value specifying the number of partitions to output:
     - ideally this is the exact number of distinct values by which 
the data will be partitioned, which benefits performance
     - if it is larger than the exact number of distinct values, the 
output contains only the partitions found in the data
     - if it is smaller, the first n partitions are returned and the 
rest are dropped
     - if not specified, it will start from `i16::MAX`
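
The behaviour of the optional partition count can be sketched with plain standard-library Rust. This is not the code from the PR: `partition_by`, `key_of`, and `max_partitions` are hypothetical names, the rows are ordinary in-memory values rather than Arrow record batches, and treating `i16::MAX` as the default cap is one reading of the last bullet above.

```rust
use std::collections::HashMap;
use std::hash::Hash;

/// Groups `rows` by the key that `key_of` extracts, keeping partitions in
/// first-seen order. At most `max_partitions` partitions are created
/// (assumed default: i16::MAX); rows whose key would open a partition
/// beyond that cap are dropped.
/// Illustrative sketch only; not the DataFusion implementation.
fn partition_by<K, R>(
    rows: Vec<R>,
    key_of: impl Fn(&R) -> K,
    max_partitions: Option<usize>,
) -> Vec<(K, Vec<R>)>
where
    K: Hash + Eq + Clone,
{
    let limit = max_partitions.unwrap_or(i16::MAX as usize);
    // Maps each key to the index of its partition in `partitions`.
    let mut index: HashMap<K, usize> = HashMap::new();
    let mut partitions: Vec<(K, Vec<R>)> = Vec::new();

    for row in rows {
        let key = key_of(&row);
        if let Some(&i) = index.get(&key) {
            // Known key: append to its existing partition.
            partitions[i].1.push(row);
        } else if partitions.len() < limit {
            // New key and room left: open a new partition.
            index.insert(key.clone(), partitions.len());
            partitions.push((key, vec![row]));
        }
        // Otherwise the cap is reached and the row is dropped.
    }
    partitions
}
```

With rows keyed by `Gas`, `Diesel`, `Gas`, `Electric`, a cap of 10 (or no cap) produces three partitions, while a cap of 2 keeps only `Gas` and `Diesel` and drops the `Electric` row.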
   
   # Are there any user-facing changes?
   
   There is a new partitioning option available as part of the API.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
