andrei-ionescu edited a comment on issue #1404: URL: https://github.com/apache/arrow-datafusion/issues/1404#issuecomment-991815767
My use case is exactly the one described above. What I did to get the desired output:

- read the Parquet file into a data frame
- given the partitioning columns, extract all distinct values for the chosen columns
- from the partition column names and their values, construct a filter and apply it to the data frame to pull out a single partition
- write the resulting `Vec<RecordBatch>` into Arrow
- use the file writer to write the Arrow structure into a file

This works pretty well for small datasets (21KB, 6K rows, 72 partitions), but for bigger datasets it falls really far behind Spark.
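The core of the steps above can be sketched with plain Rust, standing in for the Arrow `RecordBatch` machinery. This is a minimal illustration, not the actual code from my project: the `Row` struct and its fields are hypothetical placeholders for a real Arrow schema, and in the real flow each group would be handed to the Parquet/Arrow file writer instead of kept in memory:

```rust
use std::collections::BTreeMap;

// Toy record standing in for one row of the Parquet file; the real
// implementation operates on Arrow `RecordBatch`es, not structs.
// Field names here are made up for the example.
#[derive(Debug, Clone)]
struct Row {
    country: String, // hypothetical partition column
    value: i64,
}

/// Group rows by the distinct values of the partition column, mirroring
/// the "extract distinct values, then select one partition per value"
/// strategy described above. Each resulting group corresponds to one
/// output file.
fn split_by_partition(rows: &[Row]) -> BTreeMap<String, Vec<Row>> {
    let mut parts: BTreeMap<String, Vec<Row>> = BTreeMap::new();
    for row in rows {
        parts
            .entry(row.country.clone())
            .or_default()
            .push(row.clone());
    }
    parts
}

fn main() {
    let rows = vec![
        Row { country: "RO".into(), value: 1 },
        Row { country: "US".into(), value: 2 },
        Row { country: "RO".into(), value: 3 },
    ];
    let parts = split_by_partition(&rows);
    // One group per distinct partition value.
    println!("{} partitions", parts.len());
}
```

Note one plausible reason the approach lags on bigger datasets: re-filtering the whole data frame once per distinct partition value is roughly O(partitions × rows), whereas a single grouping pass like the sketch above (or Spark's shuffle-based partitioned write) touches each row only once.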