andrei-ionescu edited a comment on issue #1404: URL: https://github.com/apache/arrow-datafusion/issues/1404#issuecomment-991815767
My use case is exactly the one described above. What I did to get the desired output:

- read the Parquet file into a data frame
- given the partitioning columns, extract all distinct values for the chosen columns
- from the partition column names and their values, construct a filter and apply it to the data frame to pull out a single partition
- write the resulting `Vec<RecordBatch>` into Arrow
- use the file writer to write the Arrow structure into a file

This works pretty well for small datasets (21KB, 6K rows, 72 partitions), but for bigger datasets it falls really far behind Spark.
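The core of the steps above can be sketched with plain Rust, standing in for the Arrow `RecordBatch` machinery. This is a minimal illustration, not the actual code from my project: the `Row` struct and its fields are hypothetical placeholders for a real Arrow schema, and in the real flow each group would be handed to the Parquet/Arrow file writer instead of kept in memory:

```rust
use std::collections::BTreeMap;

// Toy record standing in for one row of the Parquet file; the real
// implementation operates on Arrow `RecordBatch`es, not structs.
// Field names here are made up for the example.
#[derive(Debug, Clone)]
struct Row {
    country: String, // hypothetical partition column
    value: i64,
}

/// Group rows by the distinct values of the partition column, mirroring
/// the "extract distinct values, then select one partition per value"
/// strategy described above. Each resulting group corresponds to one
/// output file.
fn split_by_partition(rows: &[Row]) -> BTreeMap<String, Vec<Row>> {
    let mut parts: BTreeMap<String, Vec<Row>> = BTreeMap::new();
    for row in rows {
        parts
            .entry(row.country.clone())
            .or_default()
            .push(row.clone());
    }
    parts
}

fn main() {
    let rows = vec![
        Row { country: "RO".into(), value: 1 },
        Row { country: "US".into(), value: 2 },
        Row { country: "RO".into(), value: 3 },
    ];
    let parts = split_by_partition(&rows);
    // One group per distinct partition value.
    println!("{} partitions", parts.len());
}
```

Note one plausible reason the approach lags on bigger datasets: re-filtering the whole data frame once per distinct partition value is roughly O(partitions × rows), whereas a single grouping pass like the sketch above (or Spark's shuffle-based partitioned write) touches each row only once.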