[GitHub] [arrow-datafusion] alamb commented on issue #6672: Optimization: Avoid sort for already sorted Parquet files that do not overlap values on condition

via GitHub Thu, 15 Jun 2023 07:18:44 -0700


alamb commented on issue #6672:
URL: 
https://github.com/apache/arrow-datafusion/issues/6672#issuecomment-1593157575


   > Given the assumptions I can make about the Parquet files, I think that the 
SortPreservingMergeExec can be replaced by what is essentially a concatenation 
of each of the Parquet files.
   
   I agree
   
   > I have an idea of implementing a custom PhysicalOptimizerRule that looks 
for the SortPreservingMergeExec ParquetExec pattern, and replaces it with a 
concatenation instead.
   
   Yes, I think this would work. We do something similar in IOx 
(interestingly, also for the time series use case with non-overlapping 
time ranges). 
   
   It was implemented by @crepererum; you can see the code in 
https://github.com/influxdata/influxdb_iox/tree/main/iox_query/src/physical_optimizer
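
   As a rough illustration of the check such a rule relies on (this is not 
DataFusion or IOx code; `FileRange` and `concat_order` are hypothetical names, 
and the min/max values are assumed to come from each file's statistics on the 
sort column): if the files, ordered by minimum value, never overlap, then 
reading them one after another already yields sorted output.

```rust
/// Hypothetical per-file statistics for the sort column (assumption:
/// each file is internally sorted and `min <= max`).
#[derive(Debug, Clone, PartialEq)]
struct FileRange {
    name: String,
    min: i64,
    max: i64,
}

/// Returns the files ordered by `min` when no two ranges overlap,
/// meaning a plain concatenation in that order is already sorted.
/// Returns `None` when any ranges overlap and a merge is still needed.
fn concat_order(mut files: Vec<FileRange>) -> Option<Vec<FileRange>> {
    files.sort_by_key(|f| f.min);
    // Adjacent ranges may touch at a boundary value, but must not cross.
    let disjoint = files.windows(2).all(|w| w[0].max <= w[1].min);
    if disjoint {
        Some(files)
    } else {
        None
    }
}
```

   A real rule would only replace the merge when this check passes, and would 
leave the plan unchanged otherwise.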
   
   > Manually re-partition the Parquet files into a single Parquet file using 
this new API: 
https://docs.rs/parquet/latest/parquet/file/writer/struct.SerializedRowGroupWriter.html#method.append_column
   
   I think this is likely the fastest solution for querying: time predicates 
could then be used to prune entire row groups, and you would have lower 
file-opening overhead.
   
   The downside, of course, is that you would need to rewrite the Parquet files.
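
   To make the pruning point concrete, here is a minimal sketch (plain Rust, 
not the parquet crate API; `RowGroupStats` and `row_groups_to_read` are 
hypothetical names, and the min/max timestamps are assumed to be available 
from row group statistics). A row group is skipped whenever its time range 
cannot intersect the query's inclusive range:

```rust
/// Hypothetical min/max timestamps for one row group.
#[derive(Debug, Clone, Copy, PartialEq)]
struct RowGroupStats {
    min_time: i64,
    max_time: i64,
}

/// Returns the indices of row groups whose time range intersects the
/// inclusive query range `[start, end]`; all others can be pruned.
fn row_groups_to_read(groups: &[RowGroupStats], start: i64, end: i64) -> Vec<usize> {
    groups
        .iter()
        .enumerate()
        .filter(|(_, g)| g.max_time >= start && g.min_time <= end)
        .map(|(i, _)| i)
        .collect()
}
```

   With non-overlapping, time-ordered row groups, a narrow time predicate 
selects only a small contiguous run of row groups.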
   
   
   


