alamb commented on issue #6672: URL: https://github.com/apache/arrow-datafusion/issues/6672#issuecomment-1593157575
> Given the assumptions I can make about the Parquet files, I think that the SortPreservingMergeExec can be replaced by what is essentially a concatenation of each of the Parquet files.

I agree.

> I have an idea of implementing a custom PhysicalOptimizerRule that looks for the SortPreservingMergeExec ParquetExec pattern, and replaces it with a concatenation instead.

Yes, I think this would work. We do some similar things in IOx (interestingly, also for the timeseries use case with non-overlapping time ranges). It was implemented by @crepererum, and you can see it in https://github.com/influxdata/influxdb_iox/tree/main/iox_query/src/physical_optimizer

> Manually re-partition the Parquet files into a single Parquet file using this new API: https://docs.rs/parquet/latest/parquet/file/writer/struct.SerializedRowGroupWriter.html#method.append_column

I think this is likely the solution that would be the fastest for querying, because time predicates could then be used to prune out entire row groups and you would have lower file-opening overhead. The downside, of course, is that you would need to rewrite the Parquet files.
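
To make the two ideas above concrete, here is a minimal, std-only Rust sketch. It is not DataFusion or IOx code and does not use the `PhysicalOptimizerRule` or `parquet` APIs. The names `SortedFile`, `concat_if_disjoint`, `RowGroupStats`, and `row_groups_to_scan` are hypothetical, and each file is assumed to be a non-empty-or-skippable list of timestamps already sorted ascending. The first function shows why a sort-preserving merge can degenerate into a plain concatenation when time ranges do not overlap; the second shows how per-row-group min/max statistics let a time predicate skip whole row groups.

```rust
/// Hypothetical stand-in for one Parquet file: its timestamps, sorted ascending.
type SortedFile = Vec<i64>;

/// If the files' time ranges do not overlap, return their concatenation in
/// time order (already globally sorted, so no merge is needed).
/// Return `None` when two ranges overlap and a real merge would be required.
fn concat_if_disjoint(mut files: Vec<SortedFile>) -> Option<Vec<i64>> {
    // Empty files contribute nothing and have no min/max to compare.
    files.retain(|f| !f.is_empty());
    // Order files by their first (minimum) timestamp.
    files.sort_by_key(|f| f[0]);
    // Each file must end no later than the next one begins.
    for pair in files.windows(2) {
        let prev_max = pair[0][pair[0].len() - 1];
        let next_min = pair[1][0];
        if prev_max > next_min {
            return None;
        }
    }
    Some(files.into_iter().flatten().collect())
}

/// Hypothetical min/max timestamp statistics for one row group.
struct RowGroupStats {
    min_ts: i64,
    max_ts: i64,
}

/// Indices of row groups that might contain rows with `lo <= ts <= hi`.
/// Row groups whose [min_ts, max_ts] range misses the predicate are pruned.
fn row_groups_to_scan(stats: &[RowGroupStats], lo: i64, hi: i64) -> Vec<usize> {
    stats
        .iter()
        .enumerate()
        .filter(|(_, s)| s.max_ts >= lo && s.min_ts <= hi)
        .map(|(i, _)| i)
        .collect()
}
```

In this sketch, the overlap check plays the role that the optimizer rule's pattern match would play: the cheap concatenation is only chosen when the non-overlap assumption can actually be verified, and otherwise the caller falls back to a real merge.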
