alamb commented on issue #7490: URL: https://github.com/apache/arrow-datafusion/issues/7490#issuecomment-1724556274
> In fact, when collect_statistics is enabled, the ListingTable already fetches file-level statistics on each query, but discards them after rolling them up into one statistic per column.

FYI @Ted-Jiang has added some ability to reuse the statistics: https://github.com/apache/arrow-datafusion/pull/7570

> Describe alternatives you've considered
> At my company, we created a custom FileFormat implementation that outputs a wrapped ParquetExec with the output_ordering() method overrided, and the files redistributed to be in-order.

FWIW we implemented something similar in IOx: https://github.com/influxdata/influxdb_iox

> However, instead of using statistics, it relies on hints from configuration provided by the user, plus this does not particularly seem in the spirit of what FileFormat is supposed to be. We would like to implement this optimization in a way that fits better with DataFusion and works out of the box without hints.

I agree it would be very nice to have a native way built into DataFusion.

The solution you describe seems very reasonable to me. Thank you for writing it up.
