[GitHub] [arrow-datafusion] alamb commented on issue #7490: Use file statistics in query planning

via GitHub Mon, 18 Sep 2023 15:37:24 -0700


alamb commented on issue #7490:
URL: https://github.com/apache/arrow-datafusion/issues/7490#issuecomment-1724556274


   >  In fact, when collect_statistics is enabled, the ListingTable already 
fetches file-level statistics on each query, but discards them after rolling 
them up into one statistic per column.
   
   FYI @Ted-Jiang has added some ability to reuse the statistics: 
https://github.com/apache/arrow-datafusion/pull/7570
   
   > Describe alternatives you've considered
   > At my company, we created a custom FileFormat implementation that outputs 
a wrapped ParquetExec with the output_ordering() method overrided, and the 
files redistributed to be in-order.
   
   FWIW we implemented something similar in IOx: 
https://github.com/influxdata/influxdb_iox
   
   >  However, instead of using statistics, it relies on hints from 
configuration provided by the user, plus this does not particularly seem in the 
spirit of what FileFormat is supposed to be. We would like to implement this 
optimization in a way that fits better with DataFusion and works out of the box 
without hints.
   
   I agree it would be very nice to have a native way to do this built into DataFusion.
   
   The solution you describe seems very reasonable to me. Thank you for writing 
it up.


