Re: [I] `datafusion.execution.collect_statistics` on wide tables [datafusion]

via GitHub Wed, 15 Apr 2026 09:07:53 -0700


alamb commented on issue #21624:
URL: https://github.com/apache/datafusion/issues/21624#issuecomment-4253609075


   I think this is a great idea and very 💯 
   
   Some thoughts:
   
   # Two step query:
   
   > One idea is that instead of 1 step of "execute query" we have:
   
   I think it makes sense for core DataFusion to have the analysis and an API 
to report what statistics the query optimizer could use (e.g. stats on join 
columns, etc). I am not sure it makes sense to have an orchestration layer in 
the core crate to actually fetch those statistics, as I think the mechanisms for 
that (e.g. prefetching, accessing external catalogs, etc) are likely to be 
system specific.
   
   Maybe an API similar to 
[`PruningPredicate::required_columns`](https://docs.rs/datafusion/latest/datafusion/physical_optimizer/pruning/struct.PruningPredicate.html#method.required_columns)
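   
   As a rough illustration of the shape such an analysis API could take, here 
is a minimal, self-contained sketch. Everything in it (`Plan`, `StatKind`, 
`StatRequest`, `required_statistics`) is hypothetical and heavily simplified; 
none of these are existing DataFusion types. The idea is only that the core 
reports which statistics would be useful and leaves fetching to the caller:

```rust
use std::collections::BTreeSet;

/// Hypothetical: kinds of statistics an optimizer rule might ask for.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum StatKind {
    MinMax,
    DistinctCount,
}

/// Hypothetical: one request for a statistic on a single column.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
struct StatRequest {
    table: String,
    column: String,
    kind: StatKind,
}

/// Hypothetical, heavily simplified stand-in for a logical plan.
#[allow(dead_code)]
enum Plan {
    Scan { table: String },
    Filter { table: String, column: String, input: Box<Plan> },
    Join {
        left: Box<Plan>,
        right: Box<Plan>,
        // ((left table, left column), (right table, right column))
        on: ((String, String), (String, String)),
    },
}

/// Walk the plan and report which statistics could help the optimizer.
/// This only reports the requests; it does not fetch anything.
fn required_statistics(plan: &Plan) -> BTreeSet<StatRequest> {
    let mut out = BTreeSet::new();
    collect(plan, &mut out);
    out
}

fn collect(plan: &Plan, out: &mut BTreeSet<StatRequest>) {
    match plan {
        Plan::Scan { .. } => {}
        Plan::Filter { table, column, input } => {
            // Assumption: min/max is what a filter would use for pruning.
            out.insert(StatRequest {
                table: table.clone(),
                column: column.clone(),
                kind: StatKind::MinMax,
            });
            collect(input, out);
        }
        Plan::Join { left, right, on } => {
            // Assumption: distinct counts on join keys help join ordering.
            for (table, column) in [&on.0, &on.1] {
                out.insert(StatRequest {
                    table: table.clone(),
                    column: column.clone(),
                    kind: StatKind::DistinctCount,
                });
            }
            collect(left, out);
            collect(right, out);
        }
    }
}
```

   A system-specific layer could then take the returned set and decide how to 
fetch those statistics (prefetching, an external catalog, etc).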
   
   
   
   # Use case for "collect all stats" 
   I think there is a use case for the "nuclear" statistics option of fetching 
them all -- namely if you are willing to pay an upfront cost and want the 
fastest possible queries (looking at you, ClickBench, where the table 
creation/stats collection are not part of the query timings).
   
   Thus, if we make a more incremental statistics collection system, I think we 
should still have some way to collect all the stats up front.
   
   # Upstream API
   
   I suspect we'll need to review the API upstream in the parquet crate to 
allow incremental statistics collection -- I don't think there is an easy way 
to add statistics to an already collected statistics object.
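   
   To make the "add statistics to an already collected object" idea concrete, 
here is a minimal sketch of what an incremental merge could look like. 
`ColumnStats` and `TableStats` are hypothetical types invented for this 
example; they are not the parquet crate's or DataFusion's statistics types, 
and a real design would need to handle exactness, types other than `i64`, and 
so on:

```rust
use std::collections::HashMap;

/// Hypothetical per-column statistics; `None` means "not collected yet".
#[derive(Debug, Clone, PartialEq, Default)]
struct ColumnStats {
    min: Option<i64>,
    max: Option<i64>,
    null_count: Option<u64>,
}

/// Hypothetical table-level statistics that can grow over time.
#[derive(Debug, Default)]
struct TableStats {
    columns: HashMap<String, ColumnStats>,
}

impl TableStats {
    /// Fold newly fetched column statistics into the existing object.
    /// A newly fetched value replaces the old one; fields that are `None`
    /// in the new data leave any previously collected value in place.
    fn extend(&mut self, fetched: impl IntoIterator<Item = (String, ColumnStats)>) {
        for (name, new) in fetched {
            let entry = self.columns.entry(name).or_default();
            entry.min = new.min.or(entry.min);
            entry.max = new.max.or(entry.max);
            entry.null_count = new.null_count.or(entry.null_count);
        }
    }

    /// Report which of the wanted columns have no statistics at all yet.
    fn missing<'a>(&self, wanted: &'a [&'a str]) -> Vec<&'a str> {
        wanted
            .iter()
            .copied()
            .filter(|c| !self.columns.contains_key(*c))
            .collect()
    }
}
```

   With something like this, a query could first ask which columns are 
`missing`, fetch only those, and `extend` the existing statistics, while an 
up-front "collect everything" mode would simply call `extend` once with all 
columns.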


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


