alamb commented on issue #21624: URL: https://github.com/apache/datafusion/issues/21624#issuecomment-4253609075
I think this is a great idea and very 💯

Some thoughts:

# Two step query:

> One idea is that instead of 1 step of "execute query" we have:

I think it makes sense for core DataFusion to have the analysis and an API to report what statistics the query optimizer could use (e.g. stats on join columns, etc).

I am not sure it makes sense to have an orchestration layer in the core crate to actually fetch those statistics, as I think the mechanisms for that (e.g. prefetching, accessing external catalogs, etc) are likely to be system-specific.

Maybe an API similar to [`PruningPredicate::required_columns`](https://docs.rs/datafusion/latest/datafusion/physical_optimizer/pruning/struct.PruningPredicate.html#method.required_columns)

# Use case for "collect all stats"

I think there is a use case for the "nuclear" statistics option of fetching them all, namely if you are willing to pay an upfront cost and want the fastest possible queries (looking at you ClickBench, where the table creation/stats collection are not part of the query timings).

Thus, if we make a more incremental statistics collection system, I think we should still have some way to collect all the stats up front.

# Upstream API

I suspect we'll need to review the API upstream in the parquet crate to allow incremental statistics collection. I don't think there is an easy way to add statistics to an already collected statistics object.
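
To make the "report what statistics the optimizer could use" idea concrete, here is a minimal sketch in plain Rust. Nothing in it is an existing DataFusion API: `StatKind`, `StatRequest`, `PlanSummary`, `required_statistics`, and `all_statistics` are hypothetical names, and `PlanSummary` is a toy stand-in for a real plan. It assumes join keys want distinct and null counts and filter columns want min/max, which is only one plausible policy. Actually fetching the statistics is left to the caller, as suggested above.

```rust
use std::collections::BTreeSet;

/// Hypothetical: kinds of statistics an optimizer might ask for.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum StatKind {
    Min,
    Max,
    NullCount,
    DistinctCount,
}

/// Hypothetical: a request for one statistic on one column.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
pub struct StatRequest {
    pub table: String,
    pub column: String,
    pub kind: StatKind,
}

/// Hypothetical, heavily simplified stand-in for a query plan:
/// only the (table, column) pairs used as join keys and in filters.
pub struct PlanSummary {
    pub join_keys: Vec<(String, String)>,
    pub filter_columns: Vec<(String, String)>,
}

fn request(table: &str, column: &str, kind: StatKind) -> StatRequest {
    StatRequest {
        table: table.to_string(),
        column: column.to_string(),
        kind,
    }
}

/// Step 1 of a two step flow: report which statistics would be useful.
/// The caller (not this function) decides how and whether to fetch them.
pub fn required_statistics(plan: &PlanSummary) -> BTreeSet<StatRequest> {
    let mut out = BTreeSet::new();
    // Assumed policy: join ordering benefits from cardinality information.
    for (table, column) in &plan.join_keys {
        out.insert(request(table, column, StatKind::DistinctCount));
        out.insert(request(table, column, StatKind::NullCount));
    }
    // Assumed policy: filter selectivity and pruning benefit from min/max.
    for (table, column) in &plan.filter_columns {
        out.insert(request(table, column, StatKind::Min));
        out.insert(request(table, column, StatKind::Max));
    }
    out
}

/// The "collect all stats" option: every kind for every column,
/// for systems willing to pay the cost up front.
pub fn all_statistics(tables: &[(String, Vec<String>)]) -> BTreeSet<StatRequest> {
    let kinds = [
        StatKind::Min,
        StatKind::Max,
        StatKind::NullCount,
        StatKind::DistinctCount,
    ];
    let mut out = BTreeSet::new();
    for (table, columns) in tables {
        for column in columns {
            for kind in kinds {
                out.insert(request(table, column, kind));
            }
        }
    }
    out
}
```

Returning a set of requests rather than the statistics themselves keeps prefetching and catalog access out of the core crate, and the same request type serves both the incremental and the collect-everything paths.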
