asolimando commented on issue #20852: URL: https://github.com/apache/datafusion/issues/20852#issuecomment-4033858942
It would be great to have a way to compute statistics programmatically, but the cost of computing them generally pays off only when amortized over multiple runs, which requires persisting them. Unless we have a way to do that, the benefit would be limited.

One possible use case that doesn't require persistence is query replay: run a sub-sample of production traffic with this command enabled, note the runtime difference, and use that to discover good candidates for statistics pre-computation. A more standard technique for discovering candidates is to look at the query catalog/history: for NDV, columns appearing in join conditions, `GROUP BY` clauses, and `DISTINCT` operations are natural candidates.

On the persistence side, for Parquet-backed tables, one option is to embed computed statistics directly in the files using the user-defined Parquet indexes technique: https://datafusion.apache.org/blog/2025/07/14/user-defined-parquet-indexes/

Stats travel with the data, no external store is needed, and other readers simply ignore the extra bytes. This wouldn't cover non-Parquet sources and it would require write permission, but it would be a start.

WDYT?
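To make the query-history idea concrete, here is a rough sketch of how NDV candidates could be tallied from a log of SQL strings. Everything in it is an assumption for illustration: `QUERY_HISTORY` and `ndv_candidates` are hypothetical names, the regexes only handle simple `ON a = b`, `GROUP BY`, and `SELECT DISTINCT` forms, and table aliases are not resolved. A real implementation would more likely walk the parsed logical plans.

```python
import re
from collections import Counter

# Hypothetical query history; in practice this would come from a query log or catalog.
QUERY_HISTORY = [
    "SELECT o.id FROM orders o JOIN customers c ON o.customer_id = c.id",
    "SELECT region, count(*) FROM orders GROUP BY region",
    "SELECT DISTINCT region FROM orders",
]

# A bare or qualified column name such as `region` or `o.customer_id`.
IDENT = r"[A-Za-z_][\w.]*"


def ndv_candidates(queries):
    """Count how often each column shows up in a join condition,
    a GROUP BY clause, or a SELECT DISTINCT list (regex heuristic only)."""
    counts = Counter()
    for q in queries:
        # Equi-join conditions of the simple form `ON a = b`.
        for left, right in re.findall(
            rf"\bON\s+({IDENT})\s*=\s*({IDENT})", q, flags=re.I
        ):
            counts.update([left.lower(), right.lower()])

        # GROUP BY column list, up to the next clause keyword or end of query.
        for cols in re.findall(
            r"\bGROUP\s+BY\s+(.+?)(?=\bHAVING\b|\bORDER\b|\bLIMIT\b|;|$)",
            q,
            flags=re.I | re.S,
        ):
            counts.update(c.strip().lower() for c in cols.split(",") if c.strip())

        # SELECT DISTINCT column list.
        for cols in re.findall(
            r"\bSELECT\s+DISTINCT\s+(.+?)\s+FROM\b", q, flags=re.I | re.S
        ):
            counts.update(c.strip().lower() for c in cols.split(",") if c.strip())
    return counts


if __name__ == "__main__":
    # Columns with the highest counts are the first candidates for NDV pre-computation.
    for column, hits in ndv_candidates(QUERY_HISTORY).most_common():
        print(column, hits)
```

Ranking by raw count is the simplest possible heuristic; weighting by query frequency or observed runtime would be a natural refinement.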
