asolimando commented on issue #20852:
URL: https://github.com/apache/datafusion/issues/20852#issuecomment-4033858942

   It would be great to have a way to compute statistics programmatically, but 
the price of computing them generally pays off only if amortized over multiple 
runs, which requires persisting them.
   
   So unless we have a way to do that, the benefit would be limited.
   
   One possible use case that does not require persistence is to replay a 
sub-sample of production traffic with this command enabled and compare 
runtimes against the baseline, which makes it easy to discover good 
candidates for statistics pre-computation.
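   
   A minimal sketch of that comparison, assuming the replay harness yields 
per-query runtimes for both configurations. The function name, the input 
shape, and the `min_speedup` threshold are all hypothetical, not part of 
DataFusion:

```python
from statistics import median


def rank_candidates(baseline, with_stats, min_speedup=1.2):
    """Rank replayed queries by median speedup when statistics are computed.

    Both arguments map a query id to a list of observed runtimes in seconds
    (hypothetical input shape). Queries missing from either run are skipped.
    """
    ranked = []
    for query_id, base_times in baseline.items():
        stats_times = with_stats.get(query_id)
        if not base_times or not stats_times:
            continue
        stats_median = median(stats_times)
        if stats_median <= 0:
            continue
        speedup = median(base_times) / stats_median
        # Keep only queries where statistics made a clear difference.
        if speedup >= min_speedup:
            ranked.append((query_id, speedup))
    # Largest speedup first: these are the best pre-computation candidates.
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked
```

   Using medians rather than means keeps a single slow outlier run from 
dominating the ranking.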
   
   In that space, a more standard technique for discovering candidates is to 
look at the query catalog/history: for NDV, columns appearing in join 
conditions, `GROUP BY` clauses, and `DISTINCT` operations are natural 
candidates.
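   
   As a rough illustration of mining the query history, here is a 
regex-based sketch that counts how many queries use each column in a join 
equality, a `GROUP BY` list, or a `SELECT DISTINCT` list. It is a heuristic 
under stated assumptions: a real implementation would use a proper SQL 
parser, and `ndv_candidates` is a hypothetical helper that ignores 
expressions, subqueries, aliases, and `COUNT(DISTINCT ...)`:

```python
import re
from collections import Counter

# Plain or table-qualified identifier, e.g. "region" or "o.customer_id".
_IDENT = r"[A-Za-z_][A-Za-z0-9_]*(?:\.[A-Za-z_][A-Za-z0-9_]*)?"

# Simple equi-join condition: ON a.x = b.y
_JOIN_ON = re.compile(rf"\bON\s+({_IDENT})\s*=\s*({_IDENT})", re.IGNORECASE)
# GROUP BY list, up to the next clause or the end of the statement.
_GROUP_BY = re.compile(
    r"\bGROUP\s+BY\s+(.+?)(?=\s+(?:HAVING|ORDER\s+BY|LIMIT)\b|;|\)|$)",
    re.IGNORECASE | re.DOTALL,
)
# SELECT DISTINCT list, up to FROM.
_DISTINCT = re.compile(
    r"\bSELECT\s+DISTINCT\s+(.+?)\s+FROM\b", re.IGNORECASE | re.DOTALL
)


def ndv_candidates(queries):
    """Count, per column, the number of queries that use it in a join
    equality, GROUP BY list, or SELECT DISTINCT list."""
    counts = Counter()
    for sql in queries:
        found = set()
        for left, right in _JOIN_ON.findall(sql):
            found.update((left.lower(), right.lower()))
        for clause in _GROUP_BY.findall(sql) + _DISTINCT.findall(sql):
            for item in clause.split(","):
                item = item.strip()
                # Only bare column references; expressions are skipped.
                if re.fullmatch(_IDENT, item):
                    found.add(item.lower())
        counts.update(found)
    return counts
```

   Calling `ndv_candidates(history).most_common(10)` would then surface the 
most frequently used columns as the first candidates for NDV collection.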
   
   On the persistence side, for Parquet-backed tables, one option is to embed 
computed statistics directly in the files using the user-defined Parquet 
indexes technique: 
https://datafusion.apache.org/blog/2025/07/14/user-defined-parquet-indexes/
   
   Stats travel with the data, no external store is needed, and other readers 
simply ignore the extra bytes. This wouldn't cover non-Parquet sources, and it 
would require write permission on the files, but it would be a start.
   
   WDYT?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

