Re: [PR] Add ExpressionAnalyzer for pluggable expression-level statistics estimation [datafusion]

via GitHub Thu, 28 May 2026 11:52:50 -0700


asolimando commented on PR #21122:
URL: https://github.com/apache/datafusion/pull/21122#issuecomment-4567251239


   > Thank you — this is a clever approach for enabling external overrides.
   > 
   > I have a suggestion on how to structure this work:
   > 
   > The main motivation for this PR should be to allow external 
`ExpressionAnalyzer`s to override the internal ones. For example, we may have a 
naive internal implementation for estimating the NDV of the `regexp_like()` 
function, while downstream use cases may want to override that estimation with 
a more accurate one.
   > 
   > In that case, I think only the override implementations should live inside 
`ExpressionAnalyzerRegistry`, and we may not need to implement 
`DefaultExpressionAnalyzer`. For internal expressions/UDFs, we can first extend 
the `PhysicalExpr` trait, so their statistics estimation logic be implemented 
inside the trait.
   > 
   > The reasons are:
   > 
   > * To cover common built-in expressions/UDFs, `DefaultExpressionAnalyzer` 
would likely need large match arms covering hundreds of variants, which could 
increase binary size.
   > * Function implementation details would be scattered across multiple 
places: mostly in the function trait, but partly inside `ExpressionAnalyzer`. I 
think this would make the code harder to maintain.
   > 
   > So I suggest we first extend the `PhysicalExpr` trait to support 
expression-level statistics propagation, implement the propagation logic for 
some common expressions there, and then introduce the external 
`ExpressionAnalyzerRegistry` mechanism as a follow-up.
   
   @2010YOUY01, thank you for the detailed feedback, I agree with the direction.
   
   Would you be fine tackling the `PhysicalExpr` extension as a separate PR 
(again part of https://github.com/apache/datafusion/issues/21120)? I would then 
rebase this PR on top of it and #21815.
   
   At that point AND/OR/NOT can recurse through the same `StatisticsContext` 
from #21815, with `ExpressionAnalyzerRegistry` as a field on it. One context 
flows from planner through operators down to expressions, custom overrides on 
children are picked up naturally, no injection needed (as also suggested by 
@alamb).
   
   So the proposed order would be:
   1. #21815
   2. Extension to `PhysicalExpr` (new trait methods, default implementation 
returns `None`, non-breaking)
   3. Resuming this PR and address pending points:
   - adding an end to end, motivating example
   - making the default implementation backwards compatible
   - cleanup post #21815 (avoid extra visits/embedding into operators)
   - and the changes we just discussed
   
   Regarding unifying the 4 methods into a single `analyze()` returning a 
struct/enum - I think keeping them separate works better with the 
`PhysicalExpr` approach: each expression overrides only the stats it knows 
(default `None`), and callers ask for exactly what they need. A single method 
would either force the analyzer to compute all stats even when the caller needs 
just one (e.g., NDV for a projection, selectivity for a filter), or we'd need a 
request parameter to specify what's wanted. Stats computation can be expensive 
for custom analyzers (sampling, external lookups), so lazy per-method 
evaluation seems safer.
   
   WDYT?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Add ExpressionAnalyzer for pluggable expression-level statistics estimation [datafusion]

Reply via email to