asolimando commented on PR #21122: URL: https://github.com/apache/datafusion/pull/21122#issuecomment-4567251239
> Thank you — this is a clever approach for enabling external overrides. > > I have a suggestion on how to structure this work: > > The main motivation for this PR should be to allow external `ExpressionAnalyzer`s to override the internal ones. For example, we may have a naive internal implementation for estimating the NDV of the `regexp_like()` function, while downstream use cases may want to override that estimation with a more accurate one. > > In that case, I think only the override implementations should live inside `ExpressionAnalyzerRegistry`, and we may not need to implement `DefaultExpressionAnalyzer`. For internal expressions/UDFs, we can first extend the `PhysicalExpr` trait, so their statistics estimation logic be implemented inside the trait. > > The reasons are: > > * To cover common built-in expressions/UDFs, `DefaultExpressionAnalyzer` would likely need large match arms covering hundreds of variants, which could increase binary size. > * Function implementation details would be scattered across multiple places: mostly in the function trait, but partly inside `ExpressionAnalyzer`. I think this would make the code harder to maintain. > > So I suggest we first extend the `PhysicalExpr` trait to support expression-level statistics propagation, implement the propagation logic for some common expressions there, and then introduce the external `ExpressionAnalyzerRegistry` mechanism as a follow-up. @2010YOUY01, thank you for the detailed feedback, I agree with the direction. Would you be fine tackling the `PhysicalExpr` extension as a separate PR (again part of https://github.com/apache/datafusion/issues/21120)? I would then rebase this PR on top of it and #21815. At that point AND/OR/NOT can recurse through the same `StatisticsContext` from #21815, with `ExpressionAnalyzerRegistry` as a field on it. One context flows from planner through operators down to expressions, custom overrides on children are picked up naturally, no injection needed (as also suggested by @alamb). So the proposed order would be: 1. #21815 2. Extension to `PhysicalExpr` (new trait methods, default implementation returns `None`, non-breaking) 3. Resuming this PR and address pending points: - adding an end to end, motivating example - making the default implementation backwards compatible - cleanup post #21815 (avoid extra visits/embedding into operators) - and the changes we just discussed Regarding unifying the 4 methods into a single `analyze()` returning a struct/enum - I think keeping them separate works better with the `PhysicalExpr` approach: each expression overrides only the stats it knows (default `None`), and callers ask for exactly what they need. A single method would either force the analyzer to compute all stats even when the caller needs just one (e.g., NDV for a projection, selectivity for a filter), or we'd need a request parameter to specify what's wanted. Stats computation can be expensive for custom analyzers (sampling, external lookups), so lazy per-method evaluation seems safer. WDYT? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
