alamb commented on code in PR #21122:
URL: https://github.com/apache/datafusion/pull/21122#discussion_r3204292651
##########
datafusion/core/src/physical_planner.rs:
##########
@@ -1097,12 +1098,12 @@ impl DefaultPhysicalPlanner {
input_schema.as_arrow(),
)? {
PlanAsyncExpr::Sync(PlannedExprResult::Expr(runtime_expr))
=> {
- FilterExecBuilder::new(
+ let builder = FilterExecBuilder::new(
Review Comment:
Nit: these changes seem unrelated to the rest of this PR.
##########
datafusion/physical-expr/src/projection.rs:
##########
@@ -125,12 +126,22 @@ impl From<ProjectionExpr> for (Arc<dyn PhysicalExpr>,
String) {
///
/// See [`ProjectionExprs::from_indices`] to select a subset of columns by
/// indices.
-#[derive(Debug, Clone, PartialEq, Eq)]
+#[derive(Debug, Clone)]
pub struct ProjectionExprs {
/// [`Arc`] used for a cheap clone, which improves physical plan
optimization performance.
exprs: Arc<[ProjectionExpr]>,
+ /// Optional expression analyzer registry for statistics estimation
Review Comment:
I think this is basically the same thing @xudong963 is saying in this
comment:
-
https://github.com/apache/datafusion/pull/21122#pullrequestreview-4145718306
##########
datafusion/common/src/config.rs:
##########
@@ -1131,6 +1131,11 @@ config_namespace! {
/// So if you disable `enable_topk_dynamic_filter_pushdown`, then
enable `enable_dynamic_filter_pushdown`, the
`enable_topk_dynamic_filter_pushdown` will be overridden.
pub enable_dynamic_filter_pushdown: bool, default = true
+ /// When set to true, the pluggable `ExpressionAnalyzerRegistry` from
+ /// `SessionState` is used for expression-level statistics estimation
+ /// (NDV, selectivity, min/max, null fraction) in physical plan
operators.
+ pub use_expression_analyzer: bool, default = false
Review Comment:
I wonder why we need a new flag. In an ideal world, we would add the new
extension API and then refactor the existing code to use it, while keeping
the existing behavior.
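To illustrate the "no flag" idea, here is a minimal sketch in which the registry always exists and ships with a built-in analyzer that reproduces the current behavior, so that user analyzers only take priority over it. Everything here is hypothetical: the `ExpressionAnalyzer` trait shape, `BuiltinAnalyzer`, `EqualityAnalyzer`, the `register` method, the string predicates, and the selectivity numbers are placeholders, not DataFusion's actual API.

```rust
/// Hypothetical extension point; not the real DataFusion trait.
trait ExpressionAnalyzer {
    /// Returns `None` when this analyzer has no opinion on the predicate.
    fn selectivity(&self, predicate: &str) -> Option<f64>;
}

/// Stand-in for the existing built-in heuristics (placeholder value).
struct BuiltinAnalyzer;

impl ExpressionAnalyzer for BuiltinAnalyzer {
    fn selectivity(&self, _predicate: &str) -> Option<f64> {
        Some(0.2)
    }
}

/// Example of a user-provided analyzer that only handles equality.
struct EqualityAnalyzer;

impl ExpressionAnalyzer for EqualityAnalyzer {
    fn selectivity(&self, predicate: &str) -> Option<f64> {
        if predicate.contains(" = ") {
            Some(0.01)
        } else {
            None
        }
    }
}

/// Hypothetical registry: analyzers are consulted in order.
struct ExpressionAnalyzerRegistry {
    analyzers: Vec<Box<dyn ExpressionAnalyzer>>,
}

impl Default for ExpressionAnalyzerRegistry {
    /// The default registry contains only the built-in analyzer, so a
    /// session that registers nothing keeps the existing behavior.
    fn default() -> Self {
        Self {
            analyzers: vec![Box::new(BuiltinAnalyzer)],
        }
    }
}

impl ExpressionAnalyzerRegistry {
    /// Newly registered analyzers are tried before older ones.
    fn register(&mut self, analyzer: Box<dyn ExpressionAnalyzer>) {
        self.analyzers.insert(0, analyzer);
    }

    /// The first analyzer with an opinion wins; 1.0 means "no reduction".
    fn selectivity(&self, predicate: &str) -> f64 {
        self.analyzers
            .iter()
            .find_map(|a| a.selectivity(predicate))
            .unwrap_or(1.0)
    }
}
```

With this shape there is a single code path: an untouched registry answers exactly as the built-in heuristics do, and no `use_expression_analyzer` switch is needed to choose between old and new behavior.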
##########
datafusion/physical-expr/src/projection.rs:
##########
@@ -125,12 +126,22 @@ impl From<ProjectionExpr> for (Arc<dyn PhysicalExpr>,
String) {
///
/// See [`ProjectionExprs::from_indices`] to select a subset of columns by
/// indices.
-#[derive(Debug, Clone, PartialEq, Eq)]
+#[derive(Debug, Clone)]
pub struct ProjectionExprs {
/// [`Arc`] used for a cheap clone, which improves physical plan
optimization performance.
exprs: Arc<[ProjectionExpr]>,
+ /// Optional expression analyzer registry for statistics estimation
Review Comment:
It feels awkward to me that `ProjectionExprs` has an expression analyzer on
it, as that expression analyzer seems like it is really there to be passed into
a call to `StatisticsProvider::compute_statistics`.
In other words, there is one `ExpressionAnalyzerRegistry` per plan, but
putting it on fields makes it look like it could be per plan node. In fact,
this isn't even a plan node (it is a field on a plan node).
I also think that having to set the fields correctly means there is a
real danger that it won't be plumbed through properly in the future.
It seems to me like a better design would be to pass the
`ExpressionAnalyzerRegistry` down at the call site where it is needed. For
example, how about adding it as a parameter to
`StatisticsProvider::compute_statistics`? That would ensure it is always passed
where needed, and it would remove a lot of the boilerplate in this PR.
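A rough sketch of the shape I have in mind follows. All types below are simplified stand-ins invented for illustration (including the `Filter` node, the `estimate_plan` helper, and the `default_selectivity` field); the real `StatisticsProvider::compute_statistics` signature in the PR may differ.

```rust
use std::sync::Arc;

/// Hypothetical stand-in: one registry per plan, owned by the caller.
#[derive(Debug, Default)]
struct ExpressionAnalyzerRegistry {
    /// Toy knob: selectivity applied to every filter predicate.
    default_selectivity: f64,
}

/// Simplified statistics: just an estimated row count.
#[derive(Debug, Clone, PartialEq)]
struct Statistics {
    num_rows: f64,
}

/// Assumed trait shape: the registry is an argument, so an implementation
/// cannot be called without it and nothing has to be stored on the node.
trait StatisticsProvider {
    fn compute_statistics(
        &self,
        input: &Statistics,
        registry: &ExpressionAnalyzerRegistry,
    ) -> Statistics;
}

/// With no registry field, the original derives can stay as they were.
#[derive(Debug, Clone, PartialEq, Eq)]
struct ProjectionExprs {
    #[allow(dead_code)] // unused in this toy example
    exprs: Arc<[String]>,
}

impl StatisticsProvider for ProjectionExprs {
    fn compute_statistics(
        &self,
        input: &Statistics,
        _registry: &ExpressionAnalyzerRegistry,
    ) -> Statistics {
        // A projection does not change the row count.
        input.clone()
    }
}

/// Toy filter node that consults the registry it is handed.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Filter;

impl StatisticsProvider for Filter {
    fn compute_statistics(
        &self,
        input: &Statistics,
        registry: &ExpressionAnalyzerRegistry,
    ) -> Statistics {
        Statistics {
            num_rows: input.num_rows * registry.default_selectivity,
        }
    }
}

/// The caller threads the single per-plan registry through every node.
fn estimate_plan(
    input: &Statistics,
    registry: &ExpressionAnalyzerRegistry,
) -> Statistics {
    let filter = Filter;
    let projection = ProjectionExprs {
        exprs: Arc::from(vec!["a".to_string()]),
    };
    let nodes: [&dyn StatisticsProvider; 2] = [&filter, &projection];
    nodes
        .iter()
        .fold(input.clone(), |stats, node| {
            node.compute_statistics(&stats, registry)
        })
}
```

Because the registry is a required argument, forgetting to plumb it through becomes a compile error rather than a silently unset field.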
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]