fwojciec opened a new issue, #20388: URL: https://github.com/apache/datafusion/issues/20388
### Describe the bug

Hi, this is my first issue here! 👋 Thank you for this awesome project - I have so much fun building on top of what you've built here.

I ran into this while building [quiver](https://github.com/fwojciec/quiver), in the context of joining a combination of Postgres tables and views (custom `TableProvider`) to a large Parquet table. A query that should have taken ~7s was taking ~52s - the physical plan showed `Partitioned` join mode and a CASE-form dynamic filter that `PruningPredicate` couldn't evaluate, resulting in no row groups pruned. After fixing the statistics issue described below, the joins selected `CollectLeft`, the dynamic filter became a simple IN-list with range bounds, and 354/362 row groups were pruned.

**Disclosure**

I used Claude (Opus 4.6) to help trace through the statistics propagation chain and identify the root cause; the finding was independently confirmed by Codex 5.3. I understand the issue end-to-end and have a provisional fix in quiver that restored correct `CollectLeft` join selection and row group pruning (52s -> 7.7s performance improvement on my test query).

**The problem**

When a `FilterExec` sits above a table with no column-level statistics (all `Precision::Absent` - common for database views or unanalyzed tables), `collect_new_statistics` converts the absent min/max into `Precision::Exact(ScalarValue::Int32(None))`. This happens because the interval analysis represents absent bounds as NULL `ScalarValue` endpoints, and `collect_new_statistics` doesn't distinguish "NULL meaning unbounded" from "a known equal value" - since `NULL == NULL`, the `lower.eq(&upper)` branch fires and wraps both in `Precision::Exact`. Downstream, `estimate_disjoint_inputs` treats these `Exact(NULL)` values as real bounds, concludes the join inputs are disjoint, and returns zero cardinality.
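
To make the collapse described above concrete, here is a minimal standalone sketch. It does not use DataFusion's real types: the local `Precision` enum, the `Bound` alias (where `None` stands in for a NULL `ScalarValue` endpoint) and the `classify_bounds` function are all hypothetical stand-ins that only model the `lower.eq(&upper)` branch.

```rust
// Simplified stand-ins for illustration only; these are NOT DataFusion's types.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum Precision<T> {
    Exact(T),
    Inexact(T),
    Absent,
}

/// `None` plays the role of a NULL `ScalarValue`, which the interval
/// analysis uses to mean "unbounded on this side".
type Bound = Option<i32>;

/// Hypothetical model of the equality branch: endpoints that compare
/// equal are reported as exact, everything else as inexact.
fn classify_bounds(lower: Bound, upper: Bound) -> (Precision<Bound>, Precision<Bound>) {
    if lower == upper {
        (Precision::Exact(lower), Precision::Exact(upper))
    } else {
        (Precision::Inexact(lower), Precision::Inexact(upper))
    }
}

fn main() {
    // A genuine point interval (e.g. after `a = 42`): Exact is the right answer.
    println!("{:?}", classify_bounds(Some(42), Some(42)));
    // Unbounded on both sides: None == None, so the same branch fires and
    // "no information" is reported as an exact NULL bound.
    println!("{:?}", classify_bounds(None, None));
}
```

In this model the unbounded case is indistinguishable from a known point value, which is the confusion the issue describes.
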
The zero-cardinality estimate forces `Partitioned` join mode, which produces CASE-form dynamic filters that `PruningPredicate` can't evaluate, disabling row group pruning entirely on the probe side.

### To Reproduce

The easiest way to reproduce is with a unit test against the DataFusion codebase:

```diff
diff --git a/datafusion/physical-plan/src/filter.rs b/datafusion/physical-plan/src/filter.rs
index 2af0731fb..75501b783 100644
--- a/datafusion/physical-plan/src/filter.rs
+++ b/datafusion/physical-plan/src/filter.rs
@@ -2053,4 +2053,43 @@ mod tests {
 
         Ok(())
     }
+
+    /// Regression test: columns with Absent min/max statistics must remain
+    /// Absent after FilterExec, not be converted to Exact(NULL). The latter
+    /// causes `estimate_disjoint_inputs` to incorrectly conclude join inputs
+    /// are disjoint (ScalarValue's PartialOrd sorts NULLs last), producing
+    /// zero cardinality and forcing Partitioned join mode.
+    #[tokio::test]
+    async fn test_filter_statistics_absent_columns_stay_absent() -> Result<()> {
+        let schema = Schema::new(vec![
+            Field::new("a", DataType::Int32, false),
+            Field::new("b", DataType::Int32, false),
+        ]);
+        let input = Arc::new(StatisticsExec::new(
+            Statistics {
+                num_rows: Precision::Inexact(1000),
+                total_byte_size: Precision::Absent,
+                column_statistics: vec![
+                    ColumnStatistics::default(),
+                    ColumnStatistics::default(),
+                ],
+            },
+            schema.clone(),
+        ));
+
+        let predicate = Arc::new(BinaryExpr::new(
+            Arc::new(Column::new("a", 0)),
+            Operator::Eq,
+            Arc::new(Literal::new(ScalarValue::Int32(Some(42)))),
+        ));
+        let filter: Arc<dyn ExecutionPlan> =
+            Arc::new(FilterExec::try_new(predicate, input)?);
+
+        let statistics = filter.partition_statistics(None)?;
+        let col_b_stats = &statistics.column_statistics[1];
+        assert_eq!(col_b_stats.min_value, Precision::Absent);
+        assert_eq!(col_b_stats.max_value, Precision::Absent);
+
+        Ok(())
+    }
 }
```

This test currently fails with:

```
assertion `left == right` failed
  left: Exact(Int32(NULL))
 right: Absent
```

### Expected behavior

`collect_new_statistics` should map NULL interval bounds back to `Precision::Absent`, not wrap them in `Precision::Exact`.

### Additional context

I have a working fix for this (and the regression test above) - happy to submit as a PR if the approach looks right. Wanted to file the issue first to check.
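
For illustration, the expected behavior could be sketched roughly as follows. This is a simplified model and not the actual patch: the local `Precision` enum, the `Bound` alias (`None` standing in for a NULL `ScalarValue` endpoint) and both helper functions are hypothetical names introduced only for this example.

```rust
// Simplified stand-ins for illustration only; these are NOT DataFusion's types.
#[derive(Debug, Clone, PartialEq)]
enum Precision<T> {
    Exact(T),
    Inexact(T),
    Absent,
}

/// `None` models a NULL (unbounded) interval endpoint.
type Bound = Option<i32>;

/// Hypothetical helper: converts one endpoint into a statistic, mapping an
/// unbounded endpoint back to `Absent` instead of wrapping the NULL.
fn bound_to_precision(bound: Bound, is_point: bool) -> Precision<i32> {
    match bound {
        None => Precision::Absent,
        Some(v) if is_point => Precision::Exact(v),
        Some(v) => Precision::Inexact(v),
    }
}

/// Hypothetical guarded variant of the min/max classification.
fn classify_bounds_guarded(lower: Bound, upper: Bound) -> (Precision<i32>, Precision<i32>) {
    // Only two known, equal endpoints count as an exact point value.
    let is_point = lower.is_some() && lower == upper;
    (
        bound_to_precision(lower, is_point),
        bound_to_precision(upper, is_point),
    )
}

fn main() {
    // No information on either side stays "no information".
    println!("{:?}", classify_bounds_guarded(None, None));
    // A real point interval is still reported as exact.
    println!("{:?}", classify_bounds_guarded(Some(42), Some(42)));
    // A half-open interval keeps the known side and drops the unknown one.
    println!("{:?}", classify_bounds_guarded(Some(1), None));
}
```

Handling each endpoint separately also covers half-bounded intervals, where only one side is NULL; whether the real fix needs that case is an assumption of this sketch.
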
