fwojciec opened a new issue, #20388:
URL: https://github.com/apache/datafusion/issues/20388

   ### Describe the bug
   
   Hi, this is my first issue here! 👋 
   Thank you for this awesome project - I have so much fun building on top of 
what you've built here.
   
   I ran into this while building [quiver](https://github.com/fwojciec/quiver), 
in the context of joining a combination of Postgres tables and views (custom 
`TableProvider`) to a large Parquet table. A query that should have taken ~7s 
was taking ~52s - the physical plan showed `Partitioned` join mode and a 
CASE-form dynamic filter that `PruningPredicate` couldn't evaluate, resulting 
in no row groups pruned. After fixing the statistics issue described below, the 
joins selected `CollectLeft`, the dynamic filter became a simple IN-list with 
range bounds, and 354/362 row groups were pruned.
   
   **Disclosure**
   
   I used Claude (Opus 4.6) to help trace through the statistics propagation 
chain and identify the root cause; the finding was independently confirmed by 
Codex 5.3. I understand the issue end-to-end and have a provisional fix in 
quiver that restored correct CollectLeft join selection and row group pruning 
(52s -> 7.7s performance improvement on my test query).
   
   **The problem**
   
   When a `FilterExec` sits above a table with no column-level statistics (all 
`Precision::Absent` - common for database views or unanalyzed tables), 
`collect_new_statistics` converts the absent min/max into 
`Precision::Exact(ScalarValue::Int32(None))`.
   
   This happens because the interval analysis represents absent bounds as NULL 
`ScalarValue` endpoints, and `collect_new_statistics` doesn't distinguish "NULL 
meaning unbounded" from "a known equal value" - since `NULL == NULL`, the 
`lower.eq(&upper)` branch fires and wraps both in `Precision::Exact`.
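
The equality pitfall described above can be reproduced in isolation. The following is a minimal sketch, not DataFusion's code: `Precision` is a simplified stand-in for the real enum, `Bound` (an `Option<i32>`) is a hypothetical model of a typed scalar whose `None` plays the role of a NULL endpoint, and `naive_min_max` is an illustrative name for the "equal endpoints mean one known value" logic.

```rust
/// Simplified stand-in for DataFusion's `Precision` (assumption: only the
/// three variants matter for this illustration).
#[derive(Debug, Clone, PartialEq)]
#[allow(dead_code)]
enum Precision<T> {
    Exact(T),
    Inexact(T),
    Absent,
}

/// Hypothetical model of a typed scalar: `None` stands for a NULL endpoint,
/// which interval analysis uses to mean "unbounded".
type Bound = Option<i32>;

/// Naive conversion of interval endpoints into min/max statistics.
/// Equal endpoints are treated as a single known value. Because
/// `None == None`, a fully unbounded interval also takes the `Exact` branch.
fn naive_min_max(lower: Bound, upper: Bound) -> (Precision<Bound>, Precision<Bound>) {
    if lower == upper {
        (Precision::Exact(lower), Precision::Exact(upper))
    } else {
        (Precision::Inexact(lower), Precision::Inexact(upper))
    }
}
```

In this model, `naive_min_max(None, None)` yields `Exact(None)` for both min and max, which mirrors the `Exact(Int32(NULL))` value seen in the failing assertion.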
   
   Downstream, `estimate_disjoint_inputs` treats these `Exact(NULL)` values as 
real bounds, concludes the join inputs are disjoint, and returns zero 
cardinality. This forces `Partitioned` join mode, which produces CASE-form 
dynamic filters that `PruningPredicate` can't evaluate, disabling row group 
pruning entirely on the probe side.
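
To illustrate why an `Exact(NULL)` bound is worse than an absent one, here is a simplified sketch of a disjointness check. It is an assumption-laden model, not the body of `estimate_disjoint_inputs`: `ColumnRange`, `strictly_below`, and `looks_disjoint` are hypothetical names, and NULL ordering follows Rust's `Option` (where `None` sorts before `Some`), which may differ from `ScalarValue`'s ordering. The outcome does not depend on the direction: whichever end NULL sorts to, it compares strictly outside the other side's real range.

```rust
/// Simplified stand-in for DataFusion's `Precision`.
#[derive(Debug, Clone, PartialEq)]
#[allow(dead_code)]
enum Precision<T> {
    Exact(T),
    Inexact(T),
    Absent,
}

impl<T> Precision<T> {
    /// Returns the wrapped value, or `None` when the statistic is absent.
    fn get_value(&self) -> Option<&T> {
        match self {
            Precision::Exact(v) | Precision::Inexact(v) => Some(v),
            Precision::Absent => None,
        }
    }
}

/// Hypothetical model of a typed scalar; `None` stands for NULL.
type Bound = Option<i32>;

/// Hypothetical per-column min/max statistics.
struct ColumnRange {
    min: Precision<Bound>,
    max: Precision<Bound>,
}

/// True only when both statistics are present and `hi` sorts before `lo`.
fn strictly_below(hi: &Precision<Bound>, lo: &Precision<Bound>) -> bool {
    matches!((hi.get_value(), lo.get_value()), (Some(h), Some(l)) if h < l)
}

/// Reports the two ranges as disjoint when one ends before the other begins.
/// Absent statistics never trigger this, but a NULL wrapped in `Exact` does,
/// because it is compared as if it were a real bound.
fn looks_disjoint(left: &ColumnRange, right: &ColumnRange) -> bool {
    strictly_below(&left.max, &right.min) || strictly_below(&right.max, &left.min)
}
```

With a left side of `Exact(None)` bounds and a right side of `Exact(Some(1))..Exact(Some(100))`, this model reports the inputs as disjoint, whereas the same left side expressed as `Absent` does not.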
   
   ### To Reproduce
   
   The easiest way to reproduce is with a unit test against the DataFusion 
codebase:
   
   ```rust
   diff --git a/datafusion/physical-plan/src/filter.rs b/datafusion/physical-plan/src/filter.rs
   index 2af0731fb..75501b783 100644
   --- a/datafusion/physical-plan/src/filter.rs
   +++ b/datafusion/physical-plan/src/filter.rs
   @@ -2053,4 +2053,43 @@ mod tests {
    
            Ok(())
        }
   +
   +    /// Regression test: columns with Absent min/max statistics must remain
   +    /// Absent after FilterExec, not be converted to Exact(NULL). The latter
   +    /// causes `estimate_disjoint_inputs` to incorrectly conclude join inputs
   +    /// are disjoint (ScalarValue's PartialOrd sorts NULLs last), producing
   +    /// zero cardinality and forcing Partitioned join mode.
   +    #[tokio::test]
   +    async fn test_filter_statistics_absent_columns_stay_absent() -> Result<()> {
   +        let schema = Schema::new(vec![
   +            Field::new("a", DataType::Int32, false),
   +            Field::new("b", DataType::Int32, false),
   +        ]);
   +        let input = Arc::new(StatisticsExec::new(
   +            Statistics {
   +                num_rows: Precision::Inexact(1000),
   +                total_byte_size: Precision::Absent,
   +                column_statistics: vec![
   +                    ColumnStatistics::default(),
   +                    ColumnStatistics::default(),
   +                ],
   +            },
   +            schema.clone(),
   +        ));
   +
   +        let predicate = Arc::new(BinaryExpr::new(
   +            Arc::new(Column::new("a", 0)),
   +            Operator::Eq,
   +            Arc::new(Literal::new(ScalarValue::Int32(Some(42)))),
   +        ));
   +        let filter: Arc<dyn ExecutionPlan> =
   +            Arc::new(FilterExec::try_new(predicate, input)?);
   +
   +        let statistics = filter.partition_statistics(None)?;
   +        let col_b_stats = &statistics.column_statistics[1];
   +        assert_eq!(col_b_stats.min_value, Precision::Absent);
   +        assert_eq!(col_b_stats.max_value, Precision::Absent);
   +
   +        Ok(())
   +    }
    }
   ```
   
   This test currently fails with:
   
   ```
   assertion `left == right` failed
     left: Exact(Int32(NULL))
     right: Absent
   ```
   
   ### Expected behavior
   
   `collect_new_statistics` should map NULL interval bounds back to 
`Precision::Absent`, not wrap them in `Precision::Exact`.
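
One way to express that mapping is sketched below. This is an illustration under stated assumptions rather than the actual patch: `Precision` is a simplified stand-in, `Bound` (an `Option<i32>`) is a hypothetical model of a scalar whose `None` means NULL, and `endpoint_to_precision` and `min_max` are made-up helper names.

```rust
/// Simplified stand-in for DataFusion's `Precision`.
#[derive(Debug, Clone, PartialEq)]
enum Precision<T> {
    Exact(T),
    Inexact(T),
    Absent,
}

/// Hypothetical model of a typed scalar; `None` stands for a NULL endpoint.
type Bound = Option<i32>;

/// Maps one interval endpoint to a statistic. A NULL endpoint means
/// "unbounded", so it becomes `Absent` regardless of exactness.
fn endpoint_to_precision(endpoint: Bound, is_exact: bool) -> Precision<Bound> {
    match endpoint {
        None => Precision::Absent,
        Some(_) if is_exact => Precision::Exact(endpoint),
        Some(_) => Precision::Inexact(endpoint),
    }
}

/// Converts interval endpoints into (min, max) statistics. The interval is
/// only considered exact when both endpoints are non-NULL and equal.
fn min_max(lower: Bound, upper: Bound) -> (Precision<Bound>, Precision<Bound>) {
    let is_exact = lower.is_some() && lower == upper;
    (
        endpoint_to_precision(lower, is_exact),
        endpoint_to_precision(upper, is_exact),
    )
}
```

Under this sketch, `min_max(None, None)` produces `(Absent, Absent)`, while a single-point interval such as `min_max(Some(42), Some(42))` still produces `Exact` bounds.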
   
   ### Additional context
   
   I have a working fix for this (and the regression test above) and am happy 
to submit it as a PR if the approach looks right. I wanted to file the issue 
first to check.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

