kosiew opened a new issue, #22666:
URL: https://github.com/apache/datafusion/issues/22666

   
   ## Summary
   Grouped `first_value` and `last_value` currently apply aggregate `FILTER` 
using only `BooleanArray::value(idx)`, without checking predicate validity. 
This allows rows where the filter predicate is `NULL` to be treated as passing 
when the underlying value bit is set.
   
   Under SQL aggregate `FILTER` semantics, rows pass only when the filter 
evaluates to `TRUE` (`Some(true)`), and must be excluded when it is `FALSE` or 
`NULL`.
   
   ## Affected Area
   - File: `datafusion/functions-aggregate/src/first_last.rs`
   - Function: `FirstLastGroupsAccumulator::get_filtered_extreme_of_each_group`
   - Current predicate:
     - `let passed_filter = opt_filter.is_none_or(|x| x.value(idx_in_val));`
   
   ## Problem Statement
   The grouped path for `first_value`/`last_value` computes `passed_filter` 
from `BooleanArray::value()` only. For nullable boolean arrays, this can treat 
`NULL` rows as passing if the value bitmap bit is `1`, even though the row 
should be excluded.
   
   This violates SQL semantics for aggregate `FILTER` and can produce incorrect 
non-NULL aggregate results when all rows should be filtered out.
   
   ## Reproduction
   ### SQL
   ```sql
   SELECT
     g,
     first_value(a ORDER BY a) FILTER (WHERE b < 1) AS fv
   FROM (
     VALUES
       (0, 10, CAST(NULL AS INT)),
       (0, 20, 2)
   ) AS t(g, a, b)
   GROUP BY g;
   ```
   
   ### Observed Result
   - `fv = 10`
   
   ### Expected Result
   - `fv = NULL`
   
   Reason:
   - Row `(0, 10, NULL)` has `b < 1` = `NULL` and must not pass `FILTER`.
   - Row `(0, 20, 2)` has `b < 1` = `FALSE` and must not pass `FILTER`.
   - No rows satisfy `Some(true)`, so grouped `first_value` should return 
`NULL`.
   
   ## Root Cause
   In grouped `first_last` processing, filter pass logic does not encode the 
invariant:
   - row passes aggregate filter iff predicate is `Some(true)`
   
   Instead, it checks only `value(idx)`, which ignores nullability at that row.
   
   ## Proposed Fix
   Update grouped filter evaluation in `get_filtered_extreme_of_each_group` to 
require both validity and value (or equivalent `Some(true)` check), for example:
   - `x.is_valid(idx_in_val) && x.value(idx_in_val)`
   - or `x.iter().nth(idx_in_val) == Some(Some(true))`
   - or a shared helper encapsulating `Some(true)` semantics.
   
   A helper-based approach is preferable to avoid semantic drift with other 
grouped aggregate paths.
   
   ## Testing Plan
   1. Add SQLLogicTest coverage for grouped `first_value` with nullable 
`FILTER` predicate returning no `TRUE` rows.
   2. Add the equivalent grouped `last_value` case.
   3. Add at least one mixed case where only some rows satisfy `Some(true)` to 
verify rows with `NULL` predicates are excluded.
   4. Run:
      - `cargo test -p datafusion-functions-aggregate --lib first_last`
      - `cargo test -p datafusion-sqllogictest --test sqllogictests aggregate`
   
   ## Acceptance Criteria
   1. Grouped `first_value` and `last_value` exclude rows where `FILTER` 
evaluates to `NULL`.
   2. Reproducer query returns `NULL` as expected.
   3. New regression tests fail before the fix and pass after.
   4. No behavior regressions for non-null filter predicates.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to