kosiew opened a new issue, #22665:
URL: https://github.com/apache/datafusion/issues/22665

   ## Summary
   Aggregate FILTER row-validity handling is currently fragmented across helper 
and grouped accumulator paths. We should centralize the invariant that a row 
passes aggregate FILTER if and only if the predicate is Some(true).
   
   ## Current State
   - `filter_to_validity` was introduced and grouped-accumulator helper paths 
were updated (including `accumulate_multiple` and `filter_to_nulls` plumbing) 
in #22068
   - `first_last.rs` still uses local filter checks in grouped paths.
   - `accumulate_indices` still repeats 64-bit validity iteration logic across 
multiple branches.
   
   ## Problem
   Duplicated nullable-filter logic increases the risk of semantic drift and 
regressions (especially around NULL predicate handling), and makes validity 
bitmap code harder to maintain.
   
   Because grouped and non-grouped paths can evolve independently, this drift 
can become user-visible as incorrect aggregate results under nullable 
predicates.
   
   ## Proposal
   1. Add one shared helper/iterator in `functions-aggregate-common` that 
encodes:
        - row passes FILTER iff predicate is `Some(true)`
   2. Migrate grouped `first_value` / `last_value` filter paths to this shared 
utility.
   3. Refactor `accumulate_indices` to reuse shared validity iteration and 
remove duplicated loops.
   4. Migrate remaining grouped accumulators incrementally.
   
   ## Scope
   In scope:
   - Shared FILTER validity/predicate utility.
   - Targeted migration of `first_last` grouped paths.
   - Targeted cleanup in `accumulate_indices`.
   - Focused test coverage for nullable FILTER predicates.
   
   Out of scope:
   - Broad performance rewrites not tied to FILTER semantics.
   - Large cross-workspace refactors.
   
   ## Acceptance Criteria
   1. A single reusable utility encodes FILTER pass criteria as `Some(true)`.
   2. Grouped `first_value` / `last_value` no longer use bespoke nullable 
filter checks.
   3. `accumulate_indices` no longer carries duplicated validity loop logic 
that can drift semantically.
   4. Existing aggregate FILTER tests pass.
   5. New tests cover nullable FILTER predicates for grouped first/last 
behavior.
   6. No regression for non-null filter predicates.
   
   ## Testing
   - Add SQLLogicTests for grouped `first_value` and `last_value` with nullable 
FILTER predicates.
   - Include cases where value bits may be true while validity is NULL.
   - Validate result is NULL when no rows satisfy `Some(true)`.
   - Run relevant aggregate crate tests and sqllogictest aggregate coverage.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to