xiedeyantu opened a new pull request, #22364:
URL: https://github.com/apache/datafusion/pull/22364

   ## Which issue does this PR close?
   
   - Closes #22190.
   
   ## Rationale for this change
   
   `TopKAggregation` could rewrite grouped `MIN`/`MAX` queries with `ORDER BY 
... LIMIT`
   into the grouped TopK execution path even when the aggregate sort key was 
`NULL`.
   
   The grouped TopK stream previously assumed aggregate sort values reaching 
the heap
   were non-null and skipped rows whose aggregate value was `NULL`. This was 
incorrect
   for queries where a group is valid but its final `MIN`/`MAX` result is `NULL`
   (for example, when all values in the group are `NULL`). In such cases, the 
row
   should still be preserved and ordered according to `NULLS FIRST` / `NULLS 
LAST`
   semantics rather than being dropped.
   
   This change fixes that correctness issue in the execution path instead of 
disabling
   the optimization for nullable aggregates.
   
   ## What changes are included in this PR?
   
   - update primitive TopK heap handling to retain nullable aggregate sort 
values
   - emit primitive heap results with the correct Arrow null buffer
   - remove the row-level `NULL` skip in `GroupedTopKAggregateStream`
   - add a regression test covering grouped `MIN`/`MAX` queries with `NULL` 
aggregate
     results under `ORDER BY ... LIMIT`
   
   ## Are these changes tested?
   
   Yes.
   
   Added a regression test in 
`datafusion/core/tests/physical_optimizer/aggregate_statistics.rs`
   that verifies grouped `MIN`/`MAX` rows with `NULL` aggregate results are 
preserved when
   the TopK aggregation optimization is enabled.
   
   Focused validation run:
   
   - `cargo test -p datafusion-physical-plan aggregates::topk::heap --lib`
   - `cargo test -p datafusion --test core_integration 
null_min_max_topk_preserves_group_rows`
   
   ## Are there any user-facing changes?
   
   Yes.
   
   This fixes an incorrect query result for grouped `MIN`/`MAX` queries using
   `ORDER BY ... LIMIT` when the aggregate sort key is `NULL`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to