kosiew opened a new pull request, #16258:
URL: https://github.com/apache/datafusion/pull/16258

   ## Which issue does this PR close?
   
   Closes #16228 
   
   ## Rationale for this change
   
   `Array::is_null` checks only an array's own validity buffer. For a 
`DictionaryArray` that means the keys, so a row whose key is valid but points 
to a null entry in the values array is not reported as null. This causes 
incorrect results in aggregation queries such as `count(distinct ...)`, which 
should skip nulls but can currently count them. The change ensures that nulls 
in dictionary values are detected and excluded.
   
   [Arrow's hands are tied on this 
matter](https://github.com/apache/arrow-rs/pull/7608) and so we are fixing the 
issue in this repo.
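
To make the distinction concrete, here is a minimal std-only sketch of the two kinds of null check. `DictColumn`, `is_null_physical`, `is_null_logical`, and `sample_column` are hypothetical names invented for illustration. They model the idea only and are not the arrow-rs or DataFusion types.

```rust
/// Simplified stand-in for a dictionary-encoded column (hypothetical type,
/// not `arrow::array::DictionaryArray`).
struct DictColumn {
    /// One key per row; `None` models a null recorded in the keys' validity.
    keys: Vec<Option<usize>>,
    /// Dictionary values; an entry may itself be null.
    values: Vec<Option<String>>,
}

impl DictColumn {
    /// Physical check: inspects only the key slot for the row.
    fn is_null_physical(&self, row: usize) -> bool {
        self.keys[row].is_none()
    }

    /// Logical check: also follows a valid key into the values.
    fn is_null_logical(&self, row: usize) -> bool {
        match self.keys[row] {
            None => true,
            Some(k) => self.values[k].is_none(),
        }
    }
}

/// Three rows: a real value, a valid key pointing at a null value, a null key.
fn sample_column() -> DictColumn {
    DictColumn {
        keys: vec![Some(0), Some(1), None],
        values: vec![Some("a".to_string()), None],
    }
}
```

In this model, row 1 is the problematic case: the physical check says "not null" while the logical check says "null".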
   
   ## What changes are included in this PR?
   
   - Updated the logic in `DistinctCountAccumulator` to use 
`ScalarValue::is_null()` instead of relying solely on `Array::is_null()` for 
determining null entries.
   - Added SQL logic tests to confirm correct behavior when `DictionaryArray` 
contains only null values.
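
The effect of checking nullness on the resolved value rather than on the key can be sketched as follows. This is an illustrative std-only model, not the `DistinctCountAccumulator` code: `row_value` and `count_distinct` are hypothetical helpers, and `Option<&str>` stands in for a resolved scalar.

```rust
use std::collections::HashSet;

/// Resolve a row to its scalar value (hypothetical helper). A null key or a
/// key pointing at a null dictionary entry both resolve to `None`.
fn row_value<'a>(
    keys: &[Option<usize>],
    values: &[Option<&'a str>],
    row: usize,
) -> Option<&'a str> {
    keys[row].and_then(|k| values[k])
}

/// Count distinct non-null values, testing nullness on the resolved scalar
/// instead of on the key slot alone.
fn count_distinct(keys: &[Option<usize>], values: &[Option<&str>]) -> usize {
    let mut seen = HashSet::new();
    for row in 0..keys.len() {
        if let Some(v) = row_value(keys, values, row) {
            seen.insert(v);
        }
    }
    seen.len()
}
```

With keys `[Some(0), Some(0), None]` over a dictionary whose only entry is null, `count_distinct` returns `0` in this model, which mirrors the all-null case the new SQL logic tests target.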
   
   ## Are these changes tested?
   
   Yes, tests have been added to `sqllogictest/test_files/aggregate.slt` to 
verify that `count(distinct ...)` correctly returns `0` when all dictionary 
values are null. These tests cover both query logic and table lifecycle 
(create/drop).
   
   ## Are there any user-facing changes?
   
   Yes. This change corrects the results of `count(distinct ...)` queries 
involving `DictionaryArray` columns with nulls in the value array. Users can 
now expect consistent and correct results across different partition settings 
and query plans.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

