kumarUjjawal opened a new issue, #22754:
URL: https://github.com/apache/datafusion/issues/22754

   ### Describe the bug
   
   SlidingDistinctSumAccumulator (the accumulator used for SUM(DISTINCT) in 
bounded/sliding window frames, added in #16943) has two related null-handling 
bugs:
   
     1. It iterates the raw values buffer, ignoring the validity mask. Both 
update_batch (datafusion/functions-aggregate/src/sum.rs:539-549) and 
retract_batch (sum.rs:595-608) do for
     &v in arr.values(), which yields the physical value at every slot, 
including NULL slots. Arrow does not guarantee the contents of the values 
buffer at null positions, so a NULL
     input row inserts an arbitrary physical value (commonly 0, but kernels can 
leave non-zero garbage) into the distinct-counts map and adds it to the running 
sum on first occurrence.
     With non-zeroed null slots the sum can be arbitrarily wrong; even with 
zeroed slots, NULL rows pollute the distinct key set with a spurious 0 key.
     2. evaluate never returns NULL. sum.rs:552-554 returns 
ScalarValue::Int64(Some(self.sum)) unconditionally, so even after fixing (1), a 
frame containing only NULL rows would return
     0 instead of NULL. (SUM(DISTINCT) of zero non-null values must be NULL; 
the empty-frame case is already handled by the caller's default_value, but a 
non-empty frame of all NULLs
     reaches evaluate.)
   
     For comparison, the non-DISTINCT sliding path (SlidingSumAccumulator, 
sum.rs:468-) handles both cases correctly via compute::sum, which respects the 
validity mask.
   
   ### To Reproduce
   
   ```SQL  
   -- DISTINCT sliding sum: WRONG — ts=1 frame contains only NULL, returns 0
     SELECT ts, sum(DISTINCT v) OVER (ORDER BY ts ROWS BETWEEN 1 PRECEDING AND 
CURRENT ROW) AS s
     FROM (VALUES (1, CAST(NULL AS BIGINT)), (2, 3)) t(ts, v);
     -- +----+---+
     -- | 1  | 0 |   <-- should be NULL
     -- | 2  | 3 |
     -- +----+---+
   
     -- Non-DISTINCT sliding sum: correct
     SELECT ts, sum(v) OVER (ORDER BY ts ROWS BETWEEN 1 PRECEDING AND CURRENT 
ROW) AS s
     FROM (VALUES (1, CAST(NULL AS BIGINT)), (2, 3)) t(ts, v);
     -- +----+------+
     -- | 1  | NULL |
     -- | 2  | 3    |
     -- +----+------+
   
     -- Non-window DISTINCT sum: correct
     SELECT sum(DISTINCT v) FROM (VALUES (CAST(NULL AS BIGINT))) t(v);
     -- NULL
   ```
   
   ### Expected behavior
   
     - NULL input rows are skipped by both update_batch and retract_batch 
(consistent with SUM semantics and with the non-DISTINCT sliding accumulator).
     - evaluate returns Int64(None) when no non-null value is in the frame 
(e.g., when the counts map is empty).
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to