Jefffrey opened a new issue, #19612:
URL: https://github.com/apache/datafusion/issues/19612

   ### Describe the bug
   
   Originally from this discussion: 
https://github.com/apache/datafusion/pull/19278#discussion_r2616514268
   
   It turns out aggregate functions can still be used with window frames even 
if they don't implement `retract_batch`, if the window frame is `ROWS BETWEEN 
UNBOUNDED PRECEDING AND CURRENT ROW`. And if the accumulator consumes the 
internal state as part of `evaluate`, it can return incorrect results.
   
   ### To Reproduce
   
   In `aggregate.slt`, try this:
   
   ```sql
   query ITRRR
   SELECT
       timestamp,
       tags,
       value,
       median(value) OVER (
           PARTITION BY tags
           ORDER BY timestamp
           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
       ),
       percentile_cont(value, 0.5) OVER (
           PARTITION BY tags
           ORDER BY timestamp
           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
       )
   FROM median_window_test
   ORDER BY tags, timestamp;
   ----
   1 tag1 10 10 10
   2 tag1 20 15 15
   3 tag1 30 20 20
   4 tag1 40 25 25
   5 tag1 50 30 30
   1 tag2 60 60 60
   2 tag2 70 65 65
   3 tag2 80 70 70
   4 tag2 90 75 75
   5 tag2 100 80 80
   ```
   
   We would expect those last two columns to be the same. However the actual 
results are:
   
   ```sql
   query ITRRR
   SELECT
       timestamp,
       tags,
       value,
       median(value) OVER (
           PARTITION BY tags
           ORDER BY timestamp
           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
       ),
       percentile_cont(value, 0.5) OVER (
           PARTITION BY tags
           ORDER BY timestamp
           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
       )
   FROM median_window_test
   ORDER BY tags, timestamp;
   ----
   1 tag1 10 10 10
   2 tag1 20 15 20
   3 tag1 30 20 30
   4 tag1 40 25 40
   5 tag1 50 30 50
   1 tag2 60 60 60
   2 tag2 70 65 70
   3 tag2 80 70 80
   4 tag2 90 75 90
   5 tag2 100 80 100
   ```
   
   - See last column; it exactly follows the input data because it keeps 
discarding its state
   
   ### Expected behavior
   
   I've identified the following aggregate functions which likely exhibit this 
behaviour:
   
   - percentile_cont
   - string_agg
   - distinct median
   
   median was fixed by https://github.com/apache/datafusion/pull/19278
   
   We need to fix these accumulators to not consume the internal state on 
evaluate. We also would want to update the documentation around accumulators to 
make this behaviour obvious so we don't fall into the same trap later.
   
   
   ### Additional context
   
   Comment on why consuming internal state is bad for windows: 
https://github.com/apache/datafusion/pull/19278#discussion_r2609476348


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to