kumarUjjawal commented on code in PR #22764:
URL: https://github.com/apache/datafusion/pull/22764#discussion_r3360714595


##########
datafusion/sqllogictest/test_files/window.slt:
##########


Review Comment:
   Both tests pass the same column twice (covar_pop(column2, column2)). 
Covariance of a column with itself is just variance, and correlation of a 
column with itself is always 1 so the "two different columns" math is never 
really tested. Please use two distinct columns.
   
   Across both tests, a row actually leaves the window only once (first test, 
last row). In the second test the window only ever grows, so nothing is ever 
removed.
   
   There are no NULLs and no case where the window becomes empty in the middle 
of the data



##########
datafusion/functions-aggregate/src/covariance.rs:
##########


Review Comment:
   
https://github.com/apache/datafusion/blob/71b5da1e747fa25d30ef49424d92001bdfd1d573/datafusion/functions-aggregate/src/covariance.rs#L308-L312
   
   When the window is holding a single row and that row leaves, the count drops 
to 0 and the division produces NaN. The internal running values then stay NaN 
forever, so every later  window result silently comes out as NaN instead of the 
right number.
   
   This is reachable with NULL gaps, e.g. a 2-row sliding window over `10.0, 
NULL, NULL, 30.0, 40.0, 50.0` — once the window slides onto the NULL section it 
empties, and all results after that are NaN.
   
   My suggestion would be when the count is about to reach 0, reset the state 
back to its initial values (count 0, running values 0.0) and skip the division.
   
   



##########
datafusion/functions-aggregate/src/correlation.rs:
##########


Review Comment:
   
https://github.com/apache/datafusion/blob/71b5da1e747fa25d30ef49424d92001bdfd1d573/datafusion/functions-aggregate/src/correlation.rs#L203-L205
   
   Same concern as in covariance.rs. There's a check in the result computation 
that treats "both internal averages are NaN" as meaning "the input contained 
NaN values":
   
   An emptied sliding window also makes the averages NaN, so this check gets 
falsely triggered and `corr` returns NaN even for an empty window, where it 
should return NULL.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to