kumarUjjawal commented on code in PR #22764: URL: https://github.com/apache/datafusion/pull/22764#discussion_r3360714595
########## datafusion/sqllogictest/test_files/window.slt: ########## Review Comment: Both tests pass the same column twice (covar_pop(column2, column2)). Covariance of a column with itself is just variance, and correlation of a column with itself is always 1 so the "two different columns" math is never really tested. Please use two distinct columns. Across both tests, a row actually leaves the window only once (first test, last row). In the second test the window only ever grows, so nothing is ever removed. There are no NULLs and no case where the window becomes empty in the middle of the data ########## datafusion/functions-aggregate/src/covariance.rs: ########## Review Comment: https://github.com/apache/datafusion/blob/71b5da1e747fa25d30ef49424d92001bdfd1d573/datafusion/functions-aggregate/src/covariance.rs#L308-L312 When the window is holding a single row and that row leaves, the count drops to 0 and the division produces NaN. The internal running values then stay NaN forever, so every later window result silently comes out as NaN instead of the right number. This is reachable with NULL gaps, e.g. a 2-row sliding window over `10.0, NULL, NULL, 30.0, 40.0, 50.0` — once the window slides onto the NULL section it empties, and all results after that are NaN. My suggestion would be when the count is about to reach 0, reset the state back to its initial values (count 0, running values 0.0) and skip the division. ########## datafusion/functions-aggregate/src/correlation.rs: ########## Review Comment: https://github.com/apache/datafusion/blob/71b5da1e747fa25d30ef49424d92001bdfd1d573/datafusion/functions-aggregate/src/correlation.rs#L203-L205 Same concern as in covariance.rs. There's a check in the result computation that treats "both internal averages are NaN" as meaning "the input contained NaN values": An emptied sliding window also makes the averages NaN, so this check gets falsely triggered and `corr` returns NaN even for an empty window, where it should return NULL. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
