zhengruifeng commented on PR #37845: URL: https://github.com/apache/spark/pull/37845#issuecomment-1243107196
@srowen The existing implementation calls [`Correlation.corr`](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/stat/Correlation.scala#L67-L75) on the ML side, which accepts a vector column and can also handle `NaN`. However, Pandas-API-on-Spark uses `null` internally to represent missing values, which causes an error in `Correlation.corr`. Moreover, to support new parameters, lazy evaluation, and new scenarios (groupBy/expanding/rolling/corrwith in the future), I think we need a new implementation for correlation.
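To illustrate the missing-value semantics at issue, here is a minimal pure-Python sketch of a Pearson correlation that skips any pair in which either value is `None` (standing in for `null`) or `NaN`. It is not the code from this PR or from Spark; the names `is_missing` and `pairwise_pearson` are hypothetical, and the inputs are assumed to have equal length.

```python
import math
from typing import Optional, Sequence


def is_missing(value: Optional[float]) -> bool:
    """Treat both None (standing in for null) and NaN as missing."""
    return value is None or math.isnan(value)


def pairwise_pearson(
    xs: Sequence[Optional[float]], ys: Sequence[Optional[float]]
) -> float:
    """Pearson correlation over the pairs where both values are present.

    Hypothetical helper for illustration only; returns NaN when fewer than
    two complete pairs remain or when either side has zero variance.
    """
    # Keep only the complete pairs.
    pairs = [
        (x, y)
        for x, y in zip(xs, ys)
        if not is_missing(x) and not is_missing(y)
    ]
    n = len(pairs)
    if n < 2:
        return math.nan

    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n

    # Sums of centered products; the common 1/n factor cancels in the ratio.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        return math.nan

    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    xs = [1.0, 2.0, None, 4.0, 5.0]
    ys = [2.0, 4.0, 6.0, float("nan"), 10.0]
    # Only the pairs (1, 2), (2, 4) and (5, 10) are complete.
    print(pairwise_pearson(xs, ys))
```

Dropping incomplete pairs per column pair, rather than rejecting the whole input, matches the pandas convention of excluding NA/null values when computing correlations.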
