[GitHub] [spark] zhengruifeng commented on pull request #37845: [SPARK-40399][PS] Make `pearson` correlation in `DataFrame.corr` support missing values and `min_periods `

GitBox Sun, 11 Sep 2022 18:27:49 -0700


zhengruifeng commented on PR #37845:
URL: https://github.com/apache/spark/pull/37845#issuecomment-1243107196


   @srowen The existing implementation calls [`Correlation.corr`](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/stat/Correlation.scala#L67-L75) on the ML side, which accepts a vector column and can also handle `NaN`.
   
   But Pandas-API-on-Spark uses `null` internally to represent missing values, 
which causes an error in `Correlation.corr`. Moreover, to support the new 
`min_periods` parameter, lazy evaluation, and new scenarios 
(groupBy/expanding/rolling/corrwith in the future), I think we need a new 
implementation for correlation.
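
   To illustrate the semantics being discussed, here is a minimal pure-Python sketch of pairwise Pearson correlation that skips missing values and honors a `min_periods` threshold. It is not the code in this PR and does not use Spark; the function name `pairwise_pearson` and the helper `_is_missing` are hypothetical, and treating both `None` and `NaN` as missing is an assumption made for the example.

```python
import math


def _is_missing(value):
    # Assumption for this sketch: both None (null) and float NaN count as missing.
    return value is None or (isinstance(value, float) and math.isnan(value))


def pairwise_pearson(xs, ys, min_periods=1):
    """Pearson correlation over the rows where both values are present.

    Returns NaN when fewer than `min_periods` complete pairs exist,
    when fewer than two pairs exist, or when either column is constant.
    """
    # Keep only rows where neither side is missing (pairwise deletion).
    pairs = [(x, y) for x, y in zip(xs, ys)
             if not _is_missing(x) and not _is_missing(y)]
    n = len(pairs)
    if n < min_periods or n < 2:
        return math.nan

    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n

    # Sums of centered products; the common 1/(n-1) factor cancels in the ratio.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        return math.nan
    return cov / math.sqrt(var_x * var_y)


a = [1.0, 2.0, None, 4.0]
b = [2.0, 4.0, 6.0, None]
print(pairwise_pearson(a, b))                 # only rows 0 and 1 are complete
print(pairwise_pearson(a, b, min_periods=3))  # too few complete pairs
```

   With only two complete pairs, the second call falls below `min_periods=3` and yields `NaN`, which is the behavior the new parameter is meant to expose.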
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

