itholic commented on code in PR #37945:
URL: https://github.com/apache/spark/pull/37945#discussion_r975971640


##########
python/pyspark/pandas/series.py:
##########
@@ -3312,16 +3318,25 @@ def autocorr(self, periods: int = 1) -> float:
             )
         return np.nan if corr is None else corr
 
-    def corr(self, other: "Series", method: str = "pearson") -> float:
+    def corr(
+        self, other: "Series", method: str = "pearson", min_periods: 
Optional[int] = None
+    ) -> float:
         """
         Compute correlation with `other` Series, excluding missing values.
 
+        .. versionadded:: 3.3.0
+
         Parameters
         ----------
         other : Series
-        method : {'pearson', 'spearman'}
+        method : {'pearson', 'spearman', 'kendall'}
             * pearson : standard correlation coefficient
             * spearman : Spearman rank correlation
+            * kendall : Kendall Tau correlation coefficient

Review Comment:
   Maybe we should add `.. versionchanged:: 3.4.0` to mention the newly supported `kendall` method?
   
   ```python
           method : {'pearson', 'spearman', 'kendall'}
               * pearson : standard correlation coefficient
               * spearman : Spearman rank correlation
               * kendall : Kendall Tau correlation coefficient
   
               .. versionchanged:: 3.4.0
                   support 'kendall' for method parameter
   ```
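
   For the doctest value itself, here is a quick local sanity check against plain pandas, assuming the docstring's toy data is `s1 = [.2, .0, .6, .2]` and `s2 = [.3, .6, .0, .1]` (just a sketch for reference, not part of the suggested change):

   ```python
   # Sketch: check what plain pandas returns for Kendall's tau on the
   # docstring's toy data, so the new pandas-on-Spark doctest can match it.
   import pandas as pd

   pdf = pd.DataFrame({"s1": [0.2, 0.0, 0.6, 0.2], "s2": [0.3, 0.6, 0.0, 0.1]})
   print(pdf.s1.corr(pdf.s2, method="kendall"))  # -5/sqrt(30) ~= -0.912871
   ```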



##########
python/pyspark/pandas/tests/test_stats.py:
##########
@@ -258,8 +258,6 @@ def test_skew_kurt_numerical_stability(self):
         self.assert_eq(psdf.kurt(), pdf.kurt(), almost=True)
 
     def test_dataframe_corr(self):
-        # existing 'test_corr' is mixed by df.corr and ser.corr, will delete 'test_corr'
-        # when we have separate tests for df.corr and ser.corr

Review Comment:
   Nice!



##########
python/pyspark/pandas/series.py:
##########
@@ -3333,29 +3348,74 @@ def corr(self, other: "Series", method: str = "pearson") -> float:
         ...                    's2': [.3, .6, .0, .1]})
         >>> s1 = df.s1
         >>> s2 = df.s2
-        >>> s1.corr(s2, method='pearson')  # doctest: +ELLIPSIS
-        -0.851064...
+        >>> s1.corr(s2, method='pearson')
+        -0.85106...
 
-        >>> s1.corr(s2, method='spearman')  # doctest: +ELLIPSIS
-        -0.948683...
+        >>> s1.corr(s2, method='spearman')
+        -0.94868...
 
-        Notes
-        -----
-        There are behavior differences between pandas-on-Spark and pandas.

Review Comment:
   Oh, so now we get the same behavior as pandas?
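
   A quick local check with plain pandas on the same toy data (a sketch for comparison only, assuming `s1 = [.2, .0, .6, .2]` as in the docstring):

   ```python
   # Sketch: what plain pandas returns for the docstring's toy data,
   # to compare against the updated pandas-on-Spark doctest values.
   import pandas as pd

   pdf = pd.DataFrame({"s1": [0.2, 0.0, 0.6, 0.2], "s2": [0.3, 0.6, 0.0, 0.1]})
   print(pdf.s1.corr(pdf.s2, method="pearson"))   # ~ -0.851064
   print(pdf.s1.corr(pdf.s2, method="spearman"))  # ~ -0.948683
   ```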



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
