[GitHub] [spark] itholic commented on a change in pull request #33752: [SPARK-36401][PYTHON] Implement Series.cov

GitBox Tue, 17 Aug 2021 23:40:29 -0700


itholic commented on a change in pull request #33752:
URL: https://github.com/apache/spark/pull/33752#discussion_r690917187




##########
File path: python/pyspark/pandas/series.py
##########
@@ -944,6 +944,57 @@ def between(self, left: Any, right: Any, inclusive: bool = True) -> "Series":
 
         return lmask & rmask
 
+    def cov(self, other: "Series", min_periods: int = 1) -> float:
+        """
+        Compute covariance with Series, excluding missing values.
+        Parameters
+        ----------
+        other : Series
+            Series with which to compute the covariance.
+        min_periods : int, default 1
+            Minimum number of observations needed to have a valid result. None = 1.

Review comment:
       So we don't need `None = 1` here, either.

##########
File path: python/pyspark/pandas/tests/test_series.py
##########
@@ -2885,6 +2885,35 @@ def test_at_time(self):
             psser.at_time("0:20").sort_index(),
         )
 
+    def test_cov_of_series_in_same_frame(self):
+        pser = pd.DataFrame(
+            {
+                "s1": [0.90010907, 0.13484424, 0.62036035],
+                "s2": [0.12528585, 0.26962463, 0.51111198],
+            },
+            index=[0, 1, 2],
+        )
+
+        pcov = pser["s1"].cov(pser["s2"])
+
+        psser = ps.from_pandas(pser)
+        pscov = psser["s1"].cov(psser["s2"])
+        self.assert_eq(math.isclose(pcov, pscov), True)
+
+    def test_cov_of_series_in_diff_frames(self):

Review comment:
       Can we move this to `tests/test_ops_on_diff_frames.py`?
   
   Also, after moving it, the remaining test here can simply be named `test_cov` rather than `test_cov_of_series_in_same_frame`.

##########
File path: python/pyspark/pandas/series.py
##########
@@ -944,6 +944,57 @@ def between(self, left: Any, right: Any, inclusive: bool = True) -> "Series":
 
         return lmask & rmask
 
+    def cov(self, other: "Series", min_periods: int = 1) -> float:
+        """
+        Compute covariance with Series, excluding missing values.
+        Parameters
+        ----------
+        other : Series
+            Series with which to compute the covariance.
+        min_periods : int, default 1
+            Minimum number of observations needed to have a valid result. None = 1.
+
+        Returns
+        -------
+        float
+            Covariance between Series and other
+
+        Examples
+        --------
+        >>> from pyspark.pandas.config import set_option, reset_option
+        >>> set_option("compute.ops_on_diff_frames", True)
+        >>> s1 = ps.Series([0.90010907, 0.13484424, 0.62036035])
+        >>> s2 = ps.Series([0.12528585, 0.26962463, 0.51111198])
+        >>> s1.cov(s2)
+        -0.016857626527158744
+        >>> reset_option("compute.ops_on_diff_frames")
+        """
+
+        if min_periods is None:
+            min_periods = 1
+
+        if same_anchor(self, other):
+            self_column_label = verify_temp_column_name(other.to_frame(), "__self_column__")
+            other_column_label = verify_temp_column_name(self.to_frame(), "__other_column__")
+            combined = DataFrame(
+                self._internal.with_new_columns(
+                    [self.rename(self_column_label), other.rename(other_column_label)]
+                )
+            )

Review comment:
       If `self` and `other` have the same anchor, I think we don't need to create another DataFrame.
   
   We can simply select the columns from the internal Spark DataFrame, as below.
   
   ```python
   self._internal.spark_frame.select(
       F.covar_samp(self.spark.column, other.spark.column)
   ).head(1)[0][0]
   ```
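
For reference, a minimal pure-Python sketch of the quantity `F.covar_samp` is expected to return here: the sample covariance with an n - 1 denominator. The helper name `sample_cov` is hypothetical and not part of the PR or of pyspark; it only mirrors the arithmetic on the values from the docstring example.

```python
import math


def sample_cov(xs, ys):
    """Sample covariance (n - 1 denominator) of two equal-length sequences.

    Hypothetical helper for illustration only; not part of pyspark.
    """
    if len(xs) != len(ys):
        raise ValueError("inputs must have the same length")
    n = len(xs)
    if n < 2:
        # A single observation has no sample covariance.
        return math.nan
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    total = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return total / (n - 1)


# Values taken from the docstring example in the diff.
s1 = [0.90010907, 0.13484424, 0.62036035]
s2 = [0.12528585, 0.26962463, 0.51111198]
result = sample_cov(s1, s2)
```

On these inputs the result agrees with the docstring value `-0.016857626527158744` to within a small floating-point tolerance.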

##########
File path: python/pyspark/pandas/series.py
##########
@@ -944,6 +944,57 @@ def between(self, left: Any, right: Any, inclusive: bool = True) -> "Series":
 
         return lmask & rmask
 
+    def cov(self, other: "Series", min_periods: int = 1) -> float:
+        """
+        Compute covariance with Series, excluding missing values.

Review comment:
       Can we add a blank line here?

##########
File path: python/pyspark/pandas/series.py
##########
@@ -944,6 +944,57 @@ def between(self, left: Any, right: Any, inclusive: bool = True) -> "Series":
 
         return lmask & rmask
 
+    def cov(self, other: "Series", min_periods: int = 1) -> float:

Review comment:
       The default value for `min_periods` should be `None`, as you had it before :-)
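
To make the intended semantics concrete, here is a rough pure-Python sketch of how a `None` default could be resolved to 1 and how `min_periods` gates the result. It is modeled on the pandas behavior of returning NaN when too few valid pairs remain, which is an assumption about the target semantics. The function `cov_with_min_periods` is hypothetical and is not the PR's implementation.

```python
import math
from typing import Optional, Sequence


def cov_with_min_periods(
    xs: Sequence[float], ys: Sequence[float], min_periods: Optional[int] = None
) -> float:
    """Hypothetical sketch: sample covariance that honors min_periods."""
    # Treat None the same as 1, so callers may omit the argument.
    if min_periods is None:
        min_periods = 1
    # Keep only pairs where both observations are present (not NaN).
    pairs = [(x, y) for x, y in zip(xs, ys) if not (math.isnan(x) or math.isnan(y))]
    n = len(pairs)
    # Too few valid pairs, or fewer than two (sample covariance undefined).
    if n < min_periods or n < 2:
        return math.nan
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in pairs) / (n - 1)


a = [1.0, 2.0, math.nan, 4.0]
b = [2.0, 4.0, 6.0, 8.0]
# Three valid pairs remain after dropping the NaN row.
default_result = cov_with_min_periods(a, b)
# Requiring four valid pairs cannot be satisfied, so the result is NaN.
gated_result = cov_with_min_periods(a, b, min_periods=4)
```

With `None` as the default, the docstring only needs to document `min_periods : int, optional`, and the `None` handling stays inside the method body.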




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


