[GitHub] [spark] itholic commented on a change in pull request #33752: [SPARK-36401][PYTHON] Implement Series.cov

GitBox Wed, 18 Aug 2021 23:07:59 -0700


itholic commented on a change in pull request #33752:
URL: https://github.com/apache/spark/pull/33752#discussion_r691809706




##########
File path: python/pyspark/pandas/series.py
##########
@@ -944,6 +944,47 @@ def between(self, left: Any, right: Any, inclusive: bool = True) -> "Series":
 
         return lmask & rmask
 
+    def cov(self, other: "Series", min_periods: Optional[int] = None) -> float:
+        """
+        Compute covariance with Series, excluding missing values.
+
+        Parameters
+        ----------
+        other : Series
+            Series with which to compute the covariance.
+        min_periods : int, optional
+            Minimum number of observations needed to have a valid result.
+
+        Returns
+        -------
+        float
+            Covariance between Series and other
+
+        Examples
+        --------
+        >>> from pyspark.pandas.config import set_option, reset_option
+        >>> set_option("compute.ops_on_diff_frames", True)
+        >>> s1 = ps.Series([0.90010907, 0.13484424, 0.62036035])
+        >>> s2 = ps.Series([0.12528585, 0.26962463, 0.51111198])
+        >>> s1.cov(s2)
+        -0.016857626527158744
+        >>> reset_option("compute.ops_on_diff_frames")
+        """

Review comment:
       Can we add type checking for `other`, along with related tests?
   
   If `other` is not a Series, we should catch it explicitly and raise `TypeError`, as pandas does.
   
   ```python
   >>> pser.cov([0.12528585, 0.26962463, 0.51111198])
   Traceback (most recent call last):
   ...
   TypeError: unsupported type: <class 'list'>
   ```
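
A minimal sketch of the kind of check being asked for, under stated assumptions: the `Series` class below is a hypothetical stand-in written only for this illustration, not the pandas-on-Spark class, and it pairs values by position instead of aligning on the index. The `min_periods` handling is likewise an assumption. Only the `TypeError` message is taken from the pandas traceback above.

```python
import math
from typing import Any, List, Optional


class Series:
    """Hypothetical stand-in for pyspark.pandas.Series (illustration only)."""

    def __init__(self, values: List[float]) -> None:
        self.values = list(values)

    def cov(self, other: Any, min_periods: Optional[int] = None) -> float:
        # Reject anything that is not a Series before doing any work,
        # reusing the message shown in the pandas traceback above.
        if not isinstance(other, Series):
            raise TypeError("unsupported type: %s" % type(other))

        # Assumption: positional pairing of the two value lists; the real
        # method would align on the index and drop missing values.
        pairs = list(zip(self.values, other.values))
        needed = 1 if min_periods is None else min_periods
        if len(pairs) < max(needed, 2):
            # Assumption: too few observations yields NaN.
            return math.nan

        mean_x = sum(x for x, _ in pairs) / len(pairs)
        mean_y = sum(y for _, y in pairs) / len(pairs)
        # Sample covariance (divides by n - 1).
        return sum((x - mean_x) * (y - mean_y) for x, y in pairs) / (len(pairs) - 1)
```

A related test could then pass a plain list as `other` and assert that `TypeError` is raised with the `unsupported type` message, for example via `assertRaisesRegex`.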
   
   

##########
File path: python/pyspark/pandas/tests/test_ops_on_diff_frames.py
##########
@@ -1955,6 +1956,20 @@ def test_pow_and_rpow(self):
         with self.assertRaisesRegex(ValueError, "Cannot combine the series or dataframe"):
             psser.rpow(psser_other)
 
+    def test_cov(self):
+        from pyspark.pandas.config import set_option, reset_option
+
+        set_option("compute.ops_on_diff_frames", True)
+        pser1 = pd.Series([0.90010907, 0.13484424, 0.62036035], index=[0, 1, 2])
+        pser2 = pd.Series([0.12528585, 0.26962463, 0.51111198], index=[1, 2, 3])
+        pcov = pser1.cov(pser2)
+
+        psser1 = ps.from_pandas(pser1)
+        psser2 = ps.from_pandas(pser2)
+        pscov = psser1.cov(psser2)
+        self.assert_eq(math.isclose(pcov, pscov), True)

Review comment:
       How about `self.assert_eq(pcov, pscov, almost=True)`?
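
One possible motivation, sketched with the standard library only: wrapping `math.isclose` in an equality assertion hides the compared values when the check fails, while an approximate-equality assertion reports them. The snippet uses `unittest`'s `assertEqual` and `assertAlmostEqual` as stand-ins; it assumes, without confirming, that `assert_eq(..., almost=True)` behaves along similar lines. The `failure_message` helper and the deliberately perturbed `pscov` are hypothetical, introduced only for this illustration.

```python
import math
import unittest

# A bare TestCase instance gives access to the assertion helpers.
tc = unittest.TestCase()

pcov = -0.016857626527158744
pscov = pcov + 1e-3  # deliberately off, to trigger a failure


def failure_message(check):
    """Run a zero-argument check; return its failure text, or None on success."""
    try:
        check()
    except AssertionError as exc:
        return str(exc)
    return None


# Style used in the diff: the failure only says that False is not True.
wrapped = failure_message(lambda: tc.assertEqual(math.isclose(pcov, pscov), True))

# Approximate-equality assertion: the failure includes both values.
direct = failure_message(lambda: tc.assertAlmostEqual(pcov, pscov))

print(wrapped)
print(direct)
```

The second message names both operands and their difference, which tends to make a failing covariance test easier to diagnose.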

##########
File path: python/pyspark/pandas/tests/test_series.py
##########
@@ -2885,6 +2886,21 @@ def test_at_time(self):
             psser.at_time("0:20").sort_index(),
         )
 
+    def test_cov_of_series_in_same_frame(self):
+        pser = pd.DataFrame(
+            {
+                "s1": [0.90010907, 0.13484424, 0.62036035],
+                "s2": [0.12528585, 0.26962463, 0.51111198],
+            },
+            index=[0, 1, 2],
+        )
+
+        pcov = pser["s1"].cov(pser["s2"])
+
+        psser = ps.from_pandas(pser)
+        pscov = psser["s1"].cov(psser["s2"])
+        self.assert_eq(math.isclose(pcov, pscov), True)

Review comment:
       Ditto: how about `self.assert_eq(pcov, pscov, almost=True)` here as well?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.



