[GitHub] [spark] ueshin commented on a change in pull request #33752: [SPARK-36401][PYTHON] Implement Series.cov

GitBox Wed, 01 Sep 2021 11:21:25 -0700


ueshin commented on a change in pull request #33752:
URL: https://github.com/apache/spark/pull/33752#discussion_r700456001




##########
File path: python/pyspark/pandas/tests/test_series.py
##########
@@ -2905,6 +2905,21 @@ def test_combine_first(self):
 
         self.assert_eq(psser1.combine_first(psser2), 
pser1.combine_first(pser2))
 
+    def test_cov_of_series_in_same_frame(self):
+        pser = pd.DataFrame(

Review comment:
       nit: shall we use `pdf` for DataFrame?

##########
File path: python/pyspark/pandas/tests/test_series.py
##########
@@ -2905,6 +2905,21 @@ def test_combine_first(self):
 
         self.assert_eq(psser1.combine_first(psser2), 
pser1.combine_first(pser2))
 
+    def test_cov_of_series_in_same_frame(self):
+        pser = pd.DataFrame(
+            {
+                "s1": [0.90010907, 0.13484424, 0.62036035],
+                "s2": [0.12528585, 0.26962463, 0.51111198],
+            },
+            index=[0, 1, 2],
+        )
+
+        pcov = pser["s1"].cov(pser["s2"])
+
+        psser = ps.from_pandas(pser)

Review comment:
       ditto: `psdf`?

##########
File path: python/pyspark/pandas/tests/test_ops_on_diff_frames.py
##########
@@ -1793,6 +1793,23 @@ def test_rank(self):
             pdf["Col2"].rank().loc[pdf["Col1"] == 20], 
psdf["Col2"].rank().loc[psdf["Col1"] == 20]
         )
 
+    def test_cov(self):
+        pser1 = pd.Series([0.90010907, 0.13484424, 0.62036035], index=[0, 1, 
2])
+        pser2 = pd.Series([0.12528585, 0.26962463, 0.51111198], index=[1, 2, 
3])
+        pcov = pser1.cov(pser2)
+
+        psser1 = ps.from_pandas(pser1)
+        psser2 = ps.from_pandas(pser2)
+        pscov = psser1.cov(psser2)
+        self.assert_eq(pcov, pscov, almost=True)

Review comment:
       Could you also try the cases:
   - with `min_periods` on boundaries (2 and 3?)
   - when the lengths of Series are different

##########
File path: python/pyspark/pandas/tests/test_series.py
##########
@@ -2905,6 +2905,21 @@ def test_combine_first(self):
 
         self.assert_eq(psser1.combine_first(psser2), 
pser1.combine_first(pser2))
 
+    def test_cov_of_series_in_same_frame(self):
+        pser = pd.DataFrame(
+            {
+                "s1": [0.90010907, 0.13484424, 0.62036035],
+                "s2": [0.12528585, 0.26962463, 0.51111198],
+            },
+            index=[0, 1, 2],
+        )
+
+        pcov = pser["s1"].cov(pser["s2"])
+
+        psser = ps.from_pandas(pser)
+        pscov = psser["s1"].cov(psser["s2"])
+        self.assert_eq(pcov, pscov, almost=True)

Review comment:
       Could you also try the cases:
   - with `min_periods` on boundaries (3 and 4?)
   - when the dtypes of the Series are not numeric




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [spark] ueshin commented on a change in pull request #33752: [SPARK-36401][PYTHON] Implement Series.cov

Reply via email to