[GitHub] [spark] Yikun commented on a change in pull request #35868: [SPARK-38576][PYTHON] Implement `numeric_only` parameter for `DataFrame/Series.rank` to rank numeric columns only

GitBox Mon, 21 Mar 2022 01:25:55 -0700


Yikun commented on a change in pull request #35868:
URL: https://github.com/apache/spark/pull/35868#discussion_r830850186




##########
File path: python/pyspark/pandas/series.py
##########
@@ -3540,8 +3540,10 @@ def quantile(psser: Series) -> Column:
 
             return self._reduce_for_stat_function(quantile, name="quantile")
 
-    # TODO: add axis, numeric_only, pct, na_option parameter
-    def rank(self, method: str = "average", ascending: bool = True) -> "Series":
+    # TODO: add axis, pct, na_option parameter
+    def rank(
+        self, method: str = "average", ascending: bool = True, numeric_only: Optional[bool] = None

Review comment:
       Same here: test `numeric_only=None` explicitly.

##########
File path: python/pyspark/pandas/tests/test_series.py
##########
@@ -1333,6 +1333,17 @@ def test_rank(self):
         self.assert_eq(pser.rank(method="first"), psser.rank(method="first").sort_index())
         self.assert_eq(pser.rank(method="dense"), psser.rank(method="dense").sort_index())
 
+        non_numeric_pser = pd.Series(["a", "c", "b", "d"], name="x", index=[10, 11, 12, 13])
+        non_numeric_psser = ps.from_pandas(non_numeric_pser)
+        self.assert_eq(
+            non_numeric_pser.rank(numeric_only=True).sort_index(),
+            non_numeric_psser.rank(numeric_only=True).sort_index(),
+        )
+        self.assert_eq(
+            (non_numeric_pser + "x").rank(numeric_only=True).sort_index(),
+            (non_numeric_psser + "x").rank(numeric_only=True).sort_index(),
+        )

Review comment:
       ```suggestion
           self.assert_eq(
               non_numeric_pser.rank(numeric_only=True),
               non_numeric_psser.rank(numeric_only=True),
           )
           self.assert_eq(
               (non_numeric_pser + "x").rank(numeric_only=True),
               (non_numeric_psser + "x").rank(numeric_only=True),
           )
   ```
   
   I guess `sort_index` is unnecessary here, because we will get an empty Series.

##########
File path: python/pyspark/pandas/frame.py
##########
@@ -10239,8 +10239,10 @@ def any(self, axis: Axis = 0) -> "Series":
 
         return first_series(DataFrame(internal))
 
-    # TODO: add axis, numeric_only, pct, na_option parameter
-    def rank(self, method: str = "average", ascending: bool = True) -> "DataFrame":
+    # TODO: add axis, pct, na_option parameter
+    def rank(
+        self, method: str = "average", ascending: bool = True, numeric_only: Optional[bool] = None

Review comment:
       
https://github.com/pandas-dev/pandas/blob/6033ed4b3383d874ee4a8a461724c0b8c2ca968d/pandas/core/generic.py#L8651-L8660
   
   Since `numeric_only=None` is expected to be deprecated in a future pandas
release, we might want to add a test for it (setting `numeric_only=None`
explicitly) to protect this behavior going forward.
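
   To illustrate the kind of test meant here, a minimal pure-Python sketch follows. `rank_sketch` is a hypothetical stand-in, not the pandas or pandas-on-Spark implementation; it assumes that `None` and `False` both rank whatever is given, and that `numeric_only=True` on non-numeric data yields an empty result, as the tests in this PR expect.

```python
from numbers import Number
from typing import List, Optional, Sequence


def rank_sketch(values: Sequence, numeric_only: Optional[bool] = None) -> List[float]:
    """Hypothetical stand-in for rank(method="average") with a tri-state flag."""
    is_numeric = all(
        isinstance(v, Number) and not isinstance(v, bool) for v in values
    )
    if numeric_only and not is_numeric:
        # Assumption: non-numeric input is dropped entirely, giving an empty result.
        return []
    # Assumption of this sketch: None and False behave the same way.
    ordered = sorted(values)
    # Average rank: mean of the 1-based positions that equal values occupy.
    return [ordered.index(v) + 1 + (ordered.count(v) - 1) / 2 for v in values]


# Passing None explicitly pins today's behavior even if the default changes later.
assert rank_sketch([3, 1, 2], numeric_only=None) == rank_sketch([3, 1, 2])
assert rank_sketch(["a", "c", "b", "d"], numeric_only=True) == []
```

   The first assertion is the one suggested above: it passes `numeric_only=None` by name rather than relying on the default, so a later change to the default value would not silently alter what the test covers.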




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


