Yikun commented on a change in pull request #35840:
URL: https://github.com/apache/spark/pull/35840#discussion_r830887319



##########
File path: python/pyspark/pandas/frame.py
##########
@@ -6846,7 +6866,16 @@ def _sort(
             (False, "last"): Column.desc_nulls_last,
         }
        by = [mapper[(asc, na_position)](scol) for scol, asc in zip(by, ascending)]
-        sdf = self._internal.resolved_copy.spark_frame.sort(*by, NATURAL_ORDER_COLUMN_NAME)
+
+        natural_order_scol = F.col(NATURAL_ORDER_COLUMN_NAME)
+
+        if keep == "last":
+            natural_order_scol = Column.desc(natural_order_scol)
+        elif keep != "first":

Review comment:
       `keep="all"` should raise `NotImplementedError`; any other unexpected value should raise `ValueError`.
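A minimal sketch of the validation branch this comment suggests (the helper name and error messages are assumptions, not the PR's actual code):

```python
def check_keep(keep: str) -> None:
    # Hypothetical helper illustrating the suggested error behavior.
    if keep == "all":
        # pandas supports keep="all", but it is not implemented here yet.
        raise NotImplementedError('`keep`="all" is not implemented yet.')
    elif keep not in ("first", "last"):
        raise ValueError('`keep` must be either "first", "last" or "all".')

check_keep("first")  # valid values pass through silently
```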

##########
File path: python/pyspark/pandas/tests/test_dataframe.py
##########
@@ -1789,21 +1789,47 @@ def test_swapaxes(self):
 
     def test_nlargest(self):
         pdf = pd.DataFrame(
-            {"a": [1, 2, 3, 4, 5, None, 7], "b": [7, 6, 5, 4, 3, 2, 1]}, 
index=np.random.rand(7)
+            {"a": [1, 2, 3, 4, 5, None, 7], "b": [7, 6, 5, 4, 3, 2, 1], "c": 
[1, 1, 2, 2, 3, 3, 3]},
+            index=np.random.rand(7),
         )
         psdf = ps.from_pandas(pdf)
        self.assert_eq(psdf.nlargest(n=5, columns="a"), pdf.nlargest(5, columns="a"))
        self.assert_eq(psdf.nlargest(n=5, columns=["a", "b"]), pdf.nlargest(5, columns=["a", "b"]))
+        self.assert_eq(psdf.nlargest(n=5, columns=["c"]), pdf.nlargest(5, columns=["c"]))

Review comment:
       ```suggestion
           self.assert_eq(psdf.nlargest(5, columns=["c"]), pdf.nlargest(5, columns=["c"]))
       ```
   
   nit: the call style (`n=5` vs positional `5`) looks inconsistent with the tests above.

##########
File path: python/pyspark/pandas/frame.py
##########
@@ -7321,6 +7338,10 @@ def nlargest(self, n: int, columns: Union[Name, List[Name]]) -> "DataFrame":
             Number of rows to return.
         columns : label or list of labels
             Column label(s) to order by.
+        keep : {'first', 'last'}, default 'first'

Review comment:
       maybe also document that ``all`` is not supported yet
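One possible wording for such a note, expressed here as a docstring fragment (the exact text is an assumption, not taken from the PR):

```python
# A hypothetical docstring fragment for the `keep` parameter; the
# wording is a sketch, not the PR's final documentation.
KEEP_DOC = """\
keep : {'first', 'last'}, default 'first'
    Determines how duplicate values in the ordering column(s) are resolved:
    - ``first`` : take the first occurrence.
    - ``last`` : take the last occurrence.
    Note that ``all`` (as in pandas) is not supported yet.
"""
```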

##########
File path: python/pyspark/pandas/frame.py
##########
@@ -7438,8 +7498,42 @@ def nsmallest(self, n: int, columns: Union[Name, List[Name]]) -> "DataFrame":
         0  1.0   6
         1  2.0   7
         2  3.0   8
+
+        The examples below show how ties are resolved, which is decided by `keep`.
+
+        >>> tied_df = ps.DataFrame({'X': [1, 1, 2, 2, 3]}, index=['a', 'b', 'c', 'd', 'e'])

Review comment:
       it would be good to use a multi-column DataFrame here; otherwise we can't see any difference between first/last/the default
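A sketch of what this comment asks for: with a second column, ``keep='first'`` and ``keep='last'`` select visibly different rows on ties. Plain pandas is used here for illustration; the column names and values are made up, and the PR's doctest would use ``ps.DataFrame`` instead:

```python
import pandas as pd

# Two-column frame with ties in X; Y makes the chosen rows distinguishable.
tied_df = pd.DataFrame(
    {"X": [1, 2, 2, 3, 3], "Y": [10, 20, 30, 40, 50]},
    index=["a", "b", "c", "d", "e"],
)

# The two 3s always make the cut; the tie between the two 2s at the
# boundary is broken by `keep`.
first = tied_df.nlargest(3, columns="X", keep="first")  # keeps row "b"
last = tied_df.nlargest(3, columns="X", keep="last")    # keeps row "c"
```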

##########
File path: python/pyspark/pandas/frame.py
##########
@@ -7395,6 +7451,10 @@ def nsmallest(self, n: int, columns: Union[Name, List[Name]]) -> "DataFrame":
             Number of items to retrieve.
         columns : list or str
             Column name or names to order by.
+        keep : {'first', 'last'}, default 'first'

Review comment:
       ditto




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


