Yikun commented on code in PR #36464:
URL: https://github.com/apache/spark/pull/36464#discussion_r873437306
##########
python/pyspark/pandas/groupby.py:
##########
@@ -2110,22 +2110,60 @@ def _limit(self, n: int, asc: bool) -> FrameLike:
         groupkey_scols = [psdf._internal.spark_column_for(label) for label in groupkey_labels]
         sdf = psdf._internal.spark_frame
-        tmp_col = verify_temp_column_name(sdf, "__row_number__")
+        window = Window.partitionBy(*groupkey_scols)
         # This part is handled differently depending on whether it is a tail or a head.
-        window = (
-            Window.partitionBy(*groupkey_scols).orderBy(F.col(NATURAL_ORDER_COLUMN_NAME).asc())
+        ordered_window = (
+            window.orderBy(F.col(NATURAL_ORDER_COLUMN_NAME).asc())
             if asc
-            else Window.partitionBy(*groupkey_scols).orderBy(
-                F.col(NATURAL_ORDER_COLUMN_NAME).desc()
-            )
+            else window.orderBy(F.col(NATURAL_ORDER_COLUMN_NAME).desc())
         )
-        sdf = (
-            sdf.withColumn(tmp_col, F.row_number().over(window))
-            .filter(F.col(tmp_col) <= n)
-            .drop(tmp_col)
-        )
+        if n >= 0 or LooseVersion(pd.__version__) < LooseVersion("1.4.0"):
+            tmp_row_num_col = verify_temp_column_name(sdf, "__row_number__")
+            sdf = (
+                sdf.withColumn(tmp_row_num_col, F.row_number().over(ordered_window))
+                .filter(F.col(tmp_row_num_col) <= n)
+                .drop(tmp_row_num_col)
+            )
Review Comment:
BTW, we could also consider unifying this to use the `lag` approach:
```python
sdf = (
    # `lag` requires an ordered window
    sdf.withColumn(tmp_lag_col, F.lag(F.lit(0), n).over(ordered_window))
    # for the positive-n case: the first n rows of each group have no row
    # n positions earlier, so their lag value is NULL
    .where(F.isnull(F.col(tmp_lag_col)))
    .drop(tmp_lag_col)
)
```
I can submit a separate PR to address it, if you guys think it's necessary.
Theoretically, `lag` performs better than `row_number`, especially when the number of rows is very large, since it avoids numbering every row in each partition.
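
To illustrate why the two formulations select the same rows, here is a minimal pure-Python simulation (no Spark required). The helper names and sample data are made up for this sketch; it only demonstrates that `isnull(lag(lit(0), n))` marks exactly the first n rows of each ordered group, the same rows that `row_number() <= n` keeps:

```python
from itertools import groupby

def head_via_row_number(rows, key, n):
    # Simulate row_number() over (partition by key): keep rows whose
    # 1-based position within the group is <= n.
    out = []
    for _, grp in groupby(sorted(rows, key=key), key=key):
        for i, row in enumerate(grp, start=1):
            if i <= n:
                out.append(row)
    return out

def head_via_lag(rows, key, n):
    # Simulate lag(lit(0), n): the first n rows of each group have no
    # row n positions earlier, so their lagged value is NULL (None here);
    # filtering on "lag is null" keeps exactly the group head.
    out = []
    for _, grp in groupby(sorted(rows, key=key), key=key):
        grp = list(grp)
        for i, row in enumerate(grp):
            lagged = grp[i - n] if i - n >= 0 else None  # simulated lag
            if lagged is None:
                out.append(row)
    return out

data = [("a", 1), ("a", 2), ("a", 3), ("b", 4), ("b", 5)]
key = lambda r: r[0]
assert head_via_row_number(data, key, 2) == head_via_lag(data, key, 2)
```

The practical difference in Spark is that `row_number` materializes a running count for every row of every partition, while the `lag` filter only needs a fixed-offset lookback before discarding the helper column.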
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]