zhengruifeng commented on code in PR #36648:
URL: https://github.com/apache/spark/pull/36648#discussion_r882261957
##########
python/pyspark/pandas/tests/test_groupby.py:
##########
@@ -2256,9 +2256,12 @@ def sum_with_acc_frame(x) -> ps.DataFrame[np.float64, np.float64]:
acc += 1
return np.sum(x)
- actual = psdf.groupby("d").apply(sum_with_acc_frame).sort_index()
Review Comment:
The reason is:
1. After this PR, the dataframe will not be cached, since it only contains 1
partition;
2. There is a global sort in `sort_index`, which includes a sampling step that
triggers an action. This sampling causes the accumulator to be computed twice;
this is an already-known issue (see
https://issues.apache.org/jira/browse/SPARK-37487).
There may be room for an optimization that converts a global sort on a single
partition into a local sort on that partition, but I am not sure whether it is
worthwhile.
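A minimal pure-Python sketch (not Spark itself, and not the project's actual code) of why the sampling pass inflates the accumulator: a range-partitioned global sort first samples the mapped data to pick partition boundaries, so a side-effecting mapper runs once for the sample and again for the real sort, doubling the counter. All names here are illustrative.

```python
import random

acc = 0  # plays the role of a Spark accumulator


def mapper(x):
    # Side-effecting task, analogous to sum_with_acc_frame in the test.
    global acc
    acc += 1
    return x


def global_sort(data):
    # Pass 1: sample the mapped data to choose range-partition
    # boundaries -- this re-runs the side-effecting mapper.
    sample = sorted(mapper(x) for x in random.sample(data, k=len(data)))
    # Pass 2: run the mapper again to produce the actual sorted output.
    return sorted(mapper(x) for x in data)


result = global_sort([3, 1, 2])
# acc ends up at 6, not 3: the sampling pass doubled the updates.
```

A local sort on a single partition would need only the second pass, which is why the comment above wonders whether skipping the sampling step could be worthwhile.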
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]