zhengruifeng commented on code in PR #36648:
URL: https://github.com/apache/spark/pull/36648#discussion_r882261957
##########
python/pyspark/pandas/tests/test_groupby.py:
##########
@@ -2256,9 +2256,12 @@ def sum_with_acc_frame(x) -> ps.DataFrame[np.float64, np.float64]:
acc += 1
return np.sum(x)
- actual = psdf.groupby("d").apply(sum_with_acc_frame).sort_index()
Review Comment:
The reason is:
1. After this PR, the dataframe will not be cached, since it only contains 1
partition;
2. There is a global sort in `sort_index`, which includes a sampling step that
triggers an action. This sampling causes the accumulator to be computed twice;
this is an already-known issue (see
https://issues.apache.org/jira/browse/SPARK-37487).
There may be room for an optimization that converts a global sort on a single
partition into a local sort on that partition, but I am not sure whether it is
worthwhile.
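A minimal pure-Python sketch (not Spark itself, and not the project's actual code) of why the sampling pass inflates the accumulator: a range-partitioned global sort first samples the mapped data to pick partition boundaries, so a side-effecting mapper runs once for the sample and again for the real sort, doubling the counter. All names here are illustrative.

```python
import random

acc = 0  # plays the role of a Spark accumulator


def mapper(x):
    # Side-effecting task, analogous to sum_with_acc_frame in the test.
    global acc
    acc += 1
    return x


def global_sort(data):
    # Pass 1: sample the mapped data to choose range-partition
    # boundaries -- this re-runs the side-effecting mapper.
    sample = sorted(mapper(x) for x in random.sample(data, k=len(data)))
    # Pass 2: run the mapper again to produce the actual sorted output.
    return sorted(mapper(x) for x in data)


result = global_sort([3, 1, 2])
# acc ends up at 6, not 3: the sampling pass doubled the updates.
```

A local sort on a single partition would need only the second pass, which is why the comment above wonders whether skipping the sampling step could be worthwhile.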
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]