zhengchenyu commented on PR #44512: URL: https://github.com/apache/spark/pull/44512#issuecomment-1963155578
@waitinfuture

(1) Performance differs

The experimental data are recorded in https://issues.apache.org/jira/browse/SPARK-46512. How much performance improves depends on the experimental environment. In my experiments, the corresponding time dropped from 75 seconds to 29 seconds.

(2) Total number of spills differs

The reduction in the number of spills is obvious. Before this PR, we combine the unsorted records and then sort them. Combining the unsorted records uses ExternalAppendOnlyMap, which may spill for large data. The subsequent sort may then spill again for large data. After this PR, sorting already places records with the same key next to each other, so we no longer need ExternalAppendOnlyMap to combine them. This saves the spills that the combine step previously incurred.
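
To illustrate the difference between the two orderings described in (2), here is a minimal in-memory sketch in Python. It is not Spark's implementation and does not model spilling. The function names `combine_then_sort` and `sort_then_combine` and the `merge` parameter are hypothetical, chosen only for this example. The dictionary stands in for the role ExternalAppendOnlyMap plays in the combine step.

```python
from itertools import groupby
from operator import itemgetter


def combine_then_sort(records, merge):
    """Sketch of the 'before' path: combine via a hash map, then sort by key."""
    combined = {}  # stands in for the map-based combine step (assumption)
    for key, value in records:
        combined[key] = merge(combined[key], value) if key in combined else value
    return sorted(combined.items(), key=itemgetter(0))


def sort_then_combine(records, merge):
    """Sketch of the 'after' path: sort first, then merge adjacent equal keys."""
    ordered = sorted(records, key=itemgetter(0))
    result = []
    # After sorting, equal keys are adjacent, so a single streaming pass
    # can combine them without a separate map.
    for key, group in groupby(ordered, key=itemgetter(0)):
        values = (value for _, value in group)
        acc = next(values)
        for value in values:
            acc = merge(acc, value)
        result.append((key, acc))
    return result


def add(a, b):
    return a + b


records = [("b", 1), ("a", 2), ("b", 3), ("a", 4), ("c", 5)]
before = combine_then_sort(records, add)
after = sort_then_combine(records, add)
```

Both functions return the same combined, key-ordered result. The second needs only one buffering structure (the sorted run) rather than a map followed by a sort, which is the intuition behind the reduced spilling described above.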
