Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/9067#issuecomment-148218738

@rxin, my understanding of this patch is that it lets us continue to perform hash-based pre-aggregation on the remainder of the iterator after we've decided to spill and switch to sort-based aggregation. After we've destructed and freed the original hashMap, we'll now loop back around and continue to use a new hash map to aggregate the remainder of the iterator, spilling that map if it also becomes too large. After this patch, records will have more opportunities to be pre-aggregated before being spilled to disk or sent across the network.

I wonder about the case where you're doing map-side pre-aggregation and are experiencing a very low reduction factor: in that case, this patch means that we'll do more work per record, since we'll be hashing and sorting every record, whereas before there was a chance that we'd skip hashing of some records.

OTOH, if you have an input consisting of all unique keys, then it's best to skip both the hashing AND the sorting and just push the records straight to the reduce side, but that optimization is kind of orthogonal to this one.
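For readers following along, here is a minimal sketch of the control flow being discussed, not the actual TungstenAggregationIterator code: names such as `preAggregate`, `maxMapSize`, and `spillToDisk` are hypothetical placeholders. The point it illustrates is that, after a spill, the loop starts a fresh hash map for the remaining input instead of falling back to purely sort-based aggregation.

```scala
import scala.collection.mutable

// Hypothetical sketch: hash-based pre-aggregation with repeated spill-and-restart.
def preAggregate[K, V](
    records: Iterator[(K, V)],
    combine: (V, V) => V,
    maxMapSize: Int,
    spillToDisk: Iterator[(K, V)] => Unit): Unit = {
  var map = mutable.HashMap.empty[K, V]
  while (records.hasNext) {
    val (k, v) = records.next()
    // Hash-based pre-aggregation: merge the incoming value into the map.
    map.update(k, map.get(k).map(combine(_, v)).getOrElse(v))
    if (map.size >= maxMapSize) {
      // Spill the sorted contents of the current map, then (with this patch)
      // loop back with a fresh map rather than switching to pure sort-based
      // aggregation for the rest of the input.
      spillToDisk(map.toSeq.sortBy(_._1.hashCode()).iterator)
      map = mutable.HashMap.empty[K, V]
    }
  }
  if (map.nonEmpty) {
    spillToDisk(map.toSeq.sortBy(_._1.hashCode()).iterator)
  }
  // The spilled sorted runs are later merge-sorted and combined (not shown).
}
```

Under this sketch, every record pays the hashing cost and, once spilled, the sorting cost as well, which is the low-reduction-factor concern raised above.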