Github user JoshRosen commented on the pull request:
https://github.com/apache/spark/pull/9067#issuecomment-148218738
@rxin, my understanding of this patch is that it lets us continue to
perform hash-based pre-aggregation on the remainder of the iterator after we've
decided to spill and switch to sort-based aggregation. After we've destroyed
and freed the original hashMap, we now loop back around and use a new hash map
to aggregate the remainder of the iterator, spilling that map as well if it
becomes too large.
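
To make that control flow concrete, here's a minimal, self-contained sketch of
the loop as I understand it. This is not Spark's actual ExternalSorter /
Aggregator code: the `maxMapSize` threshold and `mergeValue` function are
illustrative placeholders, a real spill decision is based on memory tracking,
and a real spill sorts the map by key hash and writes it to disk rather than
keeping it in memory.

```scala
import scala.collection.mutable

object HashAggregateWithSpill {
  // Keep pre-aggregating into an in-memory hash map; when the map grows too
  // large, "spill" it (here: set it aside) and continue hashing the rest of
  // the input into a fresh map.
  def aggregate[K, V](
      records: Iterator[(K, V)],
      mergeValue: (V, V) => V,
      maxMapSize: Int): Seq[mutable.Map[K, V]] = {
    val spills = mutable.ArrayBuffer.empty[mutable.Map[K, V]]
    var currentMap = mutable.Map.empty[K, V]

    for ((k, v) <- records) {
      currentMap.get(k) match {
        case Some(existing) => currentMap(k) = mergeValue(existing, v)
        case None           => currentMap(k) = v
      }
      if (currentMap.size >= maxMapSize) {
        // Spill the full map, then loop back around with a new one.
        spills += currentMap
        currentMap = mutable.Map.empty[K, V]
      }
    }
    spills += currentMap
    spills.toSeq
  }

  def main(args: Array[String]): Unit = {
    val data = Iterator(("a", 1), ("b", 1), ("a", 1), ("c", 1), ("b", 1))
    val maps = aggregate(data, (x: Int, y: Int) => x + y, maxMapSize = 2)
    println(maps) // each map holds partially aggregated (key, count) pairs
  }
}
```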
After this patch, records will have more opportunities to be pre-aggregated
before being spilled to disk or sent across the network. I wonder about the
case where you're doing map-side pre-aggregation and are experiencing a very
low reduction factor: in that case, this patch means that we'll do more work
per record, since we'll be hashing and sorting every record, whereas before
there was a chance that we'd skip hashing of some records. OTOH, if you have an
input consisting of all unique keys then it's best to skip both the hashing AND
the sorting and just push the records straight to the reduce side, but that
optimization is kind of orthogonal to this one.
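
For illustration only (this is not part of the patch), a hedged sketch of that
orthogonal idea: measure the observed reduction factor over an initial sample
and, if keys look mostly unique, stop hashing and pass the remaining records
straight through. The `sampleSize` and `minReduction` parameters are made up
for this sketch.

```scala
import scala.collection.mutable

object AdaptivePreAggregation {
  def combine[K, V](
      records: Iterator[(K, V)],
      mergeValue: (V, V) => V,
      sampleSize: Int = 10000,
      minReduction: Double = 0.5): Iterator[(K, V)] = {
    val map = mutable.Map.empty[K, V]
    var seen = 0

    // Hash the first `sampleSize` records to measure how well they combine.
    while (records.hasNext && seen < sampleSize) {
      val (k, v) = records.next()
      map.get(k) match {
        case Some(e) => map(k) = mergeValue(e, v)
        case None    => map(k) = v
      }
      seen += 1
    }

    val reduction = if (seen == 0) 1.0 else 1.0 - map.size.toDouble / seen
    if (reduction >= minReduction) {
      // Keys combine well: keep pre-aggregating the remainder into the map.
      for ((k, v) <- records) {
        map.get(k) match {
          case Some(e) => map(k) = mergeValue(e, v)
          case None    => map(k) = v
        }
      }
      map.iterator
    } else {
      // Mostly unique keys: emit what we have so far and pass the rest
      // through untouched, skipping further hashing (and, ideally, sorting).
      map.iterator ++ records
    }
  }

  def main(args: Array[String]): Unit = {
    val uniqueKeys = Iterator.tabulate(100)(i => (i, 1))
    // With all-unique keys the sampled reduction factor is 0, so the
    // remaining records bypass the hash map entirely.
    println(combine(uniqueKeys, (a: Int, b: Int) => a + b, sampleSize = 10).size)
  }
}
```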