Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/9067#issuecomment-148218738
  
    @rxin, my understanding of this patch is that it lets us continue to 
perform hash-based pre-aggregation on the remainder of the iterator after we've 
decided to spill and switch to sort-based aggregation. After we've destroyed 
and freed the original hashMap, we'll loop back around and use a new hash map 
to aggregate the remainder of the iterator, spilling that map too if it grows 
too large.
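    
    For illustration, here's a minimal Scala sketch of that loop. The names 
(`hashAggregateWithSpills`, `mergeValue`, `maxSize`) are hypothetical, and it 
glosses over the real size estimation and on-disk spill format in Spark's 
aggregation code; it's just meant to show the "aggregate into a map, spill it 
when it gets big, start a fresh map, repeat" shape described above:
    
        import scala.collection.mutable
        
        // Hypothetical, simplified sketch -- NOT the actual Spark implementation.
        // Records from `iter` are pre-aggregated into an in-memory hash map; when
        // the map grows past `maxSize` its contents are sorted by key and "spilled",
        // a fresh map is allocated, and aggregation continues on the remainder of
        // the iterator.
        def hashAggregateWithSpills[K: Ordering, V](
            iter: Iterator[(K, V)],
            mergeValue: (V, V) => V,
            maxSize: Int): (mutable.HashMap[K, V], Seq[Seq[(K, V)]]) = {
          var currentMap = mutable.HashMap.empty[K, V]
          val spills = mutable.ArrayBuffer.empty[Seq[(K, V)]]
          for ((k, v) <- iter) {
            currentMap(k) = currentMap.get(k).map(mergeValue(_, v)).getOrElse(v)
            if (currentMap.size >= maxSize) {
              // Spill: sort by key (a stand-in for writing a sorted run to disk),
              // free the old map, and loop back around with a new one.
              spills += currentMap.toSeq.sortBy(_._1)
              currentMap = mutable.HashMap.empty[K, V]
            }
          }
          // The caller would merge-sort the spilled runs together with the final
          // in-memory map to produce the fully aggregated output.
          (currentMap, spills.toSeq)
        }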
    
    After this patch, records will have more opportunities to be pre-aggregated 
before being spilled to disk or sent across the network. I wonder about the 
case where you're doing map-side pre-aggregation and are experiencing a very 
low reduction factor: in that case, this patch means that we'll do more work 
per record, since we'll be hashing and sorting every record, whereas before 
there was a chance that we'd skip hashing of some records. OTOH, if you have an 
input consisting of all unique keys then it's best to skip both the hashing AND 
the sorting and just push the records straight to the reduce side, but that 
optimization is kind of orthogonal to this one.

