Github user mccheah commented on the pull request:

    https://github.com/apache/spark/pull/4634#issuecomment-74732535
  
    There is a case where map-side-combine is actually not the right thing to 
do in some of my workflows. map-side-combine makes sense if the overall amount 
of data is shrinking as the aggregation is being computed. If the overall size 
of the data is the same or larger then it just creates GC stress when the 
amount of data being shuffled will be the same regardless if map-side-combine 
is used or not.
    
    groupBy is the classic example of this, and I have an aggregation function 
that can potentially do a group by along with multiple other aggregations at 
the same time. So we can't use groupBy because we're computing both the 
per-key-list and other statistics (say, an average per key) at the same time, 
which requires a custom aggregation function. But if we can't use groupBy, then 
I'm forced to use map-side-combine right now and I would like to specify 
turning that off if it doesn't make sense.
    
    cc @MingyuKim


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to