Github user mccheah commented on the pull request:

    https://github.com/apache/spark/pull/4634#issuecomment-74756966
  
    You lose the parallelism inherent in computing the reduce as a distributed 
operation, as opposed to computing it over a list in a single task.
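    The contrast can be sketched in plain Python (the partition layout and all names here are hypothetical, just to show the two shapes of reduce; in Spark each per-partition reduce would run as its own task):

```python
from functools import reduce

# Hypothetical data already split into partitions.
partitions = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

# Distributed-style reduce: each partition reduces independently
# (separate tasks in Spark), then the partial results are merged.
partials = [reduce(lambda a, b: a + b, p) for p in partitions]
total = reduce(lambda a, b: a + b, partials)

# Single-task reduce: collect everything into one list first, then
# fold it sequentially in one place -- no parallelism left.
flat = [x for p in partitions for x in p]
total_single = reduce(lambda a, b: a + b, flat)

assert total == total_single == 45
```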
    
    For more context, I'm exposing aggregation semantics to users where they 
can specify any number of arbitrary aggregations to be computed on a dataset, 
and group-by is only one of those possible aggregations. We take all of the 
aggregations and compute them in a single combineByKey call, so we can't 
rely on the user always wanting a group-by. I could special-case this and 
introspect the requested aggregations to see whether group-by is among them, 
but what if the user wants an aggregation on another metric where 
map-side combine is again suboptimal? It seems better to let my end user 
toggle map-side combine and pass that setting through to the combineByKey 
call.
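    A rough pure-Python illustration of why map-side combine pays off for a sum-style aggregation but not for a group-by (the `map_side_combine` helper and the sample records are hypothetical stand-ins for Spark's combiner machinery, not its actual API):

```python
# Illustrative records: (key, value) pairs spread across two map tasks.
map_tasks = [
    [("a", 1), ("a", 2), ("b", 3)],
    [("a", 4), ("b", 5), ("b", 6)],
]

def map_side_combine(task, create, merge):
    # Combine values per key within a single map task's output,
    # mimicking what a map-side combiner does before the shuffle.
    out = {}
    for k, v in task:
        out[k] = merge(out[k], v) if k in out else create(v)
    return list(out.items())

# Sum aggregation: combining collapses records, so less data is shuffled.
sum_combined = [map_side_combine(t, lambda v: v, lambda acc, v: acc + v)
                for t in map_tasks]
shuffled_sum = sum(len(t) for t in sum_combined)  # 4 records instead of 6

# Group-by: the "combined" value is a list still holding every element,
# so the shuffled data is just as large -- map-side combine buys nothing.
grp_combined = [map_side_combine(t, lambda v: [v], lambda acc, v: acc + [v])
                for t in map_tasks]
shuffled_grp = sum(len(vs) for t in grp_combined for _, vs in t)  # still 6

assert shuffled_sum == 4
assert shuffled_grp == 6
```

This is the crux of the toggle: whether the combiner shrinks the data before the shuffle depends entirely on which aggregation the user asked for.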

