Github user mccheah commented on the pull request:
https://github.com/apache/spark/pull/4634#issuecomment-74732535
There is a case where map-side-combine is actually not the right thing to
do in some of my workflows. map-side-combine makes sense if the overall amount
of data is shrinking as the aggregation is being computed. If the overall size
of the data is the same or larger then it just creates GC stress when the
amount of data being shuffled will be the same regardless if map-side-combine
is used or not.
groupBy is the classic example of this, and I have an aggregation function
that can potentially do a group by along with multiple other aggregations at
the same time. So we can't use groupBy because we're computing both the
per-key-list and other statistics (say, an average per key) at the same time,
which requires a custom aggregation function. But if we can't use groupBy, then
I'm forced to use map-side-combine right now and I would like to specify
turning that off if it doesn't make sense.
cc @MingyuKim
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]