Github user srowen commented on the pull request:
https://github.com/apache/spark/pull/4634#issuecomment-76695972
So, it seems like there's an argument here that `combineByKey` doesn't add
much over `aggregateByKey`. I agree, although it is slightly more general,
letting you make an initial value as a function of an input, instead of
providing a zero value. But `combineByKey` has all of the advanced options like
`mapSideCombine`.
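
To make the difference concrete, here is a minimal sketch that computes a per-key (sum, count) both ways. The object name, the sample data, and the local master setting are illustrative assumptions, not anything from this PR; it also assumes a Spark version where the pair-RDD implicits are in scope without an extra import.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical demo object; names and data are made up for illustration.
object CombineVsAggregate {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("combine-vs-aggregate")
    val sc = new SparkContext(conf)
    try {
      val pairs = sc.parallelize(Seq(("a", 1), ("a", 3), ("b", 5)))

      // aggregateByKey: every key starts from the same fixed zero value.
      val viaAggregate = pairs.aggregateByKey((0, 0))(
        (acc, v) => (acc._1 + v, acc._2 + 1),
        (x, y) => (x._1 + y._1, x._2 + y._2))

      // combineByKey: the initial accumulator is built from the first value
      // seen for a key, so no zero value is needed.
      val viaCombine = pairs.combineByKey(
        (v: Int) => (v, 1),
        (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1),
        (x: (Int, Int), y: (Int, Int)) => (x._1 + y._1, x._2 + y._2))

      // Both formulations should agree on the per-key (sum, count).
      assert(viaAggregate.collect().toMap == viaCombine.collect().toMap)
    } finally {
      sc.stop()
    }
  }
}
```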
So if you just need `aggregateByKey`, but do need to control advanced
settings, you have to go down a step to use `combineByKey`. You have to provide
a function to make a zero value, instead of a zero value, which isn't a big
deal. Of course, I don't think the API can be changed in the short term.
Removing `combineByKey` would lose one little bit of control too: zero value as
a function, and as it happens now, control over things like map side combine.
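
As a rough sketch of "going down a step": an `aggregateByKey`-style sum written with `combineByKey` so that `mapSideCombine` can be turned off. The zero value is wrapped in a function by applying `seqOp` to it and the first value. The object name, data, and partition count are assumptions for illustration, and the sketch assumes an immutable zero value (it does not try to reproduce how `aggregateByKey` handles a mutable one).

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

// Hypothetical demo object; names and data are made up for illustration.
object AggregateWithoutMapSideCombine {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("no-map-side-combine")
    val sc = new SparkContext(conf)
    try {
      val pairs = sc.parallelize(Seq(("a", 1), ("a", 3), ("b", 5)), 2)

      val zero = 0                                // immutable zero value (assumption)
      val seqOp = (acc: Int, v: Int) => acc + v   // merge a value into an accumulator
      val combOp = (x: Int, y: Int) => x + y      // merge two accumulators

      val sums = pairs.combineByKey(
        (v: Int) => seqOp(zero, v),               // "zero value as a function"
        seqOp,
        combOp,
        new HashPartitioner(2),
        mapSideCombine = false)                   // the setting aggregateByKey does not expose

      assert(sums.collect().toMap == Map("a" -> 4, "b" -> 5))
    } finally {
      sc.stop()
    }
  }
}
```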
We're left with an argument for API consistency between Java and Scala,
which is compelling. That is, they should at least match, irrespective of what
changes may happen later.
`groupByKey` vs `aggregateByKey` seems like a slightly different question
that results in an alternative suggestion: add this `mapSideCombine` flag to
`aggregateByKey`. That leaves two options:
1. Don't change Scala API. Make `combineByKey` consistent in Java API and
expose `mapSideCombine`
2. Add new optional param to Scala `aggregateByKey`. Add to Java
`aggregateByKey` as well.
I slightly prefer 1 because it's a strictly smaller change and leaves
the APIs more consistent. It seems like the purpose of 2 is to remove the
need for `combineByKey` to exist, but it already exists, so that's moot to me.
I'd like to proceed with this change, then. It passes tests and does not
affect the API. I'd like to wait a couple of days for @pwendell or @rxin since
it raises a core API question.