Github user squito commented on the pull request:
https://github.com/apache/spark/pull/11105#issuecomment-198444401
@rxin totally understand that this might have an unacceptable impact on
performance. That part remains to be explored. For now the focus has mostly
been on trying to attain the desired semantics.
I sort-of disagree with the other points (perhaps depends a bit on
interpretation). I think holden _is_ working through this in a very detailed
fashion. This is labeled "rfc / wip"; it's not meant to be merged today. That
process is just happening in the open, so more community members can be
involved. For now, the focus is on getting the semantics right: the effort is
on coming up with test cases for all these different scenarios and making sure
things make sense. If the performance turns out to be
unacceptable, well then that gives us a place to work from -- we could then
consider other solutions that perhaps need to compromise a bit on semantics but
do not negatively impact performance.
Can you point to the semantics you disagree with? There is only one case I
see from the document you mentioned -- this covers that case and plenty more as
well:
```scala
rdd.map { i => acc += 1; i }
rdd.count()
rdd.count()
// vs.
val rdd2 = rdd.map { i => acc += 1; i }
rdd.count()
rdd2.count()
```
I agree that it's not obvious what the semantics should be here, but this
actually addresses it in a few ways: (a) most importantly, it just chooses
*some* semantics which are clearly defined -- the updates from each RDD are
counted exactly once, so in the first case the value is N, in the second case
it's 2N (I assume in that example, `rdd` still has an accumulator increment in
its definition?). This is enough info for the user to decide what to do (they
can always create a second accumulator, now that the semantics are understood).
And (b), it actually lets the user choose. As holden has pointed out, with
this approach, you could also keep the accumulator value per-RDD, e.g.
`acc.rddValue(rdd)` and `acc.rddValue(rdd2)`. That's not currently exposed
(just to limit the api changes), but that could be added. This gives
well-defined semantics and a lot of flexibility.
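To make (a) and (b) concrete, here is a toy model in plain Scala (not Spark -- `PerRddAccumulator`, `commit`, and `rddValue` are hypothetical names invented here to sketch the proposed semantics, not the API in the PR): each RDD gets one slot in the accumulator, so re-running an action on the same RDD never double-counts, while a derived RDD that also increments contributes its own N.

```scala
import scala.collection.mutable

// Toy sketch of the proposed "exactly once per RDD" accumulator semantics.
class PerRddAccumulator {
  // Updates keyed by RDD id; committing the same RDD twice overwrites
  // rather than adds, so re-execution does not double-count.
  private val updates = mutable.Map.empty[Int, Long]

  def commit(rddId: Int, total: Long): Unit = updates(rddId) = total

  // Global value: sum of each RDD's contribution, counted exactly once.
  def value: Long = updates.values.sum

  // Per-RDD view, as in the hypothetical acc.rddValue(rdd) discussed above.
  def rddValue(rddId: Int): Long = updates.getOrElse(rddId, 0L)
}

object Demo extends App {
  val n = 100L

  // Case 1: one RDD with an increment, counted twice -> value is N.
  val acc = new PerRddAccumulator
  acc.commit(rddId = 1, total = n) // first rdd.count()
  acc.commit(rddId = 1, total = n) // second rdd.count(): no double count
  println(acc.value)               // 100

  // Case 2: rdd2 derives from rdd and both increment -> value is 2N.
  val acc2 = new PerRddAccumulator
  acc2.commit(rddId = 1, total = n) // rdd.count()
  acc2.commit(rddId = 1, total = n) // rdd2.count() re-runs rdd's increment
  acc2.commit(rddId = 2, total = n) // rdd2's own increment
  println(acc2.value)               // 200
  println(acc2.rddValue(2))         // 100
}
```

Under this model the user who wants per-RDD counts just reads the per-RDD view instead of the global sum; no second accumulator is needed.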