Github user holdenk commented on the pull request:
https://github.com/apache/spark/pull/10841#issuecomment-177047887
Thanks for taking a look @andrewor14 :) This PR (and the JIRA & design doc)
are all very much focused on enabling the _data property_ use case for
accumulators.
I introduced the GenericAccumulable as a parent to Accumulable to try to
maintain source compatibility between versions and also to limit any cost to
users of "regular" accumulators. Since we are coming up on Spark 2.0, that
might not be worth the overhead, so I'm more than happy to simplify that part
away. Or, if we can change the current accumulator base class in such a way
that it meets the needs while keeping source compatibility, all the better.
For the naming of the Consistent Accumulators (or the naming of the flag),
I'm totally open to ideas - I just didn't have anything else come to mind.
So I'm assuming the clunky part of the API you're referring to is the
"withAccumulator" part of the transformations, which I do feel is pretty
clunky myself. My initial draft attempted to avoid this: I first tried
Implementation Option 1 from the design doc (
https://docs.google.com/document/d/1lR_l1g3zMVctZXrcVjFusq2iQVpr4XvRK_UUDsDr6nk/edit?usp=sharing
), but when multiple transformations are chained together the TaskContext ends
up having the RDD id of the innermost RDD. I'd really love to avoid having the
user explicitly pass this information in, and if there is a permanent
way to do that I'm happy to change this around to give that a shot.
I like the idea of the map of RDD id to accumulated value. Since an entire
RDD might not be computed in the same task, though, I think we will need to
keep track of the value by RDD ID & partition ID (or at least keep track of
the value and a bitmask of the partitions accumulated per RDD ID).
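To make the two bookkeeping options concrete, here is a minimal sketch in plain Python with no Spark dependency. The class and method names (`PerPartitionCounter`, `BitmaskCounter`, `merge`, `value`) are hypothetical and are not part of this PR or of Spark's API; the sketch only assumes that each task reports an (RDD ID, partition ID, value) triple and that a recomputed partition reports the same value again.

```python
from collections import defaultdict


class PerPartitionCounter:
    """Hypothetical sketch: keeps one value per (rdd_id, partition_id)."""

    def __init__(self):
        # rdd_id -> {partition_id: value reported for that partition}
        self._by_rdd = defaultdict(dict)

    def merge(self, rdd_id, partition_id, value):
        # A recomputed task reports the same key again; overwriting
        # instead of adding keeps the total from double counting.
        self._by_rdd[rdd_id][partition_id] = value

    def value(self, rdd_id):
        return sum(self._by_rdd.get(rdd_id, {}).values())


class BitmaskCounter:
    """Hypothetical sketch: one running total plus a bitmask of the
    partitions already accumulated, per RDD ID."""

    def __init__(self):
        self._total = defaultdict(int)  # rdd_id -> running total
        self._seen = defaultdict(int)   # rdd_id -> bitmask of partitions

    def merge(self, rdd_id, partition_id, value):
        bit = 1 << partition_id
        if self._seen[rdd_id] & bit:
            return  # partition already accumulated, ignore the repeat
        self._seen[rdd_id] |= bit
        self._total[rdd_id] += value

    def value(self, rdd_id):
        return self._total.get(rdd_id, 0)
```

In both variants, merging partition 0 of the same RDD twice changes nothing the second time. The per-partition map uses more memory but keeps each partition's contribution, while the bitmask variant stores only a total and one integer per RDD ID.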
cc @squito, who did the initial accumulator work; we've gone over the
design doc together.