Github user squito commented on the pull request:
https://github.com/apache/spark/pull/11105#issuecomment-193976996
Hi @holdenk, thanks for working on this, and sorry for such a long delay on
my end. I think this looks really promising, and it would fill a big hole. The
API seems a reasonable compromise given all the tradeoffs. I'm a little
concerned about how this will work with concurrent jobs that share RDDs --
it seems like it will work, but I want to think it over a bit more.
My take on the higher-level stuff:
> 1) What do we call consistent accumulators? Data property accumulators?
Something else?
yeah I'm as bad at naming as the next person ... I do like to think of
these as being a property of the data, as opposed to the computation, so I'd
like something with that in the name.
Going a step further -- I also think that this is actually what Spark users
want in almost all cases, and I'd almost prefer this were the default, rather
than requiring a special flag. I think if you are actually trying to measure
something about the computation, you're an advanced user / writing a framework
/ know the internals. That said, I am still a little nervous about changing
something like this with 2.0; just mentioning it in case everyone else feels
similarly.
> 2) Right now they only work inside of MapPartitionRDDs (which covers the
places where user code would want to put data property accumulators) - but do
we want something like this for internal use and if so what situations do we
want them in?
What about `reduceByKey` and friends? And `treeReduce`? User-defined
RDDs? Should this really be put into *all* RDDs, e.g., by adding a new
`private[spark] doCompute()` which wraps the RDD's `compute()`, or something
like that?
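To make the wrapper idea concrete, here's a rough sketch -- `doCompute` and `markFullyProcessed` are hypothetical names for illustration, not anything that exists in Spark today:

```scala
// Hypothetical sketch only: funnel every RDD's compute() through a
// single private entry point, so data-property accumulator bookkeeping
// runs for all RDD types (reduceByKey, treeReduce, user-defined RDDs,
// ...) instead of being special-cased in MapPartitionsRDD.
abstract class RDD[T](val id: Int) {
  // existing per-partition computation, implemented by each subclass
  def compute(split: Partition, context: TaskContext): Iterator[T]

  // callers would go through this instead of calling compute() directly
  private[spark] final def doCompute(
      split: Partition, context: TaskContext): Iterator[T] = {
    val iter = compute(split, context)
    new Iterator[T] {
      def hasNext: Boolean = {
        val more = iter.hasNext
        // once the partition is exhausted, record that this
        // (rdd, partition) pair contributed its updates exactly once
        if (!more) context.markFullyProcessed(id, split.index) // hypothetical
        more
      }
      def next(): T = iter.next()
    }
  }
}
```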
> 3) Do we want to allow the user to retrieve the value per RDD for
consistent accumulators?
Even more than that, I wish accumulators made it easy to tell when they
were "ready". That doesn't mean anything in the general case, since there
could always be more updates coming in future actions -- but you could at least
tell whether updates from all partitions of one RDD had been included.
That said, I'm OK if folks would prefer to keep this out of the initial
version to keep things small.
Without that feature, I don't really see the point in getting the value
per RDD. You can always make a separate accumulator per RDD (and probably
should if you want to), so I don't see much benefit to letting users query
per RDD.
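For reference, the one-accumulator-per-RDD workaround looks roughly like this with the existing API (a sketch; assumes a live `SparkContext` `sc`, and `parse` / `isValid` are stand-in functions):

```scala
// A separate named accumulator per RDD / stage, instead of querying a
// single accumulator's value per RDD. Spark 1.x API, as in this PR.
val parseErrors = sc.accumulator(0L, "parse errors")
val parsed = sc.textFile("hdfs://...").flatMap { line =>
  parse(line) match {                     // `parse` is a stand-in
    case Some(rec) => Iterator(rec)
    case None      => parseErrors += 1L; Iterator.empty
  }
}

val filterDrops = sc.accumulator(0L, "filter drops") // second accumulator
val cleaned = parsed.filter { rec =>
  val ok = isValid(rec)                   // `isValid` is a stand-in
  if (!ok) filterDrops += 1L
  ok
}
```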
> 4) Do we want to expose a Java API right away as well?
I know folks that want this in Java & Python too; it's such an important
feature that I think we need to provide them all :) But that part of the
implementation should be more straightforward, imo -- you can wait on that
till we iron out the larger design issues.
> 5) Right now the user-facing API just allows users to create accumulators
rather than general accumulables -- do we want to offer both?
I don't see any reason why we shouldn't offer both. The one compelling use
I can think of is if you have an array of counters rather than a bunch of
individually named counters (users bucketed into categories, and you want to
count how many are in each category). You could make an array of accumulators
instead of an accumulator of arrays, but it gets a little trickier if you have
custom types or something. Again, I think that can wait on design discussion,
and I'm also OK if it's added separately. (Less important than Java & Python,
imo.)
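To sketch the accumulator-of-arrays version with the existing Spark 1.x `Accumulable` API (`numBuckets` and `bucketOf` are stand-ins for whatever categorization the user has, and this assumes a live `SparkContext` `sc` and an RDD `users`):

```scala
import org.apache.spark.AccumulableParam

// Count how many users fall into each category using one accumulable
// whose value is an Array[Long] of per-bucket counts.
object BucketCounts extends AccumulableParam[Array[Long], Int] {
  def zero(init: Array[Long]): Array[Long] = new Array[Long](init.length)
  def addAccumulator(counts: Array[Long], bucket: Int): Array[Long] = {
    counts(bucket) += 1L                  // one update from a task
    counts
  }
  def addInPlace(a: Array[Long], b: Array[Long]): Array[Long] = {
    for (i <- a.indices) a(i) += b(i)     // merge partial results
    a
  }
}

val counts = sc.accumulable(new Array[Long](numBuckets))(BucketCounts)
users.foreach { u => counts += bucketOf(u) }
// counts.value on the driver: Array[Long] of per-category totals
```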