GitHub user holdenk commented on the pull request:

    https://github.com/apache/spark/pull/10841#issuecomment-177047887
  
    Thanks for taking a look @andrewor14 :) This PR (and the JIRA & design doc) 
are all very much focused on enabling the _data property_ use case for 
accumulators.
    
    I introduced the GenericAccumulable as a parent to Accumulable to try and 
maintain source compatibility between versions and also limit any cost to users 
of "regular" accumulators. Since we are coming up on Spark 2.0, that might 
not be worth the overhead, so I'm more than happy to simplify that part away. 
Or, if we can change the current accumulator base class in such a way that it 
meets the needs while keeping source compatibility, all the better.
    
    As for the naming of the Consistent Accumulators (or of the flag), I'm 
totally open to ideas - nothing else came to mind.
    
    So I'm assuming the clunky part of the API you're referring to is the 
"withAccumulator" part of the transformations, which I do feel is pretty 
clunky myself. My initial draft attempted to avoid this (I first tried 
Implementation Option 1 from the design doc: 
https://docs.google.com/document/d/1lR_l1g3zMVctZXrcVjFusq2iQVpr4XvRK_UUDsDr6nk/edit?usp=sharing
), but when multiple transformations are chained together the TaskContext ends 
up having the RDD id of the innermost RDD. I'd really love to avoid having the 
user explicitly pass this information in, and if there is a permanent 
way to do that I'm happy to change this around to give that a shot.
    
    I like the idea of the map of RDD id to accumulated value. Since an entire 
RDD might not be computed in the same task, though, I think we will need to keep 
track of the value by RDD ID & Partition ID (or at least keep track of the value 
and a bitmask of partitions accumulated per RDD ID). 
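
    To make the bookkeeping concrete, here is a rough sketch of the idea in 
plain Python rather than Spark code. The class and method names 
(`ConsistentCounter`, `report`, `value`, `partition_bitmask`) are made up for 
illustration and are not part of this PR or of any Spark API. It assumes each 
task reports one value per (RDD ID, partition ID) pair and that a recomputed 
partition reports the same value again, so duplicates can simply be dropped.

```python
from collections import defaultdict


class ConsistentCounter:
    """Hypothetical driver-side store: one value per (RDD ID, partition ID)."""

    def __init__(self):
        # rdd_id -> {partition_id: value reported for that partition}
        self._by_rdd = defaultdict(dict)

    def report(self, rdd_id, partition_id, value):
        """Record a partition's update; return False if it was a duplicate."""
        seen = self._by_rdd[rdd_id]
        if partition_id in seen:
            # Same partition computed again (retry, speculation, recompute).
            return False
        seen[partition_id] = value
        return True

    def value(self, rdd_id):
        """Sum over the distinct partitions seen so far for this RDD."""
        return sum(self._by_rdd.get(rdd_id, {}).values())

    def partition_bitmask(self, rdd_id):
        """Bitmask view: bit i is set once partition i has been accumulated."""
        mask = 0
        for pid in self._by_rdd.get(rdd_id, {}):
            mask |= 1 << pid
        return mask


acc = ConsistentCounter()
acc.report(rdd_id=7, partition_id=0, value=10)
acc.report(rdd_id=7, partition_id=2, value=5)
acc.report(rdd_id=7, partition_id=0, value=10)  # recomputed, ignored
print(acc.value(7))                   # 15
print(bin(acc.partition_bitmask(7)))  # 0b101
```

    The per-partition map keeps enough state to show which partition 
contributed what; the "value plus bitmask" variant would store only the running 
total and the mask, which is cheaper but cannot undo or inspect a single 
partition's contribution.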
    
    cc @squito, who did the initial accumulator work and with whom I've gone 
over the design doc some.

