[GitHub] spark pull request #11105: [SPARK-12469][CORE] Data Property accumulators fo...

squito Thu, 03 Nov 2016 21:59:51 -0700

Github user squito commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11105#discussion_r86486669
  
    --- Diff: core/src/main/scala/org/apache/spark/TaskContext.scala ---
    @@ -69,6 +69,20 @@ object TaskContext {
       }
     }
     
    +/**
    + * Identifies where an update for a data property accumulator update came from.  This is important
    + * to ensure that updates are not double-counted when rdds get recomputed.  When the executors send
    + * back data property accumulator updates, they separate out the updates per rdd or shuffle output
    + * generated by the task. That gives the driver sufficient info to ensure that each update is
    + * counted once. The shuffleId is important since two separate shuffle actions could happen on the
    + * same RDD and the RDD ID and partition ID with different accumulators.
    + * Any accumulators used inside of `runJob` directly are always counted because there is no
    + * resubmission of `runJob`/`foreach`.
    --- End diff --
    
    Can you add something here about why this is important for tracking
    incomplete partitions, e.g. when `take()` repeatedly computes partitions
    only partially?
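
For readers following the thread, here is a minimal sketch of the "count each update once" idea the doc comment describes. It is not the code from this pull request: `UpdateSource`, `RddOutput`, `ShuffleOutput`, and `OnceCounter` are hypothetical names, and the assumption is simply that the driver remembers which (rdd or shuffle output, partition) sources it has already merged and ignores repeats from recomputation.

```scala
import scala.collection.mutable

// Hypothetical names throughout; none of these are Spark APIs.
// A source identifies which output of a task produced an update.
sealed trait UpdateSource
final case class RddOutput(rddId: Int, partitionId: Int) extends UpdateSource
final case class ShuffleOutput(shuffleId: Int, rddId: Int, partitionId: Int)
  extends UpdateSource

// Driver-side counter that applies at most one update per source.
final class OnceCounter {
  private val seen = mutable.Set.empty[UpdateSource]
  private var total = 0L

  // Returns true if the update was applied, false if it was a duplicate.
  def merge(source: UpdateSource, delta: Long): Boolean = {
    val isNew = seen.add(source) // add returns false if already present
    if (isNew) total += delta
    isNew
  }

  def value: Long = total
}

object OnceCounterDemo {
  def main(args: Array[String]): Unit = {
    val acc = new OnceCounter
    acc.merge(RddOutput(rddId = 1, partitionId = 0), 10L)
    // Same partition recomputed: the repeated update is ignored.
    acc.merge(RddOutput(rddId = 1, partitionId = 0), 10L)
    // Same RDD and partition, but a distinct shuffle output: counted.
    acc.merge(ShuffleOutput(shuffleId = 7, rddId = 1, partitionId = 0), 5L)
    println(acc.value) // prints 15
  }
}
```

Keying on the shuffle ID as well as the RDD and partition IDs keeps two shuffles over the same RDD from masking each other's updates. The sketch does not model partially computed partitions, which is the case the question above is about.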



