Github user squito commented on a diff in the pull request:
https://github.com/apache/spark/pull/11105#discussion_r86486669
--- Diff: core/src/main/scala/org/apache/spark/TaskContext.scala ---
@@ -69,6 +69,20 @@ object TaskContext {
}
}
+/**
+ * Identifies where an update for a data property accumulator update came
from. This is important
+ * to ensure that updates are not double-counted when rdds get recomputed.
When the executors send
+ * back data property acucmualtor updates, they seperate out the updates
per rdd or shuffle output
+ * generated by the task. That gives the driver sufficient info to ensure
that each update is
+ * counted once. The shuffleId is important since two seperate shuffle
actions could happen on the
+ * same RDD and the RDD ID and partition ID with different accumulators.
+ * Any accumulators used inside of `runJob` directly are always counted
because there is no
+ * resubmition of `runJob`/`foreach`.
--- End diff --
Can you add something here about why this is important for tracking
incomplete partitions, eg. with repeatedly partially computing partitions with
`take()`?
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]