Github user squito commented on a diff in the pull request:
    --- Diff: core/src/main/scala/org/apache/spark/TaskContextImpl.scala ---
    @@ -126,4 +126,14 @@ private[spark] class TaskContextImpl(
    +  private var rddPartitionInfo: (Int, Int, Int) = null
    --- End diff ---
    one thing I think really needs to be documented somewhere is *why* we need 
to use these compound ids to track things, why there is a shuffle id, etc.  
Also, carrying around this triple is a little cumbersome, so I was thinking 
that maybe it makes sense to create a case class for it, which would also 
identify its purpose.  I'm thinking of something like
     * Identifies where a data property accumulator update came from.  This is 
     * important to ensure that updates are not double-counted when RDDs get 
     * recomputed.  When the executors send back data property accumulator 
     * updates, they separate out the updates per RDD or shuffle output 
     * generated by the task.  That gives the driver sufficient info to ensure 
     * that each update is counted exactly once.
     * [longer description of some of the tricky cases here, especially why we 
     * care about shuffle id].
    sealed trait TaskOutputId
    case class RDDOutput(rddId: Int, partition: Int) extends TaskOutputId
    case class ShuffleMapOutput(shuffleId: Int, partition: Int) extends TaskOutputId
    It doesn't have to be a sealed trait; it could just be one class as well.  
Do you think this would help?
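
    To illustrate the idea, here is a minimal, hypothetical sketch (not the 
actual Spark implementation) of how the driver side might key updates by a 
`TaskOutputId` so that a recomputed or speculatively re-run partition does not 
get counted twice.  The names `AccumulatorDedup`, `mergeUpdate`, and `seen` 
are assumptions for illustration only:

    ```scala
    // Hypothetical sketch: deduplicating data property accumulator updates
    // by the output that produced them. Names here are illustrative, not
    // from the Spark codebase.
    sealed trait TaskOutputId
    case class RDDOutput(rddId: Int, partition: Int) extends TaskOutputId
    case class ShuffleMapOutput(shuffleId: Int, partition: Int) extends TaskOutputId

    object AccumulatorDedup {
      // Outputs whose updates have already been applied on the driver.
      private val seen = scala.collection.mutable.Set.empty[TaskOutputId]
      private var total = 0L

      // Apply an update only the first time its originating output is seen;
      // a recomputed partition reports the same TaskOutputId and is ignored.
      def mergeUpdate(origin: TaskOutputId, update: Long): Unit = {
        if (seen.add(origin)) {
          total += update
        }
      }

      def value: Long = total
    }
    ```

    Because the case classes have structural equality, a re-run of the same 
(rddId, partition) produces an equal key and its duplicate update is dropped.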
