Github user squito commented on a diff in the pull request:
    --- Diff: core/src/main/scala/org/apache/spark/TaskContextImpl.scala ---
    @@ -126,4 +126,14 @@ private[spark] class TaskContextImpl(
    +  private var rddPartitionInfo: (Int, Int, Int) = null
    --- End diff ---
    one thing I think really needs to be documented somewhere is *why* we need 
to use these compound ids to track things, why there is a shuffle id, etc.  
Also, carrying around this triple is a little cumbersome, so I was thinking 
that maybe it makes sense to create a case class for it, which would also 
identify its purpose.  I'm thinking of something like
     * Identifies where a data property accumulator update came from.  This is 
     * important to ensure that updates are not double-counted when RDDs get 
     * recomputed.  When the executors send back data property accumulator 
     * updates, they separate out the updates per RDD or shuffle output 
     * generated by the task.  That gives the driver sufficient info to ensure 
     * that each update is counted exactly once.
     * [longer description of some of the tricky cases here, especially why we 
     * care about shuffle id].
    sealed trait TaskOutputId
    case class RDDOutput(rddId: Int, partition: Int) extends TaskOutputId
    case class ShuffleMapOutput(shuffleId: Int, partition: Int) extends TaskOutputId
    It doesn't have to be a sealed trait; it could just be one class as well.  
Do you think this would help?
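
    To illustrate the idea, here is a minimal, hypothetical sketch (not the 
actual Spark implementation) of how the driver side might key updates by a 
`TaskOutputId` so that a recomputed or speculatively re-run partition does not 
get counted twice.  The names `AccumulatorDedup`, `mergeUpdate`, and `seen` 
are assumptions for illustration only:

    ```scala
    // Hypothetical sketch: deduplicating data property accumulator updates
    // by the output that produced them. Names here are illustrative, not
    // from the Spark codebase.
    sealed trait TaskOutputId
    case class RDDOutput(rddId: Int, partition: Int) extends TaskOutputId
    case class ShuffleMapOutput(shuffleId: Int, partition: Int) extends TaskOutputId

    object AccumulatorDedup {
      // Outputs whose updates have already been applied on the driver.
      private val seen = scala.collection.mutable.Set.empty[TaskOutputId]
      private var total = 0L

      // Apply an update only the first time its originating output is seen;
      // a recomputed partition reports the same TaskOutputId and is ignored.
      def mergeUpdate(origin: TaskOutputId, update: Long): Unit = {
        if (seen.add(origin)) {
          total += update
        }
      }

      def value: Long = total
    }
    ```

    Because the case classes have structural equality, a re-run of the same 
(rddId, partition) produces an equal key and its duplicate update is dropped.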
