Github user squito commented on a diff in the pull request:
    --- Diff: core/src/main/scala/org/apache/spark/TaskContextImpl.scala ---
    @@ -126,4 +126,14 @@ private[spark] class TaskContextImpl(
    +  private var rddPartitionInfo: (Int, Int, Int) = null
    --- End diff ---
    one thing I think really needs to be documented somewhere is *why* we need 
to use these compound ids to track things, why there is a shuffle id, etc.  
Also, carrying around this triple is a little cumbersome, so I was thinking 
that maybe it makes sense to create a case class for it, which would also 
identify its purpose.  I'm thinking of something like
     * Identifies where a data property accumulator update came from.  This is 
     * important to ensure that updates are not double-counted when RDDs get 
     * recomputed.  When the executors send back data property accumulator 
     * updates, they separate out the updates per RDD or shuffle output 
     * generated by the task.  That gives the driver sufficient info to ensure 
     * that each update is counted exactly once.
     * [longer description of some of the tricky cases here, especially why we 
     * care about shuffle id].
    sealed trait TaskOutputId
    case class RDDOutput(rddId: Int, partition: Int) extends TaskOutputId
    case class ShuffleMapOutput(shuffleId: Int, partition: Int) extends TaskOutputId
    It doesn't have to be a sealed trait; it could just be one class as well.  
Do you think this would help?
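
    To illustrate the idea, here is a minimal, hypothetical sketch (not the 
actual Spark implementation) of how the driver side might key updates by a 
`TaskOutputId` so that a recomputed or speculatively re-run partition does not 
get counted twice.  The names `AccumulatorDedup`, `mergeUpdate`, and `seen` 
are assumptions for illustration only:

    ```scala
    // Hypothetical sketch: deduplicating data property accumulator updates
    // by the output that produced them. Names here are illustrative, not
    // from the Spark codebase.
    sealed trait TaskOutputId
    case class RDDOutput(rddId: Int, partition: Int) extends TaskOutputId
    case class ShuffleMapOutput(shuffleId: Int, partition: Int) extends TaskOutputId

    object AccumulatorDedup {
      // Outputs whose updates have already been applied on the driver.
      private val seen = scala.collection.mutable.Set.empty[TaskOutputId]
      private var total = 0L

      // Apply an update only the first time its originating output is seen;
      // a recomputed partition reports the same TaskOutputId and is ignored.
      def mergeUpdate(origin: TaskOutputId, update: Long): Unit = {
        if (seen.add(origin)) {
          total += update
        }
      }

      def value: Long = total
    }
    ```

    Because the case classes have structural equality, a re-run of the same 
(rddId, partition) produces an equal key and its duplicate update is dropped.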
