Github user squito commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8180#discussion_r37672673
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
    @@ -47,15 +47,48 @@ import org.apache.spark.storage.BlockManagerMessages.BlockManagerHeartbeat
      * minimal schedule to run the job. It then submits stages as TaskSets to an underlying
      * TaskScheduler implementation that runs them on the cluster.
      *
    - * In addition to coming up with a DAG of stages, this class also determines the preferred
    + * Spark stages are created by breaking the RDD graph at shuffle boundaries. RDD operations with
    + * "narrow" dependencies, like map() and filter(), are pipelined together into one set of tasks
    + * in each stage, but operations with shuffle dependencies require multiple stages (one to write a
    + * set of map output files, and another to read those files after a barrier). In the end, every
    + * stage will have only shuffle dependencies on other stages, and may compute multiple operations
    + * inside it. The actual pipelining of these operations happens in the RDD.compute() functions of
    + * various RDDs (MappedRDD, FilteredRDD, etc).
    + *
    + * In addition to coming up with a DAG of stages, the DAGScheduler also determines the preferred
      * locations to run each task on, based on the current cache status, and passes these to the
      * low-level TaskScheduler. Furthermore, it handles failures due to shuffle output files being
      * lost, in which case old stages may need to be resubmitted. Failures *within* a stage that are
      * not caused by shuffle file loss are handled by the TaskScheduler, which will retry each task
      * a small number of times before cancelling the whole stage.
      *
    + * When looking through this code, there are several key concepts:
    + *
    + *  - Jobs (represented by [[ActiveJob]]) are the top-level work items submitted to the scheduler.
    + *    For example, when the user calls an action, like count(), a job will be submitted through
    + *    submitJob. Each Job may require the execution of multiple stages to build intermediate data.
    + *
    + *  - Stages ([[Stage]]) are sets of tasks that compute intermediate results in jobs, where each
    + *    task computes the same function on partitions of the same RDD. Stages are separated at shuffle
    + *    boundaries, which introduce a barrier (where we must wait for the previous stage to finish to
    + *    fetch outputs). There are two types of stages: [[ResultStage]], for the final stage that
    + *    executes an action, and [[ShuffleMapStage]], which writes map output files for a shuffle.
    + *    Stages are often shared across multiple jobs, if these jobs reuse the same RDDs.
    --- End diff --
    
    It's nice to see these expanded comments, but I think we really need to add a section on stage attempts. That is probably the most confusing part of the DAG scheduler and where most bugs occur.
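
    For readers following the thread, here is a minimal sketch (not part of the patch) of how the concepts in the proposed scaladoc show up in user code. The app name, local master, and input path are placeholder assumptions for illustration; the job/stage structure noted in the comments follows the quoted documentation.

        import org.apache.spark.{SparkConf, SparkContext}

        object StageBoundarySketch {
          def main(args: Array[String]): Unit = {
            // Local master and input path are placeholders for illustration.
            val conf = new SparkConf().setAppName("stage-boundary-sketch").setMaster("local[*]")
            val sc = new SparkContext(conf)

            // Narrow dependencies: flatMap(), filter() and map() are pipelined into the
            // tasks of a single stage.
            val pairs = sc.textFile("input.txt")
              .flatMap(_.split("\\s+"))
              .filter(_.nonEmpty)
              .map(word => (word, 1))

            // reduceByKey() introduces a shuffle dependency, so the RDD graph is broken here:
            // the stage before the barrier writes map output files (a ShuffleMapStage), and
            // the stage after it fetches those outputs.
            val counts = pairs.reduceByKey(_ + _)

            // count() is an action: it submits a job (an ActiveJob) through submitJob, and the
            // final stage that computes the action's result is a ResultStage.
            println(counts.count())

            // A second action on the same shuffled RDD submits a new job, but it can reuse the
            // earlier ShuffleMapStage output instead of recomputing it.
            println(counts.filter(_._2 > 1).count())

            sc.stop()
          }
        }

    Run locally, the first count() should involve two stages (a ShuffleMapStage plus a ResultStage), while the second job should skip the shuffle stage as long as its map outputs are still available.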

