Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2696#discussion_r18864057
  
    --- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
    @@ -853,6 +873,12 @@ class SparkContext(config: SparkConf) extends Logging {
       /** The version of Spark on which this application is running. */
       def version = SPARK_VERSION
     
    +  def getJobsIdsForGroup(jobGroup: String): Array[Int] = statusApi.jobIdsForGroup(jobGroup)
    +
    +  def getJobInfo(jobId: Int): Option[SparkJobInfo] = statusApi.newJobInfo(jobId)
    --- End diff ---
    
    The garbage collection / data retention semantics are the same as what's displayed in the Spark web UI, since both are built on top of the same listeners.  While a job is active, we keep information on it.  After the job completes or fails, a configurable maximum number of completed jobs and stages is retained.  I'll be sure to clearly document this.
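    For reference, a minimal sketch of the retention settings involved, assuming the same `spark.ui.*` properties that bound the web UI's listener state (property names and defaults here are illustrative, not confirmed by this PR):

    ```
    # spark-defaults.conf -- assumed retention properties shared with the web UI
    spark.ui.retainedJobs    1000   # max completed/failed jobs kept per application
    spark.ui.retainedStages  1000   # max completed/failed stages kept
    ```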
    
    Regarding snapshots / consistency, I added a note about this in one of my 
commit messages, reproduced here:
    
    ```
    - The "consistent snapshot of the entire job -> stage -> task mapping"
      semantics might be very expensive to implement for large jobs, so I've
      decided to remove chaining between SparkJobInfo and SparkStageInfo
      interfaces.  Concretely, this means that you can't write something like
    
         job.stages()(0).name
    
      to get the name of the first stage in a job.  Instead, you have to 
explicitly
      get the stage's ID from the job and then look up that stage using
      sc.getStageInfo().  This isn't to say that we can't implement methods like
      "getNumActiveStages" that reflect consistent state; the goal is mainly to
      avoid spending lots of time / memory to construct huge object graphs.
    ```
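    To illustrate the ID-based lookup pattern this implies, here's a hedged Scala sketch: `getJobInfo` / `getStageInfo` are from this PR, while the `stageIds` accessor on `SparkJobInfo` and the `name` field on `SparkStageInfo` are assumptions about the interfaces' shape.

    ```scala
    // Look up the name of a job's first stage without chaining object graphs.
    // Assumes sc: SparkContext and jobId: Int are in scope.
    val firstStageName: Option[String] =
      for {
        job     <- sc.getJobInfo(jobId)       // Option[SparkJobInfo]
        stageId <- job.stageIds.headOption    // assumed accessor for the job's stage IDs
        stage   <- sc.getStageInfo(stageId)   // assumed Option[SparkStageInfo] lookup
      } yield stage.name                      // assumed field on SparkStageInfo
    ```

    Each step returns an `Option`, so the lookup degrades gracefully if the job or stage has already been evicted by the retention limits described above.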
    
    My concern was that it may be expensive to snapshot large jobs with many 
stages and tasks.

