Hi,
I'm new to Spark and am trying to understand the DAGScheduler code flow. As far as I can tell, getMissingParentStages() does redundant work by recalculating stage dependencies.

When the first stage is created, all of its parent stages are calculated recursively and stored in the stage.parents member. Whenever a stage needs to be submitted, getMissingParentStages() is called to get the list of its parent stages that have not yet been computed. I expected getMissingParentStages() to go through stage.parents and check whether each parent has already been computed. Instead, the function does another graph traversal starting from stage.rdd, which seems unnecessary.

Is there a specific reason for this design? If not, I would like to redesign getMissingParentStages() to avoid the graph traversal.

Thanks,
Madhu
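
P.S. To make the idea concrete, here is a rough sketch of the lookup I have in mind. It is a toy model in Python, not Spark's actual Scala code: the Stage class below and its is_available flag are simplified stand-ins I made up for illustration, and only the names parents and getMissingParentStages come from the scheduler itself.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Stage:
    """Toy stand-in for a scheduler stage (hypothetical, not Spark's Stage class)."""
    id: int
    # Direct parent stages, assumed to be filled in once when the stage is created.
    parents: List["Stage"] = field(default_factory=list)
    # Assumed flag: True once this stage's output has been computed.
    is_available: bool = False


def get_missing_parent_stages(stage: Stage) -> List[Stage]:
    """Return the direct parents whose output has not been computed yet.

    Only the precomputed stage.parents list is read; nothing is re-derived
    from the RDD graph.
    """
    return [p for p in stage.parents if not p.is_available]


# Example: stage 2 depends on stages 0 and 1; only stage 0 has finished.
s0 = Stage(0, is_available=True)
s1 = Stage(1)
s2 = Stage(2, parents=[s0, s1])
missing_ids = [p.id for p in get_missing_parent_stages(s2)]
```

The sketch deliberately ignores anything the RDD-level traversal might additionally take into account, which is exactly the part I am unsure about.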