Question on DAGScheduler.getMissingParentStages()

Madhusudanan Kandasamy Tue, 08 Sep 2015 08:01:46 -0700

Hi,


I'm new to Spark and am trying to understand the DAGScheduler code flow. As
I understand it, getMissingParentStages() does the redundant job of
recalculating stage dependencies. When the first stage is created, all of
its dependent/parent stages are recursively calculated and stored in the
stage.parents member. Whenever a given stage needs to be submitted,
getMissingParentStages() is called to get the list of all un-computed
parent stages.

I expected getMissingParentStages() to go through stage.parents and check
whether each parent has already been computed. Instead, the function does
another graph traversal starting from stage.rdd, which seems unnecessary.
Is there a specific reason for this design? If not, I would like to
redesign getMissingParentStages() to avoid the graph traversal.
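
To make the comparison concrete, here is a rough sketch of the two approaches
on a toy model. It is written in Python for brevity and is not Spark's actual
code: the RDD and Stage classes, their fields (narrow_deps, shuffle_deps,
available) and both function names are made up for illustration, and the
traversal is only my reading of what the scheduler does.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(eq=False)  # identity-based equality/hash, so nodes can go in a set
class RDD:
    name: str
    narrow_deps: List["RDD"] = field(default_factory=list)      # same-stage parents
    shuffle_deps: List["Stage"] = field(default_factory=list)   # stage boundaries


@dataclass(eq=False)
class Stage:
    id: int
    rdd: RDD
    parents: List["Stage"] = field(default_factory=list)  # filled at creation time
    available: bool = False                               # output already computed?


def missing_parents_by_traversal(stage: Stage) -> List[Stage]:
    """Walk the RDD graph from stage.rdd, collecting uncomputed parent stages."""
    missing, visited, stack = [], set(), [stage.rdd]
    while stack:
        rdd = stack.pop()
        if rdd in visited:
            continue
        visited.add(rdd)
        for parent in rdd.shuffle_deps:
            if not parent.available and parent not in missing:
                missing.append(parent)
        stack.extend(rdd.narrow_deps)  # keep walking within the same stage
    return missing


def missing_parents_from_parents(stage: Stage) -> List[Stage]:
    """Proposed alternative: reuse the parents recorded when the stage was created."""
    return [p for p in stage.parents if not p.available]


# Toy job: stage 2 depends on stage 0 (already computed) and stage 1 (not yet).
s0 = Stage(0, RDD("a"), available=True)
s1 = Stage(1, RDD("b"))
rdd_c = RDD("c", shuffle_deps=[s0])
rdd_d = RDD("d", narrow_deps=[rdd_c], shuffle_deps=[s1])
final = Stage(2, rdd_d, parents=[s0, s1])

by_traversal = [s.id for s in missing_parents_by_traversal(final)]
by_parents = [s.id for s in missing_parents_from_parents(final)]
```

In this toy case both functions report the same missing parent, which is
what makes the second traversal look redundant to me.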

Thanks,
Madhu.
