[ https://issues.apache.org/jira/browse/SPARK-7308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andrew Or updated SPARK-7308:
-----------------------------

    Assignee: Davies Liu

> Should there be multiple concurrent attempts for one stage?
> -----------------------------------------------------------
>
>                 Key: SPARK-7308
>                 URL: https://issues.apache.org/jira/browse/SPARK-7308
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.3.1
>            Reporter: Imran Rashid
>            Assignee: Davies Liu
>             Fix For: 1.5.3, 1.6.0
>
>         Attachments: SPARK-7308_discussion.pdf
>
>
> Currently, when there is a fetch failure, you can end up with multiple
> concurrent attempts for the same stage. Is this intended? At best, it leads
> to some very confusing behavior, and it makes it hard for the user to make
> sense of what is going on. At worst, I think this is the cause of some very
> strange errors we've seen from users, where stages start executing before
> all the dependent stages have completed.
>
> This can happen in the following scenario: there is a fetch failure in
> attempt 0, so the stage is retried. Attempt 1 starts. But tasks from
> attempt 0 are still running -- some of them can also hit fetch failures
> after attempt 1 starts. That will cause additional stage attempts to get
> fired up.
>
> There is already an attempt to handle this:
> https://github.com/apache/spark/blob/16860327286bc08b4e2283d51b4c8fe024ba5006/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1105
> but it only checks whether the **stage** is running. It really should
> check whether that **attempt** is still running, but there isn't enough
> info to do that.
>
> I'll also post some info on how to reproduce this.
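To make the difference between the stage-level check and the attempt-level
check more concrete, here is a minimal, self-contained Scala sketch. This is
NOT Spark's DAGScheduler code; all names (ConcurrentAttemptSketch, Scheduler,
FetchFailure, latestAttempt) are hypothetical stand-ins used only to
illustrate the idea under the assumptions in the report.

    import scala.collection.mutable

    // Sketch only: models a scheduler that tracks the latest attempt started
    // for each running stage and compares two ways of reacting to a fetch
    // failure reported by a task.
    object ConcurrentAttemptSketch {

      case class FetchFailure(stageId: Int, stageAttemptId: Int)

      class Scheduler {
        // Latest attempt number started for each currently running stage.
        private val latestAttempt = mutable.Map[Int, Int]()

        def startAttempt(stageId: Int): Int = {
          val next = latestAttempt.getOrElse(stageId, -1) + 1
          latestAttempt(stageId) = next
          next
        }

        // Stage-level check (roughly what the linked line does): any fetch
        // failure for a stage that is still running triggers a retry, even if
        // it was reported by a leftover task from an older attempt.
        def shouldRetryStageLevel(f: FetchFailure): Boolean =
          latestAttempt.contains(f.stageId)

        // Attempt-level check the report argues for: only failures reported
        // by the latest attempt of the stage may trigger a retry, so stale
        // tasks cannot spawn additional concurrent attempts.
        def shouldRetryAttemptLevel(f: FetchFailure): Boolean =
          latestAttempt.get(f.stageId).contains(f.stageAttemptId)
      }

      def main(args: Array[String]): Unit = {
        val sched = new Scheduler
        sched.startAttempt(stageId = 2)   // attempt 0 hits a fetch failure ...
        sched.startAttempt(stageId = 2)   // ... so attempt 1 is started

        // A task left over from attempt 0 now reports another fetch failure.
        val stale = FetchFailure(stageId = 2, stageAttemptId = 0)
        println(sched.shouldRetryStageLevel(stale))   // true  -> attempt 2 fired up
        println(sched.shouldRetryAttemptLevel(stale)) // false -> stale failure ignored
      }
    }

The sketch assumes the scheduler can tell which attempt a failed task belongs
to; as the report notes, the real code path does not currently have enough
information to make that attempt-level decision.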