[
https://issues.apache.org/jira/browse/SPARK-14736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hyukjin Kwon updated SPARK-14736:
---------------------------------
Labels: bulk-closed (was: )
> Deadlock in registering applications while the Master is in the RECOVERING
> mode
> -------------------------------------------------------------------------------
>
> Key: SPARK-14736
> URL: https://issues.apache.org/jira/browse/SPARK-14736
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.4.1, 1.5.0, 1.6.0
> Environment: unix, Spark cluster with a custom
> StandaloneRecoveryModeFactory and a custom PersistenceEngine
> Reporter: niranda perera
> Priority: Critical
> Labels: bulk-closed
>
> I have encountered the following issue in standalone recovery mode.
> Say an application A is running in the cluster. Due to some issue, the
> entire cluster goes down, together with application A.
> Later the cluster comes back online and the master enters the RECOVERING
> mode, because the persistence engine reports apps, workers and drivers
> that were previously in the cluster. While recovery is in progress, the
> application comes back online, but now with a different ID, say B.
> However, under the master's application registration logic, application B
> is NOT added to 'waitingApps'; the master only logs the message "Attempted
> to re-register application at same address". [1]
>   private def registerApplication(app: ApplicationInfo): Unit = {
>     val appAddress = app.driver.address
>     if (addressToApp.contains(appAddress)) {
>       logInfo("Attempted to re-register application at same address: " + appAddress)
>       return
>     }
>     // ... (remainder of the method omitted)
>   }
> The problem is that the master is trying to recover application A, which no
> longer exists. After the recovery process completes, app A is therefore
> dropped. However, app A's successor, app B, was also left out of the
> 'waitingApps' list because it has the same address that app A had.
> This creates a deadlock in the cluster: neither app A nor app B is
> available.
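
The sequence described above can be reproduced with a small toy model. This is only a sketch: `AppInfo`, `ToyMaster`, `dropUnresponsive` and `DeadlockDemo` are hypothetical names made up for illustration, not Spark classes, and the model assumes that a recovered app occupies its old driver address until recovery finishes, as the report describes.

```scala
import scala.collection.mutable

// Hypothetical stand-ins for illustration; these are not Spark's real classes.
final case class AppInfo(id: String, driverAddress: String)

final class ToyMaster {
  val addressToApp = mutable.HashMap.empty[String, AppInfo]
  val waitingApps = mutable.ArrayBuffer.empty[AppInfo]

  // Mirrors the address check quoted in the report: a second registration
  // from an address that is already known is rejected.
  def registerApplication(app: AppInfo): Boolean =
    if (addressToApp.contains(app.driverAddress)) {
      false // "Attempted to re-register application at same address"
    } else {
      addressToApp(app.driverAddress) = app
      waitingApps += app
      true
    }

  // Models the end of recovery: an app that never responded is dropped.
  def dropUnresponsive(appId: String): Unit =
    waitingApps.find(_.id == appId).foreach { app =>
      waitingApps -= app
      addressToApp -= app.driverAddress
    }
}

object DeadlockDemo {
  // Returns whether B was accepted, and the IDs left in waitingApps.
  def run(): (Boolean, List[String]) = {
    val master = new ToyMaster
    // App A is restored from the persistence engine under its old address.
    master.registerApplication(AppInfo("A", "host:7077"))
    // The live successor B registers from the same address during recovery.
    val accepted = master.registerApplication(AppInfo("B", "host:7077"))
    // Recovery finishes; A never responded, so it is removed.
    master.dropUnresponsive("A")
    (accepted, master.waitingApps.map(_.id).toList)
  }
}
```

In this model B is rejected while A still holds the address, and A is removed afterwards, so `waitingApps` ends up empty.
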
> When the master is in the RECOVERING mode, shouldn't it first add all
> registering apps to a list and then, once recovery has completed and the
> unsuccessful recoveries have been removed, deploy the apps that are new?
> I think this would resolve the deadlock.
> [1]
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L834
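
For discussion, the deferral the reporter proposes might look roughly like the sketch below. It is not Spark's Master code: `AppEntry`, `DeferringMaster`, `restore`, `register`, `completeRecovery` and the `responded` parameter are all hypothetical names, and the model assumes that the master knows, when recovery ends, which restored apps have answered.

```scala
import scala.collection.mutable

// Hypothetical names for illustration only; this is not Spark's Master API.
final case class AppEntry(id: String, driverAddress: String)

final class DeferringMaster {
  private var recovering = true
  private val addressToApp = mutable.HashMap.empty[String, AppEntry]
  private val pending = mutable.ArrayBuffer.empty[AppEntry]
  val waitingApps = mutable.ArrayBuffer.empty[AppEntry]

  // Apps restored from the persistence engine occupy their old address.
  def restore(app: AppEntry): Unit = admit(app)

  // While recovering, park the registration instead of rejecting it.
  def register(app: AppEntry): Unit =
    if (recovering) pending += app else admit(app)

  // After recovery: drop restored apps that never responded, then admit
  // the parked registrations against the cleaned-up address table.
  def completeRecovery(responded: Set[String]): Unit = {
    val dead = waitingApps.filterNot(a => responded.contains(a.id)).toList
    dead.foreach { a =>
      waitingApps -= a
      addressToApp -= a.driverAddress
    }
    recovering = false
    pending.toList.foreach(admit)
    pending.clear()
  }

  // Same address check as before, applied only once the table is current.
  private def admit(app: AppEntry): Unit =
    if (!addressToApp.contains(app.driverAddress)) {
      addressToApp(app.driverAddress) = app
      waitingApps += app
    }
}
```

With this ordering, restoring A, registering B from the same address during recovery, and then calling `completeRecovery(Set.empty)` leaves B in `waitingApps`. If A had responded, B would still be rejected by the address check, which matches the existing behaviour for a genuine duplicate.
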