[ https://issues.apache.org/jira/browse/SPARK-9256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Colin Scott updated SPARK-9256:
-------------------------------
Priority: Major (was: Minor)
> Message delay causes Master crash upon registering application
> --------------------------------------------------------------
>
> Key: SPARK-9256
> URL: https://issues.apache.org/jira/browse/SPARK-9256
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Reporter: Colin Scott
> Original Estimate: 1h
> Remaining Estimate: 1h
>
> This bug occurs when `spark.deploy.recoveryMode` is set to "FILESYSTEM", and
> I believe it is only possible to trigger in production when the AppClient and
> Master are on different machines.
> As part of initialization, the AppClient
> [registers|https://github.com/apache/spark/blob/3bee0f1466ddd69f26e95297b5e0d2398b6c6268/core/src/main/scala/org/apache/spark/deploy/client/AppClient.scala#L124]
> with the Master by repeatedly sending a RegisterApplication message until it
> receives a RegisteredApplication response.
> If the RegisteredApplication response is delayed by at least
> REGISTRATION_TIMEOUT_SECONDS (or if the network duplicates the
> RegisterApplication RPC), it is possible for the Master to receive *two*
> RegisterApplication messages for the same AppClient.
> Upon receiving the second RegisterApplication message, the Master
> [attempts|https://github.com/apache/spark/blob/3bee0f1466ddd69f26e95297b5e0d2398b6c6268/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L274]
> to persist the ApplicationInfo to disk. Since the file already exists,
> FileSystemPersistenceEngine
> [throws|https://github.com/apache/spark/blob/3bee0f1466ddd69f26e95297b5e0d2398b6c6268/core/src/main/scala/org/apache/spark/deploy/master/FileSystemPersistenceEngine.scala#L59]
> an IllegalStateException, and the Master crashes.
> Incidentally, it appears that there is already a
> [TODO|https://github.com/apache/spark/blob/3bee0f1466ddd69f26e95297b5e0d2398b6c6268/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L266]
> in the code to handle this scenario.
> I can reproduce this bug on an old version of Spark (1.0.1), and from
> inspecting the latest version of the code it still appears possible to
> trigger. Let me know if you would like reproduction steps for the old
> version of Spark.
> It should be possible to trigger this bug even if the underlying transport
> protocol is TCP: TCP guarantees in-order delivery within each direction of
> a connection, but not across the two directions, so a retried
> RegisterApplication can cross a delayed RegisteredApplication on the wire.
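
A minimal sketch of the race described above, written in Python rather than Spark's Scala for brevity. MasterStub, FileSystemPersistenceStub, simulate, the dedupe flag, and the choice to key the persisted file by client are all hypothetical simplifications, not Spark's actual classes or logic; RuntimeError stands in for the IllegalStateException. It assumes fixed one-way delays and a client that retries once per timeout until the first response arrives.

```python
class FileSystemPersistenceStub:
    """Hypothetical stand-in for a file-backed engine that refuses to overwrite."""

    def __init__(self):
        self.files = {}

    def persist(self, name, value):
        if name in self.files:
            # Stands in for the IllegalStateException described in the report.
            raise RuntimeError("Could not create file: " + name)
        self.files[name] = value


class MasterStub:
    """Hypothetical Master that persists every registration it receives."""

    def __init__(self, engine, dedupe=False):
        self.engine = engine
        self.dedupe = dedupe
        self.registered = set()

    def on_register_application(self, client_id):
        if self.dedupe and client_id in self.registered:
            # Assumed guard: treat a repeated registration as idempotent.
            return "RegisteredApplication"
        self.engine.persist("app_" + client_id, {"client": client_id})
        self.registered.add(client_id)
        return "RegisteredApplication"


def simulate(timeout, request_delay, response_delay, dedupe):
    """Return how many RegisterApplication messages reach the Master.

    The client sends at t = 0 and again at every timeout that elapses
    before the first response arrives (exact ties are not modeled).
    """
    master = MasterStub(FileSystemPersistenceStub(), dedupe=dedupe)
    ack_time = request_delay + response_delay  # first response reaches the client
    send_times = []
    t = 0
    while t < ack_time:
        send_times.append(t)
        t += timeout
    for _ in send_times:
        # Each send arrives request_delay later; order is preserved per direction.
        master.on_register_application("client-1")
    return len(send_times)
```

With a timeout of 20 units, simulate(20, 1, 1, dedupe=False) delivers a single message, while simulate(20, 1, 25, dedupe=False) delivers a second one and raises in the stub. Passing dedupe=True illustrates one possible guard: re-acknowledge a repeated registration instead of persisting it again.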