[
https://issues.apache.org/jira/browse/SPARK-22218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Marcelo Vanzin updated SPARK-22218:
-----------------------------------
Affects Version/s: (was: 2.2.0)
2.2.1
> spark shuffle service fails to update secret on application re-attempts
> -----------------------------------------------------------------------
>
> Key: SPARK-22218
> URL: https://issues.apache.org/jira/browse/SPARK-22218
> Project: Spark
> Issue Type: Bug
> Components: Shuffle, YARN
> Affects Versions: 2.2.1
> Reporter: Thomas Graves
> Priority: Blocker
> Fix For: 2.2.1, 2.3.0
>
>
> Running on YARN, if an application has re-attempts while using the Spark 2.2
> shuffle service, the external shuffle service does not update the credentials
> properly, and the re-attempts fail with javax.security.sasl.SaslException.
> A bug fixed in 2.2 (SPARK-21494) changed the ShuffleSecretManager to use
> containsKey
> (https://git.corp.yahoo.com/hadoop/spark/blob/yspark_2_2_0/common/network-shuffle/src/main/java/org/apache/spark/network/sasl/ShuffleSecretManager.java#L50),
> which is the proper behavior for a single attempt. The problem is that the key
> is never removed between application re-attempts. When the second attempt
> starts, the code sees the key as already present (since the application id is
> the same across attempts) and never updates the secret.
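> For illustration, a minimal sketch of the failure mode (hypothetical class and
> method names, not the actual ShuffleSecretManager source): it contrasts the
> containsKey guard, which silently keeps the stale secret on a re-attempt, with
> one possible fix of unconditionally overwriting the stored secret so a
> re-attempt that reuses the application id authenticates with its new key.
> {code:java}
> import java.util.concurrent.ConcurrentHashMap;
>
> // Sketch only: on YARN every attempt of an application shares the same
> // application id, so a secret left behind by attempt 1 shadows attempt 2's.
> public class ShuffleSecretSketch {
>   private final ConcurrentHashMap<String, String> shuffleSecretMap =
>       new ConcurrentHashMap<>();
>
>   // Buggy variant: the containsKey guard ignores the new secret
>   // when the application id is already registered.
>   public void registerAppBuggy(String appId, String secret) {
>     if (!shuffleSecretMap.containsKey(appId)) {
>       shuffleSecretMap.put(appId, secret);
>     }
>   }
>
>   // Fixed variant: always store the latest secret for the id.
>   public void registerAppFixed(String appId, String secret) {
>     shuffleSecretMap.put(appId, secret);
>   }
>
>   public String getSecretKey(String appId) {
>     return shuffleSecretMap.get(appId);
>   }
>
>   public static void main(String[] args) {
>     ShuffleSecretSketch m = new ShuffleSecretSketch();
>     m.registerAppBuggy("application_1", "secret-attempt-1");
>     m.registerAppBuggy("application_1", "secret-attempt-2"); // ignored
>     System.out.println(m.getSecretKey("application_1")); // stale: secret-attempt-1
>
>     ShuffleSecretSketch f = new ShuffleSecretSketch();
>     f.registerAppFixed("application_1", "secret-attempt-1");
>     f.registerAppFixed("application_1", "secret-attempt-2"); // overwritten
>     System.out.println(f.getSecretKey("application_1")); // secret-attempt-2
>   }
> }
> {code}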
> To reproduce this, run something like a word count with the output directory
> already existing. The first attempt fails because the output directory exists;
> the subsequent attempts fail with the maximum number of executor failures.
> Note that this assumes the second and third attempts run on the same node as
> the first attempt.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]