[
https://issues.apache.org/jira/browse/SPARK-22218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Thomas Graves updated SPARK-22218:
----------------------------------
Description:
Running on yarn, If you have any application re-attempts using the spark 2.2
shuffle service, the external shuffle service does not update the credentials
properly and the application re-attempts fail with
javax.security.sasl.SaslException.
A bug was fixed in 2.2 (SPARK-21494) where it changed the ShuffleSecretManager
to use containsKey
(https://git.corp.yahoo.com/hadoop/spark/blob/yspark_2_2_0/common/network-shuffle/src/main/java/org/apache/spark/network/sasl/ShuffleSecretManager.java#L50)
, which is the proper behavior, the problem is that between application
re-attempts it never removes the key. So when the second attempt starts, the
code says it already contains the key (since the application id is the same)
and it doesn't update the secret properly.
to reproduce this you can run something like a word count and have the
directory already existing. The first attempt will fail because the output
directory exists, the subsequent attempts will fail with max number of executor
failures. Note that this is assuming the second and third attempts run on the
same node as the first attempt.
was:
Running on yarn, If you have any application re-attempts using the spark 2.2
shuffle service, the external shuffle service does not update the credentials
properly and the application re-attempts fail with
javax.security.sasl.SaslException.
A bug was fixed in 2.2 (SPARK-21494) where it changed the ShuffleSecretManager
to use containsKey
(https://git.corp.yahoo.com/hadoop/spark/blob/yspark_2_2_0/common/network-shuffle/src/main/java/org/apache/spark/network/sasl/ShuffleSecretManager.java#L50)
, which is the proper behavior, the problem is that between application
re-attempts it never removes the key. So when the second attempt starts, the
code says it already contains the key (since the application id is the same)
and it doesn't update the secret properly.
> spark shuffle services fails to update secret on application re-attempts
> ------------------------------------------------------------------------
>
> Key: SPARK-22218
> URL: https://issues.apache.org/jira/browse/SPARK-22218
> Project: Spark
> Issue Type: Bug
> Components: Shuffle, YARN
> Affects Versions: 2.2.0
> Reporter: Thomas Graves
> Priority: Blocker
>
> Running on yarn, If you have any application re-attempts using the spark 2.2
> shuffle service, the external shuffle service does not update the credentials
> properly and the application re-attempts fail with
> javax.security.sasl.SaslException.
> A bug was fixed in 2.2 (SPARK-21494) where it changed the
> ShuffleSecretManager to use containsKey
> (https://git.corp.yahoo.com/hadoop/spark/blob/yspark_2_2_0/common/network-shuffle/src/main/java/org/apache/spark/network/sasl/ShuffleSecretManager.java#L50)
> , which is the proper behavior, the problem is that between application
> re-attempts it never removes the key. So when the second attempt starts, the
> code says it already contains the key (since the application id is the same)
> and it doesn't update the secret properly.
> to reproduce this you can run something like a word count and have the
> directory already existing. The first attempt will fail because the output
> directory exists, the subsequent attempts will fail with max number of
> executor failures. Note that this is assuming the second and third attempts
> run on the same node as the first attempt.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]