[ https://issues.apache.org/jira/browse/SPARK-22218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Graves updated SPARK-22218:
----------------------------------
    Description: 
Running on YARN, if an application has any re-attempts while using the Spark 2.2 
shuffle service, the external shuffle service does not update the credentials 
properly and the application re-attempts fail with 
javax.security.sasl.SaslException.

A bug was fixed in 2.2 (SPARK-21494) that changed the ShuffleSecretManager 
to use containsKey 
(https://git.corp.yahoo.com/hadoop/spark/blob/yspark_2_2_0/common/network-shuffle/src/main/java/org/apache/spark/network/sasl/ShuffleSecretManager.java#L50), 
which is the proper behavior. The problem is that the key is never removed 
between application re-attempts, so when the second attempt starts, the code 
sees that it already contains the key (since the application id is the same) 
and doesn't update the secret.
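
The faulty check can be sketched as follows. This is a minimal illustration of 
the pattern described above, not the actual Spark source; the class and method 
names are hypothetical:

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the secret-registration pattern described above.
public class SecretManagerSketch {
    private final ConcurrentHashMap<String, String> shuffleSecretMap =
        new ConcurrentHashMap<>();

    // Buggy: skips the update when the appId is already present, so a
    // re-attempt that reuses the same appId keeps the stale secret.
    public void registerAppBuggy(String appId, String secret) {
        if (!shuffleSecretMap.containsKey(appId)) {
            shuffleSecretMap.put(appId, secret);
        }
    }

    // Fixed: always store the latest secret for the appId, so a re-attempt
    // replaces the secret from the previous attempt.
    public void registerAppFixed(String appId, String secret) {
        shuffleSecretMap.put(appId, secret);
    }

    public String getSecret(String appId) {
        return shuffleSecretMap.get(appId);
    }
}
```

With the buggy variant, a second attempt registering a new secret under the 
same application id is silently ignored, which matches the SaslException 
failure mode described above.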

To reproduce this, run something like a word count with the output directory 
already existing. The first attempt will fail because the output directory 
exists, and the subsequent attempts will fail with the maximum number of 
executor failures. Note that this assumes the second and third attempts run on 
the same node as the first attempt.
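
The reproduction above can be sketched roughly as below. The jar, class name, 
and paths are placeholders, not from the issue; the spark-submit flags shown 
(external shuffle service on, multiple YARN attempts allowed) are the 
conditions the bug needs:

```
# Pre-create the output directory so the first attempt fails fast
hdfs dfs -mkdir -p /user/me/wc-output

# Submit a word count that writes to that directory (placeholder jar/class);
# later attempts then hit the stale shuffle secret on the same node
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.yarn.maxAppAttempts=3 \
  --class com.example.WordCount \
  wordcount.jar /user/me/wc-input /user/me/wc-output
```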

  was:
Running on YARN, if an application has any re-attempts while using the Spark 2.2 
shuffle service, the external shuffle service does not update the credentials 
properly and the application re-attempts fail with 
javax.security.sasl.SaslException.

A bug was fixed in 2.2 (SPARK-21494) that changed the ShuffleSecretManager 
to use containsKey 
(https://git.corp.yahoo.com/hadoop/spark/blob/yspark_2_2_0/common/network-shuffle/src/main/java/org/apache/spark/network/sasl/ShuffleSecretManager.java#L50), 
which is the proper behavior. The problem is that the key is never removed 
between application re-attempts, so when the second attempt starts, the code 
sees that it already contains the key (since the application id is the same) 
and doesn't update the secret.


> Spark shuffle service fails to update secret on application re-attempts
> -----------------------------------------------------------------------
>
>                 Key: SPARK-22218
>                 URL: https://issues.apache.org/jira/browse/SPARK-22218
>             Project: Spark
>          Issue Type: Bug
>          Components: Shuffle, YARN
>    Affects Versions: 2.2.0
>            Reporter: Thomas Graves
>            Priority: Blocker
>



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
