[ 
https://issues.apache.org/jira/browse/SPARK-23361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao resolved SPARK-23361.
---------------------------------
       Resolution: Fixed
    Fix Version/s: 2.4.0

Issue resolved by pull request 20657
[https://github.com/apache/spark/pull/20657]

> Driver restart fails if it happens after 7 days from app submission
> -------------------------------------------------------------------
>
>                 Key: SPARK-23361
>                 URL: https://issues.apache.org/jira/browse/SPARK-23361
>             Project: Spark
>          Issue Type: Bug
>          Components: YARN
>    Affects Versions: 2.1.0
>            Reporter: Marcelo Vanzin
>            Assignee: Marcelo Vanzin
>            Priority: Major
>             Fix For: 2.4.0
>
>
> If you submit an app that is supposed to run for > 7 days (so using 
> \-\principal / \-\-keytab in cluster mode), and there's a failure that causes 
> the driver to restart after 7 days (that being the default token lifetime for 
> HDFS), the new driver will fail with an error like the following:
> {noformat}
> Exception in thread "main" 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
>  token (lots of uninteresting token info) can't be found in cache
>       at org.apache.hadoop.ipc.Client.call(Client.java:1472)
>       at org.apache.hadoop.ipc.Client.call(Client.java:1409)
>       at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
>       at com.sun.proxy.$Proxy16.getFileInfo(Unknown Source)
>       at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:771)
>       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>       at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>       at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>       at java.lang.reflect.Method.invoke(Method.java:606)
>       at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256)
>       at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
>       at com.sun.proxy.$Proxy17.getFileInfo(Unknown Source)
>       at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:2123)
>       at 
> org.apache.hadoop.hdfs.DistributedFileSystem$20.doCall(DistributedFileSystem.java:1253)
>       at 
> org.apache.hadoop.hdfs.DistributedFileSystem$20.doCall(DistributedFileSystem.java:1249)
>       at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>       at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1249)
>       at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$6.apply(ApplicationMaster.scala:160)
> {noformat}
> Note: lines may not align with actual Apache code because that's our internal 
> build.
> This happens because as part of the app submission, the launcher provides 
> delegation tokens to be used by the AM (=driver in this case), and those are 
> expired at that point in time.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to