[ https://issues.apache.org/jira/browse/SPARK-31347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17077533#comment-17077533 ]

Babble Shack edited comment on SPARK-31347 at 4/7/20, 8:06 PM:
---------------------------------------------------------------

I was able to resolve this by adding the HA configs that YARN uses to address its subclusters to the spark-defaults.conf file, e.g.:
{code:java}
## YARN Configs
spark.hadoop.yarn.resourcemanager.address yarn-master-0.yarn-service.yarn-subcluster-a:8050
spark.hadoop.yarn.resourcemanager.scheduler.address 0.0.0.0:8049
spark.hadoop.yarn.federation.enabled true
#spark.hadoop.yarn.resourcemanager.ha.enabled false
spark.hadoop.yarn.resourcemanager.ha.rm-ids yarn-subcluster-a,yarn-subcluster-b,yarn-subcluster-c
spark.hadoop.yarn.resourcemanager.hostname.yarn-subcluster-a yarn-master-0.yarn-service.yarn-subcluster-a
spark.hadoop.yarn.resourcemanager.webapp.address.yarn-subcluster-a yarn-master-0.yarn-service.yarn-subcluster-a:8088
spark.hadoop.yarn.resourcemanager.address.yarn-subcluster-a yarn-master-0.yarn-service.yarn-subcluster-a:8032
spark.hadoop.yarn.resourcemanager.scheduler.address.yarn-subcluster-a yarn-master-0.yarn-service.yarn-subcluster-a:8030
spark.hadoop.yarn.resourcemanager.hostname.yarn-subcluster-b yarn-master-0.yarn-service.yarn-subcluster-b
spark.hadoop.yarn.resourcemanager.webapp.address.yarn-subcluster-b yarn-master-0.yarn-service.yarn-subcluster-b:8088
spark.hadoop.yarn.resourcemanager.address.yarn-subcluster-b yarn-master-0.yarn-service.yarn-subcluster-b:8032
spark.hadoop.yarn.resourcemanager.scheduler.address.yarn-subcluster-b yarn-master-0.yarn-service.yarn-subcluster-b:8030
spark.hadoop.yarn.resourcemanager.hostname.yarn-subcluster-c yarn-master-0.yarn-service.yarn-subcluster-c
spark.hadoop.yarn.resourcemanager.webapp.address.yarn-subcluster-c yarn-master-0.yarn-service.yarn-subcluster-c:8088
spark.hadoop.yarn.resourcemanager.address.yarn-subcluster-c yarn-master-0.yarn-service.yarn-subcluster-c:8032
spark.hadoop.yarn.resourcemanager.scheduler.address.yarn-subcluster-c yarn-master-0.yarn-service.yarn-subcluster-c:8030{code}
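For reference, a minimal sketch of what the equivalent addressing would look like in the yarn-site.xml under $HADOOP_CONF_DIR (hostnames and subcluster ids are from my setup, and only subcluster-a is shown; the -b/-c entries follow the same pattern):
{code:xml}
<configuration>
  <property>
    <name>yarn.federation.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.resourcemanager.ha.rm-ids</name>
    <value>yarn-subcluster-a,yarn-subcluster-b,yarn-subcluster-c</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname.yarn-subcluster-a</name>
    <value>yarn-master-0.yarn-service.yarn-subcluster-a</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address.yarn-subcluster-a</name>
    <value>yarn-master-0.yarn-service.yarn-subcluster-a:8032</value>
  </property>
  <!-- webapp/scheduler addresses and the subcluster-b/-c entries follow the same pattern -->
</configuration>
{code}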
Perhaps Spark should read these from the yarn-site.xml located at $HADOOP_CONF_DIR automatically, whenever either `yarn.resourcemanager.ha.enabled` or `yarn.federation.enabled` is set to true there. This could happen in the YarnClusterSuite.scala class.
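
A minimal sketch of that idea, assuming a hypothetical helper (propagateYarnHaConf is my name, not an existing Spark function) that mirrors the RM addressing properties from the yarn-site.xml on the classpath into the SparkConf as spark.hadoop.* entries:
{code:scala}
import scala.collection.JavaConverters._
import org.apache.hadoop.yarn.conf.YarnConfiguration
import org.apache.spark.SparkConf

// Hypothetical helper, not part of Spark: when HA or federation is enabled in
// the yarn-site.xml picked up from the classpath (e.g. $HADOOP_CONF_DIR),
// copy the ResourceManager addressing properties into the SparkConf as
// spark.hadoop.* keys so the AM can resolve the subcluster RMs.
def propagateYarnHaConf(sparkConf: SparkConf): SparkConf = {
  val yarnConf = new YarnConfiguration() // loads yarn-site.xml if present on the classpath
  val haEnabled = yarnConf.getBoolean(YarnConfiguration.RM_HA_ENABLED, false)
  val federationEnabled = yarnConf.getBoolean("yarn.federation.enabled", false)
  if (haEnabled || federationEnabled) {
    yarnConf.iterator().asScala
      .filter(_.getKey.startsWith("yarn.resourcemanager."))
      .foreach { e =>
        // setIfMissing keeps any value the user already set in spark-defaults.conf
        sparkConf.setIfMissing("spark.hadoop." + e.getKey, e.getValue)
      }
  }
  sparkConf
}
{code}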

 


> Unable to run Spark Job on Federated Yarn Cluster, AMRMToken invalid
> --------------------------------------------------------------------
>
>                 Key: SPARK-31347
>                 URL: https://issues.apache.org/jira/browse/SPARK-31347
>             Project: Spark
>          Issue Type: Bug
>          Components: Deploy, YARN
>    Affects Versions: 3.0.0
>            Reporter: Babble Shack
>            Priority: Major
>         Attachments: mapred.log, mapred.out, router-yarn-site.xml, 
> spark.debug.log, spark.log, spark.out
>
>
> Running Spark on YARN 3.2.1 in a federated cluster, the ApplicationMaster 
> fails to register with the ResourceManager and throws an InvalidToken 
> exception.
> {code:java}
> root@yarn-master-0:/hadoop/spark# HADOOP_CONF_DIR=/hadoop/federation/router ./bin/spark-submit \
>   --class org.apache.spark.examples.SparkPi \
>   --master yarn \
>   --deploy-mode cluster \
>   --driver-memory 4g \
>   --executor-memory 2g \
>   --executor-cores 1 \
>   --queue default \
>   examples/jars/spark-examples*.jar 10
> 2020-04-04 16:44:18,144 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 2020-04-04 16:44:18,345 INFO client.AHSProxy: Connecting to Application History server at /0.0.0.0:10200
> 2020-04-04 16:44:18,402 INFO yarn.Client: Requesting a new application from cluster with 10 NodeManagers
> 2020-04-04 16:44:18,753 INFO conf.Configuration: resource-types.xml not found
> 2020-04-04 16:44:18,754 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
> 2020-04-04 16:44:18,766 INFO yarn.Client: Verifying our application has not requested more than the maximum memory capability of the cluster (7168 MB per container)
> 2020-04-04 16:44:18,767 INFO yarn.Client: Will allocate AM container, with 4505 MB memory including 409 MB overhead
> 2020-04-04 16:44:18,767 INFO yarn.Client: Setting up container launch context for our AM
> 2020-04-04 16:44:18,768 INFO yarn.Client: Setting up the launch environment for our AM container
> 2020-04-04 16:44:18,776 INFO yarn.Client: Preparing resources for our AM container
> 2020-04-04 16:44:18,805 WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
> 2020-04-04 16:44:19,890 INFO yarn.Client: Uploading resource file:/tmp/spark-cfcf1976-612e-4b64-8bf3-5b0c8f1dc6ec/__spark_libs__5444968329971306297.zip -> hdfs://hdfs-master-0.hdfs-service.hdfs:9000/user/root/.sparkStaging/application_1586018216728_0005/__spark_libs__5444968329971306297.zip
> 2020-04-04 16:44:22,689 INFO yarn.Client: Uploading resource file:/hadoop/spark/examples/jars/spark-examples_2.12-3.0.0-preview2.jar -> hdfs://hdfs-master-0.hdfs-service.hdfs:9000/user/root/.sparkStaging/application_1586018216728_0005/spark-examples_2.12-3.0.0-preview2.jar
> 2020-04-04 16:44:22,832 INFO yarn.Client: Uploading resource file:/tmp/spark-cfcf1976-612e-4b64-8bf3-5b0c8f1dc6ec/__spark_conf__2558260056925734476.zip -> hdfs://hdfs-master-0.hdfs-service.hdfs:9000/user/root/.sparkStaging/application_1586018216728_0005/__spark_conf__.zip
> 2020-04-04 16:44:22,886 INFO spark.SecurityManager: Changing view acls to: root
> 2020-04-04 16:44:22,886 INFO spark.SecurityManager: Changing modify acls to: root
> 2020-04-04 16:44:22,886 INFO spark.SecurityManager: Changing view acls groups to:
> 2020-04-04 16:44:22,887 INFO spark.SecurityManager: Changing modify acls groups to:
> 2020-04-04 16:44:22,887 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
> 2020-04-04 16:44:22,927 INFO yarn.Client: Submitting application application_1586018216728_0005 to ResourceManager
> 2020-04-04 16:44:22,963 INFO impl.YarnClientImpl: Submitted application application_1586018216728_0005
> 2020-04-04 16:44:23,967 INFO yarn.Client: Application report for application_1586018216728_0005 (state: ACCEPTED)
> 2020-04-04 16:44:23,969 INFO yarn.Client:
>          client token: N/A
>          diagnostics: AM container is launched, waiting for AM container to Register with RM
>          ApplicationMaster host: N/A
>          ApplicationMaster RPC port: -1
>          queue: default
>          start time: 1586018662937
>          final status: UNDEFINED
>          tracking URL: http://yarn-master-0.yarn-service.yarn-subcluster-a.svc.cluster.local:8088/proxy/application_1586018216728_0005/
>          user: root
> 2020-04-04 16:44:24,972 INFO yarn.Client: Application report for application_1586018216728_0005 (state: ACCEPTED)
> 2020-04-04 16:44:25,974 INFO yarn.Client: Application report for application_1586018216728_0005 (state: ACCEPTED)
> 2020-04-04 16:44:26,977 INFO yarn.Client: Application report for application_1586018216728_0005 (state: ACCEPTED)
> 2020-04-04 16:44:27,980 INFO yarn.Client: Application report for application_1586018216728_0005 (state: ACCEPTED)
> 2020-04-04 16:44:28,983 INFO yarn.Client: Application report for application_1586018216728_0005 (state: ACCEPTED)
> 2020-04-04 16:44:29,985 INFO yarn.Client: Application report for application_1586018216728_0005 (state: ACCEPTED)
> 2020-04-04 16:44:30,988 INFO yarn.Client: Application report for application_1586018216728_0005 (state: ACCEPTED)
> 2020-04-04 16:44:31,991 INFO yarn.Client: Application report for application_1586018216728_0005 (state: ACCEPTED)
> 2020-04-04 16:44:32,994 INFO yarn.Client: Application report for application_1586018216728_0005 (state: ACCEPTED)
> 2020-04-04 16:44:33,996 INFO yarn.Client: Application report for application_1586018216728_0005 (state: FAILED)
> 2020-04-04 16:44:33,997 INFO yarn.Client:
>          client token: N/A
>          diagnostics: Application application_1586018216728_0005 failed 2 times due to AM Container for appattempt_1586018216728_0005_000002 exited with  exitCode: 13
> Failing this attempt.Diagnostics: [2020-04-04 16:44:33.276]Exception from container-launch.
> Container id: container_e27933_1586018216728_0005_02_000001
> Exit code: 13
>
> [2020-04-04 16:44:33.297]Container exited with a non-zero exit code 13. Error file: prelaunch.err.
> Last 4096 bytes of prelaunch.err :
> Last 4096 bytes of stderr :
> ect.Constructor.newInstance(Constructor.java:423)
>  at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
>  at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateIOException(RPCUtil.java:80)
>  at org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:119)
>  at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.registerApplicationMaster(ApplicationMasterProtocolPBClientImpl.java:109)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
>  at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
>  at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
>  at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
>  at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
>  at com.sun.proxy.$Proxy16.registerApplicationMaster(Unknown Source)
>  at org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.registerApplicationMaster(AMRMClientImpl.java:246)
>  at org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.registerApplicationMaster(AMRMClientImpl.java:233)
>  at org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.registerApplicationMaster(AMRMClientImpl.java:213)
>  at org.apache.spark.deploy.yarn.YarnRMClient.register(YarnRMClient.scala:71)
>  at org.apache.spark.deploy.yarn.ApplicationMaster.registerAM(ApplicationMaster.scala:426)
>  at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:504)
>  at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:262)
>  at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:875)
>  at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:874)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:422)
>  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>  at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:874)
>  at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
> Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): Invalid AMRMToken from appattempt_1586018216728_0005_000002
>  at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1511)
>  at org.apache.hadoop.ipc.Client.call(Client.java:1457)
>  at org.apache.hadoop.ipc.Client.call(Client.java:1367)
>  at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
>  at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
>  at com.sun.proxy.$Proxy15.registerApplicationMaster(Unknown Source)
>  at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.registerApplicationMaster(ApplicationMasterProtocolPBClientImpl.java:107)
>  ... 24 more
> )
> 2020-04-04 16:44:32,555 INFO yarn.ApplicationMaster: Deleting staging directory hdfs://hdfs-master-0.hdfs-service.hdfs:9000/user/root/.sparkStaging/application_1586018216728_0005
> 2020-04-04 16:44:32,926 INFO storage.DiskBlockManager: Shutdown hook called
> 2020-04-04 16:44:32,930 INFO util.ShutdownHookManager: Shutdown hook called
> 2020-04-04 16:44:32,930 INFO util.ShutdownHookManager: Deleting directory /opt/hadoop/hadooptmpdata/nm-local-dir/usercache/root/appcache/application_1586018216728_0005/spark-5d3f083f-eb43-49e9-a779-2354e07e9bd7/userFiles-1721c4df-1674-4695-b3aa-02e8c72908c0
> 2020-04-04 16:44:32,932 INFO util.ShutdownHookManager: Deleting directory /opt/hadoop/hadooptmpdata/nm-local-dir/usercache/root/appcache/application_1586018216728_0005/spark-5d3f083f-eb43-49e9-a779-2354e07e9bd7
> {code}
>  
> Submitting this here and not in the YARN Jira because Hadoop MapReduce jobs 
> run normally in the same cluster.
>  


