[ 
https://issues.apache.org/jira/browse/OOZIE-2887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16112427#comment-16112427
 ] 

Prabhu Joseph commented on OOZIE-2887:
--------------------------------------

The issue also occurs when job.properties has the correct NameNode address 
but the nn1 NameNode machine is down. The whitelist configuration does not 
help here.

{code}
Repro:

NameNode HA - nn1, nn2 
Shut down nn1
yarn.timeline-service.enabled = true
Now all Oozie jobs go to PREP: one thread keeps retrying to connect to the 
nn1 node while the other threads wait to lock the object.
{code}
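
The stall pattern in the repro can be reproduced in isolation. The following 
is only a minimal sketch, not Oozie or Hadoop code: the class name 
SharedLockStall, the SHARED_LOCK object (standing in for the shared 
java.util.ServiceLoader from the jstack output) and the latch that stands in 
for the connect-retry loop are all hypothetical names chosen for 
illustration. It shows that while one thread waits inside a synchronized 
block, every other thread that needs the same monitor stays BLOCKED.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;

public class SharedLockStall {
    // Hypothetical stand-in for the shared java.util.ServiceLoader instance.
    private static final Object SHARED_LOCK = new Object();

    /** Starts one lock holder and `waiters` contenders; returns how many contenders were BLOCKED. */
    public static int countBlockedWaiters(int waiters) throws InterruptedException {
        CountDownLatch lockHeld = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);

        Thread holder = new Thread(() -> {
            synchronized (SHARED_LOCK) {
                lockHeld.countDown();
                try {
                    // Assumption: this wait stands in for the connect-retry loop.
                    release.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }, "holder");
        holder.start();
        lockHeld.await(); // make sure the holder owns the monitor first

        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < waiters; i++) {
            Thread t = new Thread(() -> {
                synchronized (SHARED_LOCK) {
                    // A real submitter would create its own client here.
                }
            }, "waiter-" + i);
            t.start();
            threads.add(t);
        }

        // Poll (for at most 5 seconds) until every contender reports BLOCKED.
        long deadline = System.currentTimeMillis() + 5000;
        int blocked = 0;
        while (System.currentTimeMillis() < deadline) {
            blocked = 0;
            for (Thread t : threads) {
                if (t.getState() == Thread.State.BLOCKED) {
                    blocked++;
                }
            }
            if (blocked == waiters) {
                break;
            }
            Thread.sleep(10);
        }

        // Let the holder leave the synchronized block so all threads can finish.
        release.countDown();
        holder.join();
        for (Thread t : threads) {
            t.join();
        }
        return blocked;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("blocked waiters: " + countBlockedWaiters(3));
    }
}
```

In this sketch the contenders only make progress once the holder leaves the 
synchronized block, which matches what the other Oozie threads experience 
while the first thread keeps retrying nn1.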


> Oozie Server hangs when there is a user job has wrong namenode address 
> -----------------------------------------------------------------------
>
>                 Key: OOZIE-2887
>                 URL: https://issues.apache.org/jira/browse/OOZIE-2887
>             Project: Oozie
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 4.3.0
>            Reporter: Prabhu Joseph
>            Priority: Critical
>
> All Oozie jobs go to the PREP state when a user job mistakenly tries to 
> connect to the wrong NameNode address. In the jstack output, all the 
> threads that try to submit a job are waiting to lock a 
> "java.util.ServiceLoader":
> {code}
> "pool-2-thread-19" #47 prio=5 os_prio=0 tid=0x00007f8c08734000 nid=0xb468 waiting for monitor entry [0x00007f8bf207a000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>         at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:89)
>         - waiting to lock <0x0000000081b29098> (a java.util.ServiceLoader)
>         at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:82)
>         at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:75)
>         at org.apache.hadoop.mapreduce.Job$9.run(Job.java:1260)
>         at org.apache.hadoop.mapreduce.Job$9.run(Job.java:1256)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
>         at org.apache.hadoop.mapreduce.Job.connect(Job.java:1255)
>         - locked <0x0000000082fd6b30> (a org.apache.hadoop.mapreduce.Job)
>         at org.apache.hadoop.mapreduce.Job.submit(Job.java:1284)
>         at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:575)
>         at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:570)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
>         at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:570)
>         at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:561)
>         at org.apache.oozie.action.hadoop.JavaActionExecutor.submitLauncher(JavaActionExecutor.java:1187)
>         at org.apache.oozie.action.hadoop.JavaActionExecutor.start(JavaActionExecutor.java:1373)
>         at org.apache.oozie.command.wf.ActionStartXCommand.execute(ActionStartXCommand.java:232)
>         at org.apache.oozie.command.wf.ActionStartXCommand.execute(ActionStartXCommand.java:63)
>         at org.apache.oozie.command.XCommand.call(XCommand.java:287)
>         at org.apache.oozie.service.CallableQueueService$CompositeCallable.call(CallableQueueService.java:331)
>         at org.apache.oozie.service.CallableQueueService$CompositeCallable.call(CallableQueueService.java:260)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at org.apache.oozie.service.CallableQueueService$CallableWrapper.run(CallableQueueService.java:178)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
> {code}
> The thread that is trying to connect to the wrong NameNode address has 
> acquired the lock and keeps retrying the connection to the NameNode 
> indefinitely:
> {code}
> "pool-2-thread-20" #48 prio=5 os_prio=0 tid=0x00007f8c08736000 nid=0xb469 waiting on condition [0x00007f8bf1f78000]
>    java.lang.Thread.State: TIMED_WAITING (sleeping)
>         at java.lang.Thread.sleep(Native Method)
>         at org.apache.hadoop.ipc.Client$Connection.handleConnectionFailure(Client.java:899)
>         at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:666)
>         - locked <0x0000000083b80360> (a org.apache.hadoop.ipc.Client$Connection)
>         at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:745)
>         - locked <0x0000000083b80360> (a org.apache.hadoop.ipc.Client$Connection)
>         at org.apache.hadoop.ipc.Client$Connection.access$3200(Client.java:397)
>         at org.apache.hadoop.ipc.Client.getConnection(Client.java:1618)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1449)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1396)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
>         at com.sun.proxy.$Proxy31.getFileInfo(Unknown Source)
>         at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:816)
>         at sun.reflect.GeneratedMethodAccessor39.invoke(Unknown Source)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:497)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:278)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:194)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:176)
>         at com.sun.proxy.$Proxy32.getFileInfo(Unknown Source)
>         at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:2158)
>         at org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1423)
>         at org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1419)
>         at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>         at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1419)
>         at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1443)
>         at org.apache.hadoop.yarn.client.api.impl.FileSystemTimelineWriter.<init>(FileSystemTimelineWriter.java:124)
>         at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.createTimelineWriter(TimelineClientImpl.java:317)
>         at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.serviceStart(TimelineClientImpl.java:309)
>         at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>         - locked <0x0000000083b422f8> (a java.lang.Object)
>         at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceStart(YarnClientImpl.java:199)
>         at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>         - locked <0x0000000083c498f0> (a java.lang.Object)
>         at org.apache.hadoop.mapred.ResourceMgrDelegate.serviceStart(ResourceMgrDelegate.java:109)
>         at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>         - locked <0x0000000083c49950> (a java.lang.Object)
>         at org.apache.hadoop.mapred.ResourceMgrDelegate.<init>(ResourceMgrDelegate.java:98)
>         at org.apache.hadoop.mapred.YARNRunner.<init>(YARNRunner.java:112)
>         at org.apache.hadoop.mapred.YarnClientProtocolProvider.create(YarnClientProtocolProvider.java:34)
>         at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:95)
>         - locked <0x0000000081b29098> (a java.util.ServiceLoader)
>         at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:82)
>         at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:75)
>         at org.apache.hadoop.mapred.JobClient.init(JobClient.java:475)
>         at org.apache.hadoop.mapred.JobClient.<init>(JobClient.java:454)
>         at org.apache.oozie.service.HadoopAccessorService$3.run(HadoopAccessorService.java:526)
>         at org.apache.oozie.service.HadoopAccessorService$3.run(HadoopAccessorService.java:524)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
>         at org.apache.oozie.service.HadoopAccessorService.createJobClient(HadoopAccessorService.java:524)
>         at org.apache.oozie.action.hadoop.JavaActionExecutor.createJobClient(JavaActionExecutor.java:1416)
>         at org.apache.oozie.action.hadoop.JavaActionExecutor.submitLauncher(JavaActionExecutor.java:1137)
>         at org.apache.oozie.action.hadoop.JavaActionExecutor.start(JavaActionExecutor.java:1373)
>         at org.apache.oozie.command.wf.ActionStartXCommand.execute(ActionStartXCommand.java:232)
>         at org.apache.oozie.command.wf.ActionStartXCommand.execute(ActionStartXCommand.java:63)
>         at org.apache.oozie.command.XCommand.call(XCommand.java:287)
>         at org.apache.oozie.service.CallableQueueService$CompositeCallable.call(CallableQueueService.java:331)
>         at org.apache.oozie.service.CallableQueueService$CompositeCallable.call(CallableQueueService.java:260)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at org.apache.oozie.service.CallableQueueService$CallableWrapper.run(CallableQueueService.java:178)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
> {code}
> The Oozie logs show the job retrying the connection to the wrong NameNode 
> address:
> {code}
> 2017-05-10 05:38:23,194  INFO Client:904 - SERVER[prabhu2] Retrying connect to server: prabhu1/172.26.98.45:8020. Already tried 333 time(s); retry policy is RetryPolicy[MultipleLinearRandomRetry[500x2000ms], TryOnceThenFail]
> 2017-05-10 05:38:24,574  INFO Client:904 - SERVER[prabhu2] Retrying connect to server: prabhu1/172.26.98.45:8020. Already tried 334 time(s); retry policy is RetryPolicy[MultipleLinearRandomRetry[500x2000ms], TryOnceThenFail]
> 2017-05-10 05:38:26,918  INFO Client:904 - SERVER[prabhu2] Retrying connect to server: prabhu1/172.26.98.45:8020. Already tried 335 time(s); retry policy is RetryPolicy[MultipleLinearRandomRetry[500x2000ms], TryOnceThenFail]
> 2017-05-10 05:38:28,659  INFO Client:904 - SERVER[prabhu2] Retrying connect to server: prabhu1/172.26.98.45:8020. Already tried 336 time(s); retry policy is RetryPolicy[MultipleLinearRandomRetry[500x2000ms], TryOnceThenFail]
> {code}
> There should be some way to prevent the Oozie Server from falling into this 
> trap when a user supplies wrong details.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
