[ 
https://issues.apache.org/jira/browse/OOZIE-2887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Becker reassigned OOZIE-2887:
------------------------------------

    Assignee:     (was: Daniel Becker)

> Oozie Server hangs when a user job has a wrong namenode address
> ---------------------------------------------------------------
>
>                 Key: OOZIE-2887
>                 URL: https://issues.apache.org/jira/browse/OOZIE-2887
>             Project: Oozie
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 4.3.0
>            Reporter: Prabhu Joseph
>            Priority: Critical
>
> All Oozie jobs go to the PREP state when a user job tries to connect to a
> wrong NameNode address by mistake. The jstack output shows that all the
> threads trying to submit a job are waiting to lock "java.util.ServiceLoader":
> {code}
> "pool-2-thread-19" #47 prio=5 os_prio=0 tid=0x00007f8c08734000 nid=0xb468 waiting for monitor entry [0x00007f8bf207a000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>         at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:89)
>         - waiting to lock <0x0000000081b29098> (a java.util.ServiceLoader)
>         at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:82)
>         at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:75)
>         at org.apache.hadoop.mapreduce.Job$9.run(Job.java:1260)
>         at org.apache.hadoop.mapreduce.Job$9.run(Job.java:1256)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
>         at org.apache.hadoop.mapreduce.Job.connect(Job.java:1255)
>         - locked <0x0000000082fd6b30> (a org.apache.hadoop.mapreduce.Job)
>         at org.apache.hadoop.mapreduce.Job.submit(Job.java:1284)
>         at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:575)
>         at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:570)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
>         at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:570)
>         at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:561)
>         at org.apache.oozie.action.hadoop.JavaActionExecutor.submitLauncher(JavaActionExecutor.java:1187)
>         at org.apache.oozie.action.hadoop.JavaActionExecutor.start(JavaActionExecutor.java:1373)
>         at org.apache.oozie.command.wf.ActionStartXCommand.execute(ActionStartXCommand.java:232)
>         at org.apache.oozie.command.wf.ActionStartXCommand.execute(ActionStartXCommand.java:63)
>         at org.apache.oozie.command.XCommand.call(XCommand.java:287)
>         at org.apache.oozie.service.CallableQueueService$CompositeCallable.call(CallableQueueService.java:331)
>         at org.apache.oozie.service.CallableQueueService$CompositeCallable.call(CallableQueueService.java:260)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at org.apache.oozie.service.CallableQueueService$CallableWrapper.run(CallableQueueService.java:178)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
> {code}
> The thread that tries to connect to the wrong NameNode address has acquired
> the lock and keeps retrying the connection to the NameNode forever.
> {code}
> "pool-2-thread-20" #48 prio=5 os_prio=0 tid=0x00007f8c08736000 nid=0xb469 waiting on condition [0x00007f8bf1f78000]
>    java.lang.Thread.State: TIMED_WAITING (sleeping)
>         at java.lang.Thread.sleep(Native Method)
>         at org.apache.hadoop.ipc.Client$Connection.handleConnectionFailure(Client.java:899)
>         at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:666)
>         - locked <0x0000000083b80360> (a org.apache.hadoop.ipc.Client$Connection)
>         at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:745)
>         - locked <0x0000000083b80360> (a org.apache.hadoop.ipc.Client$Connection)
>         at org.apache.hadoop.ipc.Client$Connection.access$3200(Client.java:397)
>         at org.apache.hadoop.ipc.Client.getConnection(Client.java:1618)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1449)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1396)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
>         at com.sun.proxy.$Proxy31.getFileInfo(Unknown Source)
>         at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:816)
>         at sun.reflect.GeneratedMethodAccessor39.invoke(Unknown Source)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:497)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:278)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:194)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:176)
>         at com.sun.proxy.$Proxy32.getFileInfo(Unknown Source)
>         at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:2158)
>         at org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1423)
>         at org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1419)
>         at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>         at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1419)
>         at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1443)
>         at org.apache.hadoop.yarn.client.api.impl.FileSystemTimelineWriter.<init>(FileSystemTimelineWriter.java:124)
>         at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.createTimelineWriter(TimelineClientImpl.java:317)
>         at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.serviceStart(TimelineClientImpl.java:309)
>         at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>         - locked <0x0000000083b422f8> (a java.lang.Object)
>         at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceStart(YarnClientImpl.java:199)
>         at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>         - locked <0x0000000083c498f0> (a java.lang.Object)
>         at org.apache.hadoop.mapred.ResourceMgrDelegate.serviceStart(ResourceMgrDelegate.java:109)
>         at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>         - locked <0x0000000083c49950> (a java.lang.Object)
>         at org.apache.hadoop.mapred.ResourceMgrDelegate.<init>(ResourceMgrDelegate.java:98)
>         at org.apache.hadoop.mapred.YARNRunner.<init>(YARNRunner.java:112)
>         at org.apache.hadoop.mapred.YarnClientProtocolProvider.create(YarnClientProtocolProvider.java:34)
>         at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:95)
>         - locked <0x0000000081b29098> (a java.util.ServiceLoader)
>         at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:82)
>         at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:75)
>         at org.apache.hadoop.mapred.JobClient.init(JobClient.java:475)
>         at org.apache.hadoop.mapred.JobClient.<init>(JobClient.java:454)
>         at org.apache.oozie.service.HadoopAccessorService$3.run(HadoopAccessorService.java:526)
>         at org.apache.oozie.service.HadoopAccessorService$3.run(HadoopAccessorService.java:524)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
>         at org.apache.oozie.service.HadoopAccessorService.createJobClient(HadoopAccessorService.java:524)
>         at org.apache.oozie.action.hadoop.JavaActionExecutor.createJobClient(JavaActionExecutor.java:1416)
>         at org.apache.oozie.action.hadoop.JavaActionExecutor.submitLauncher(JavaActionExecutor.java:1137)
>         at org.apache.oozie.action.hadoop.JavaActionExecutor.start(JavaActionExecutor.java:1373)
>         at org.apache.oozie.command.wf.ActionStartXCommand.execute(ActionStartXCommand.java:232)
>         at org.apache.oozie.command.wf.ActionStartXCommand.execute(ActionStartXCommand.java:63)
>         at org.apache.oozie.command.XCommand.call(XCommand.java:287)
>         at org.apache.oozie.service.CallableQueueService$CompositeCallable.call(CallableQueueService.java:331)
>         at org.apache.oozie.service.CallableQueueService$CompositeCallable.call(CallableQueueService.java:260)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at org.apache.oozie.service.CallableQueueService$CallableWrapper.run(CallableQueueService.java:178)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
> {code}
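
The two traces show a single monitor (0x0000000081b29098, a java.util.ServiceLoader) held by one thread that is sleeping in a retry loop while another thread is BLOCKED trying to enter it. A minimal sketch of that pattern follows. It is not Hadoop or Oozie code: the class SharedMonitorDemo, the SHARED_LOADER object and the latch standing in for the retry loop are all hypothetical names chosen for illustration.

```java
import java.util.concurrent.CountDownLatch;

// Hypothetical stand-in for the pattern in the traces; not Hadoop or Oozie code.
class SharedMonitorDemo {
    // Plays the role of the single shared java.util.ServiceLoader monitor.
    private static final Object SHARED_LOADER = new Object();

    /** Returns the state observed for a second thread while the first holds the monitor. */
    static Thread.State observeWaiterState() {
        CountDownLatch holderHasLock = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);

        Thread holder = new Thread(() -> {
            synchronized (SHARED_LOADER) {
                holderHasLock.countDown();
                try {
                    release.await(); // stands in for the long connect-retry loop
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }, "holder");

        Thread waiter = new Thread(() -> {
            synchronized (SHARED_LOADER) {
                // A second job submission would run here once the monitor is free.
            }
        }, "waiter");

        try {
            holder.start();
            holderHasLock.await();   // make sure the holder owns the monitor first
            waiter.start();
            Thread.State state = waiter.getState();
            while (state != Thread.State.BLOCKED) {
                Thread.sleep(10);    // poll until the waiter reaches the monitor
                state = waiter.getState();
            }
            release.countDown();     // let the holder finish so both threads exit
            holder.join();
            waiter.join();
            return state;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            release.countDown();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("waiter state: " + observeWaiterState());
    }
}
```

Every additional thread that needs the same monitor ends up in the same state as the waiter for as long as the holder keeps retrying, which matches the picture of all submitting threads being stuck at once.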
> The Oozie logs show the job retrying the connection to the wrong NameNode address.
> {code}
> 2017-05-10 05:38:23,194  INFO Client:904 - SERVER[prabhu2] Retrying connect to server: prabhu1/172.26.98.45:8020. Already tried 333 time(s); retry policy is RetryPolicy[MultipleLinearRandomRetry[500x2000ms], TryOnceThenFail]
> 2017-05-10 05:38:24,574  INFO Client:904 - SERVER[prabhu2] Retrying connect to server: prabhu1/172.26.98.45:8020. Already tried 334 time(s); retry policy is RetryPolicy[MultipleLinearRandomRetry[500x2000ms], TryOnceThenFail]
> 2017-05-10 05:38:26,918  INFO Client:904 - SERVER[prabhu2] Retrying connect to server: prabhu1/172.26.98.45:8020. Already tried 335 time(s); retry policy is RetryPolicy[MultipleLinearRandomRetry[500x2000ms], TryOnceThenFail]
> 2017-05-10 05:38:28,659  INFO Client:904 - SERVER[prabhu2] Retrying connect to server: prabhu1/172.26.98.45:8020. Already tried 336 time(s); retry policy is RetryPolicy[MultipleLinearRandomRetry[500x2000ms], TryOnceThenFail]
> {code}
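
To get a feel for how long one pass of this policy can keep the lock, the logged text can be turned into a rough time budget. The sketch below assumes that "500x2000ms" means up to 500 retries with a nominal 2000 ms sleep each; that reading is mine, and the 1.4 to 2.3 second gaps between the log lines suggest the actual sleeps vary around the nominal value. RetryBudget and nominalMillis are hypothetical helper names, not Hadoop APIs.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; it only parses the text printed in the log line.
class RetryBudget {
    // Matches pairs such as "500x2000ms", read here as <retries>x<sleep millis>.
    private static final Pattern PAIR = Pattern.compile("(\\d+)x(\\d+)ms");

    /** Sums retries * sleep over every pair found in the logged policy text. */
    static long nominalMillis(String policyText) {
        Matcher m = PAIR.matcher(policyText);
        long total = 0;
        while (m.find()) {
            total += Long.parseLong(m.group(1)) * Long.parseLong(m.group(2));
        }
        return total;
    }

    public static void main(String[] args) {
        long ms = nominalMillis("RetryPolicy[MultipleLinearRandomRetry[500x2000ms], TryOnceThenFail]");
        System.out.println(ms + " ms, about " + ms / 60000 + " minutes");
    }
}
```

Under that assumption a single pass is on the order of 1,000,000 ms, and outer retry layers such as the RetryInvocationHandler frames in the trace may repeat it, so from the point of view of the other submitting threads the lock is effectively never released.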
> There should be some way to prevent the Oozie Server from falling into this
> trap when a user supplies wrong details.
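
One generic way to keep a single bad call from stalling its caller is to run it on a helper thread and give up after a deadline. The sketch below shows that idea in plain Java. It is not a proposed Oozie patch: BoundedCall and callWithTimeout are hypothetical names, and the approach only frees resources if the blocked code responds to interruption, which would need to be checked against the actual Hadoop client code.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper illustrating a time-bounded call; not part of Oozie.
class BoundedCall {
    /** Runs task on a helper thread and returns fallback if it does not finish in time. */
    static <T> T callWithTimeout(Callable<T> task, long timeoutMillis, T fallback) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Future<T> future = pool.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupts the helper; useful only if the task honors interrupts
            return fallback;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            future.cancel(true);
            return fallback;
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdownNow();  // never leave the helper thread behind
        }
    }

    public static void main(String[] args) {
        // The sleep stands in for a connection attempt that keeps retrying.
        String slow = callWithTimeout(() -> { Thread.sleep(60_000); return "connected"; }, 200, "timed out");
        String fast = callWithTimeout(() -> "connected", 200, "timed out");
        System.out.println(slow + " / " + fast);
    }
}
```

A wrapper like this bounds how long the caller waits, but it does not by itself shrink the section that holds the shared lock. Lowering the client retry settings, or validating the NameNode address before taking the lock, would address the cause more directly.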



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
