[
https://issues.apache.org/jira/browse/OOZIE-2887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16741422#comment-16741422
]
duan xiong commented on OOZIE-2887:
-----------------------------------
[~Prabhu Joseph] Hi. When Hadoop HA is enabled, Hadoop can expose one public
name for the cluster, for example hdfs://hadoop.com. When we submit a job, we
can use this value in place of nn1 or nn2, which avoids this problem.
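
For readers unfamiliar with the HA setup mentioned above, here is a minimal sketch of the client-side HDFS properties that define one logical name in front of two NameNodes. The nameservice ID "mycluster" and the hosts nn1.example.com / nn2.example.com are hypothetical placeholders, not values from this issue; the port 8020 is the one seen in the logs quoted below. The class only assembles the key/value pairs and does not talk to Hadoop.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: the nameservice ID and host names passed in are hypothetical.
class HaClientConfigSketch {

    /** Client-side HDFS HA properties for one logical nameservice with two NameNodes. */
    static Map<String, String> haProperties(String nameservice, String nn1Host, String nn2Host) {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("dfs.nameservices", nameservice);
        props.put("dfs.ha.namenodes." + nameservice, "nn1,nn2");
        props.put("dfs.namenode.rpc-address." + nameservice + ".nn1", nn1Host + ":8020");
        props.put("dfs.namenode.rpc-address." + nameservice + ".nn2", nn2Host + ":8020");
        props.put("dfs.client.failover.proxy.provider." + nameservice,
                "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");
        return props;
    }

    /** The URI a job would reference instead of a physical NameNode host. */
    static String logicalUri(String nameservice) {
        return "hdfs://" + nameservice;
    }

    public static void main(String[] args) {
        Map<String, String> props =
                haProperties("mycluster", "nn1.example.com", "nn2.example.com");
        props.forEach((key, value) -> System.out.println(key + "=" + value));
        System.out.println("nameNode=" + logicalUri("mycluster"));
    }
}
```

With settings like these, a job would set its nameNode to the logical URI rather than to nn1 or nn2. Whether that is enough to avoid the hang when a user still types a wrong address is not established here.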
> Oozie Server hangs when there is a user job has wrong namenode address
> -----------------------------------------------------------------------
>
> Key: OOZIE-2887
> URL: https://issues.apache.org/jira/browse/OOZIE-2887
> Project: Oozie
> Issue Type: Bug
> Components: core
> Affects Versions: 4.3.0
> Reporter: Prabhu Joseph
> Priority: Critical
>
> All Oozie jobs go to the PREP state when a user job tries to connect to a
> wrong NameNode address by mistake. The jstack output shows that all the
> threads trying to submit a job are waiting to lock "java.util.ServiceLoader":
> {code}
> "pool-2-thread-19" #47 prio=5 os_prio=0 tid=0x00007f8c08734000 nid=0xb468 waiting for monitor entry [0x00007f8bf207a000]
> java.lang.Thread.State: BLOCKED (on object monitor)
> at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:89)
> - waiting to lock <0x0000000081b29098> (a java.util.ServiceLoader)
> at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:82)
> at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:75)
> at org.apache.hadoop.mapreduce.Job$9.run(Job.java:1260)
> at org.apache.hadoop.mapreduce.Job$9.run(Job.java:1256)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
> at org.apache.hadoop.mapreduce.Job.connect(Job.java:1255)
> - locked <0x0000000082fd6b30> (a org.apache.hadoop.mapreduce.Job)
> at org.apache.hadoop.mapreduce.Job.submit(Job.java:1284)
> at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:575)
> at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:570)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
> at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:570)
> at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:561)
> at org.apache.oozie.action.hadoop.JavaActionExecutor.submitLauncher(JavaActionExecutor.java:1187)
> at org.apache.oozie.action.hadoop.JavaActionExecutor.start(JavaActionExecutor.java:1373)
> at org.apache.oozie.command.wf.ActionStartXCommand.execute(ActionStartXCommand.java:232)
> at org.apache.oozie.command.wf.ActionStartXCommand.execute(ActionStartXCommand.java:63)
> at org.apache.oozie.command.XCommand.call(XCommand.java:287)
> at org.apache.oozie.service.CallableQueueService$CompositeCallable.call(CallableQueueService.java:331)
> at org.apache.oozie.service.CallableQueueService$CompositeCallable.call(CallableQueueService.java:260)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at org.apache.oozie.service.CallableQueueService$CallableWrapper.run(CallableQueueService.java:178)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> The thread that is trying to connect to the wrong NameNode address has
> acquired the lock and keeps retrying the connection to the NameNode forever.
> {code}
> "pool-2-thread-20" #48 prio=5 os_prio=0 tid=0x00007f8c08736000 nid=0xb469 waiting on condition [0x00007f8bf1f78000]
> java.lang.Thread.State: TIMED_WAITING (sleeping)
> at java.lang.Thread.sleep(Native Method)
> at org.apache.hadoop.ipc.Client$Connection.handleConnectionFailure(Client.java:899)
> at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:666)
> - locked <0x0000000083b80360> (a org.apache.hadoop.ipc.Client$Connection)
> at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:745)
> - locked <0x0000000083b80360> (a org.apache.hadoop.ipc.Client$Connection)
> at org.apache.hadoop.ipc.Client$Connection.access$3200(Client.java:397)
> at org.apache.hadoop.ipc.Client.getConnection(Client.java:1618)
> at org.apache.hadoop.ipc.Client.call(Client.java:1449)
> at org.apache.hadoop.ipc.Client.call(Client.java:1396)
> at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
> at com.sun.proxy.$Proxy31.getFileInfo(Unknown Source)
> at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:816)
> at sun.reflect.GeneratedMethodAccessor39.invoke(Unknown Source)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:278)
> at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:194)
> at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:176)
> at com.sun.proxy.$Proxy32.getFileInfo(Unknown Source)
> at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:2158)
> at org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1423)
> at org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1419)
> at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1419)
> at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1443)
> at org.apache.hadoop.yarn.client.api.impl.FileSystemTimelineWriter.<init>(FileSystemTimelineWriter.java:124)
> at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.createTimelineWriter(TimelineClientImpl.java:317)
> at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.serviceStart(TimelineClientImpl.java:309)
> at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
> - locked <0x0000000083b422f8> (a java.lang.Object)
> at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceStart(YarnClientImpl.java:199)
> at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
> - locked <0x0000000083c498f0> (a java.lang.Object)
> at org.apache.hadoop.mapred.ResourceMgrDelegate.serviceStart(ResourceMgrDelegate.java:109)
> at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
> - locked <0x0000000083c49950> (a java.lang.Object)
> at org.apache.hadoop.mapred.ResourceMgrDelegate.<init>(ResourceMgrDelegate.java:98)
> at org.apache.hadoop.mapred.YARNRunner.<init>(YARNRunner.java:112)
> at org.apache.hadoop.mapred.YarnClientProtocolProvider.create(YarnClientProtocolProvider.java:34)
> at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:95)
> - locked <0x0000000081b29098> (a java.util.ServiceLoader)
> at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:82)
> at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:75)
> at org.apache.hadoop.mapred.JobClient.init(JobClient.java:475)
> at org.apache.hadoop.mapred.JobClient.<init>(JobClient.java:454)
> at org.apache.oozie.service.HadoopAccessorService$3.run(HadoopAccessorService.java:526)
> at org.apache.oozie.service.HadoopAccessorService$3.run(HadoopAccessorService.java:524)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
> at org.apache.oozie.service.HadoopAccessorService.createJobClient(HadoopAccessorService.java:524)
> at org.apache.oozie.action.hadoop.JavaActionExecutor.createJobClient(JavaActionExecutor.java:1416)
> at org.apache.oozie.action.hadoop.JavaActionExecutor.submitLauncher(JavaActionExecutor.java:1137)
> at org.apache.oozie.action.hadoop.JavaActionExecutor.start(JavaActionExecutor.java:1373)
> at org.apache.oozie.command.wf.ActionStartXCommand.execute(ActionStartXCommand.java:232)
> at org.apache.oozie.command.wf.ActionStartXCommand.execute(ActionStartXCommand.java:63)
> at org.apache.oozie.command.XCommand.call(XCommand.java:287)
> at org.apache.oozie.service.CallableQueueService$CompositeCallable.call(CallableQueueService.java:331)
> at org.apache.oozie.service.CallableQueueService$CompositeCallable.call(CallableQueueService.java:260)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at org.apache.oozie.service.CallableQueueService$CallableWrapper.run(CallableQueueService.java:178)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}
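
The two traces show one thread holding the monitor on a shared java.util.ServiceLoader (0x0000000081b29098) while it sleeps between connection retries, and another thread BLOCKED on that same monitor. The sketch below reproduces only that locking pattern in isolation, with a plain Object standing in for the ServiceLoader and a latch standing in for the retry loop. All class and method names are made up for illustration; none of this is Oozie or Hadoop code.

```java
import java.util.concurrent.CountDownLatch;

// Sketch only: a plain Object plays the role of the shared ServiceLoader monitor.
class SharedLockStallSketch {
    private static final Object SHARED_LOCK = new Object();

    /** Returns the state of a second thread observed while a first thread holds the shared lock. */
    static Thread.State observeWaiterState() {
        CountDownLatch holderHasLock = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);

        Thread holder = new Thread(() -> {
            synchronized (SHARED_LOCK) {
                holderHasLock.countDown();
                try {
                    release.await(); // stands in for the long-running connect retries
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }, "holder");

        Thread waiter = new Thread(() -> {
            synchronized (SHARED_LOCK) {
                // Reached only after the holder leaves its synchronized block.
            }
        }, "waiter");

        try {
            holder.start();
            holderHasLock.await();
            waiter.start();

            // Poll, with a bounded wait, until the waiter is parked on the monitor.
            long deadline = System.nanoTime() + 5_000_000_000L;
            while (waiter.getState() != Thread.State.BLOCKED && System.nanoTime() < deadline) {
                Thread.sleep(10);
            }
            Thread.State observed = waiter.getState();

            release.countDown(); // let the holder finish so both threads can exit
            holder.join();
            waiter.join();
            return observed;
        } catch (InterruptedException e) {
            release.countDown();
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while observing threads", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("waiter state while lock is held: " + observeWaiterState());
    }
}
```

Every additional thread that needs the same monitor queues up in the same way for as long as the holder stays inside its synchronized block, which matches the picture of all submitting threads waiting behind a single stuck one.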
> The Oozie logs show the job retrying the connection to the wrong NameNode address.
> {code}
> 2017-05-10 05:38:23,194 INFO Client:904 - SERVER[prabhu2] Retrying connect to server: prabhu1/172.26.98.45:8020. Already tried 333 time(s); retry policy is RetryPolicy[MultipleLinearRandomRetry[500x2000ms], TryOnceThenFail]
> 2017-05-10 05:38:24,574 INFO Client:904 - SERVER[prabhu2] Retrying connect to server: prabhu1/172.26.98.45:8020. Already tried 334 time(s); retry policy is RetryPolicy[MultipleLinearRandomRetry[500x2000ms], TryOnceThenFail]
> 2017-05-10 05:38:26,918 INFO Client:904 - SERVER[prabhu2] Retrying connect to server: prabhu1/172.26.98.45:8020. Already tried 335 time(s); retry policy is RetryPolicy[MultipleLinearRandomRetry[500x2000ms], TryOnceThenFail]
> 2017-05-10 05:38:28,659 INFO Client:904 - SERVER[prabhu2] Retrying connect to server: prabhu1/172.26.98.45:8020. Already tried 336 time(s); retry policy is RetryPolicy[MultipleLinearRandomRetry[500x2000ms], TryOnceThenFail]
> {code}
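
To put a rough number on the policy string in these log lines, the sketch below reads MultipleLinearRandomRetry[500x2000ms] as 500 retries spaced about 2000 ms apart and computes how long one pass through that policy would sleep. The jitter range of 0.5x to 1.5x per sleep is an assumption about how Hadoop randomizes the interval; it is consistent with the 1.4 s to 2.3 s gaps between the timestamps above, but it is not stated in this issue. The class and method names are hypothetical.

```java
// Sketch only: back-of-the-envelope arithmetic, not Hadoop's RetryPolicy implementation.
class RetryBudgetSketch {

    /** Nominal total sleep for `retries` attempts spaced `sleepMillis` apart. */
    static long nominalSleepMillis(int retries, long sleepMillis) {
        return retries * sleepMillis;
    }

    /** Lower bound, assuming every sleep is scaled by the smallest jitter factor (0.5). */
    static long minSleepMillis(int retries, long sleepMillis) {
        return nominalSleepMillis(retries, sleepMillis) / 2;
    }

    /** Upper bound, assuming every sleep is scaled by the largest jitter factor (1.5). */
    static long maxSleepMillis(int retries, long sleepMillis) {
        return nominalSleepMillis(retries, sleepMillis) * 3 / 2;
    }

    public static void main(String[] args) {
        int retries = 500;        // from "500x2000ms" in the log
        long sleepMillis = 2000L; // from "500x2000ms" in the log
        System.out.println("nominal: " + nominalSleepMillis(retries, sleepMillis) / 1000 + " s");
        System.out.println("min:     " + minSleepMillis(retries, sleepMillis) / 1000 + " s");
        System.out.println("max:     " + maxSleepMillis(retries, sleepMillis) / 1000 + " s");
    }
}
```

Under these assumptions a single pass keeps the lock for roughly a quarter of an hour (1,000,000 ms nominal). Outer layers such as the RetryInvocationHandler visible in the trace may start another pass, which would be consistent with the report that the retries never seem to end.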
> There should be some way to prevent the Oozie Server from falling into this
> trap when a user supplies wrong details.