[ https://issues.apache.org/jira/browse/OOZIE-2887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16026911#comment-16026911 ]
Venkat Ranganathan commented on OOZIE-2887:
-------------------------------------------

You can use the whitelist configuration parameters to control this. See
oozie.service.HadoopAccessorService.jobTracker.whitelist and
oozie.service.HadoopAccessorService.nameNode.whitelist

> Oozie Server hangs when there is a user job has wrong namenode address
> -----------------------------------------------------------------------
>
>                 Key: OOZIE-2887
>                 URL: https://issues.apache.org/jira/browse/OOZIE-2887
>             Project: Oozie
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 4.3.0
>            Reporter: Prabhu Joseph
>            Priority: Critical
>
> All Oozie jobs go to the PREP state when a user job tries to connect to the
> wrong NameNode address by mistake. The jstack output shows that every thread
> trying to submit a job is waiting to lock "java.util.ServiceLoader".
> {code}
> "pool-2-thread-19" #47 prio=5 os_prio=0 tid=0x00007f8c08734000 nid=0xb468 waiting for monitor entry [0x00007f8bf207a000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>       at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:89)
>       - waiting to lock <0x0000000081b29098> (a java.util.ServiceLoader)
>       at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:82)
>       at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:75)
>       at org.apache.hadoop.mapreduce.Job$9.run(Job.java:1260)
>       at org.apache.hadoop.mapreduce.Job$9.run(Job.java:1256)
>       at java.security.AccessController.doPrivileged(Native Method)
>       at javax.security.auth.Subject.doAs(Subject.java:422)
>       at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
>       at org.apache.hadoop.mapreduce.Job.connect(Job.java:1255)
>       - locked <0x0000000082fd6b30> (a org.apache.hadoop.mapreduce.Job)
>       at org.apache.hadoop.mapreduce.Job.submit(Job.java:1284)
>       at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:575)
>       at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:570)
>       at java.security.AccessController.doPrivileged(Native Method)
>       at javax.security.auth.Subject.doAs(Subject.java:422)
>       at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
>       at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:570)
>       at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:561)
>       at org.apache.oozie.action.hadoop.JavaActionExecutor.submitLauncher(JavaActionExecutor.java:1187)
>       at org.apache.oozie.action.hadoop.JavaActionExecutor.start(JavaActionExecutor.java:1373)
>       at org.apache.oozie.command.wf.ActionStartXCommand.execute(ActionStartXCommand.java:232)
>       at org.apache.oozie.command.wf.ActionStartXCommand.execute(ActionStartXCommand.java:63)
>       at org.apache.oozie.command.XCommand.call(XCommand.java:287)
>       at org.apache.oozie.service.CallableQueueService$CompositeCallable.call(CallableQueueService.java:331)
>       at org.apache.oozie.service.CallableQueueService$CompositeCallable.call(CallableQueueService.java:260)
>       at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>       at org.apache.oozie.service.CallableQueueService$CallableWrapper.run(CallableQueueService.java:178)
>       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>       at java.lang.Thread.run(Thread.java:745)
> {code}
> The thread that tries to connect to the wrong NameNode address has acquired
> the lock and keeps retrying the connection to the NameNode forever.
> {code}
> "pool-2-thread-20" #48 prio=5 os_prio=0 tid=0x00007f8c08736000 nid=0xb469 waiting on condition [0x00007f8bf1f78000]
>    java.lang.Thread.State: TIMED_WAITING (sleeping)
>       at java.lang.Thread.sleep(Native Method)
>       at org.apache.hadoop.ipc.Client$Connection.handleConnectionFailure(Client.java:899)
>       at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:666)
>       - locked <0x0000000083b80360> (a org.apache.hadoop.ipc.Client$Connection)
>       at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:745)
>       - locked <0x0000000083b80360> (a org.apache.hadoop.ipc.Client$Connection)
>       at org.apache.hadoop.ipc.Client$Connection.access$3200(Client.java:397)
>       at org.apache.hadoop.ipc.Client.getConnection(Client.java:1618)
>       at org.apache.hadoop.ipc.Client.call(Client.java:1449)
>       at org.apache.hadoop.ipc.Client.call(Client.java:1396)
>       at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
>       at com.sun.proxy.$Proxy31.getFileInfo(Unknown Source)
>       at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:816)
>       at sun.reflect.GeneratedMethodAccessor39.invoke(Unknown Source)
>       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>       at java.lang.reflect.Method.invoke(Method.java:497)
>       at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:278)
>       at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:194)
>       at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:176)
>       at com.sun.proxy.$Proxy32.getFileInfo(Unknown Source)
>       at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:2158)
>       at org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1423)
>       at org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1419)
>       at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>       at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1419)
>       at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1443)
>       at org.apache.hadoop.yarn.client.api.impl.FileSystemTimelineWriter.<init>(FileSystemTimelineWriter.java:124)
>       at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.createTimelineWriter(TimelineClientImpl.java:317)
>       at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.serviceStart(TimelineClientImpl.java:309)
>       at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>       - locked <0x0000000083b422f8> (a java.lang.Object)
>       at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceStart(YarnClientImpl.java:199)
>       at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>       - locked <0x0000000083c498f0> (a java.lang.Object)
>       at org.apache.hadoop.mapred.ResourceMgrDelegate.serviceStart(ResourceMgrDelegate.java:109)
>       at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>       - locked <0x0000000083c49950> (a java.lang.Object)
>       at org.apache.hadoop.mapred.ResourceMgrDelegate.<init>(ResourceMgrDelegate.java:98)
>       at org.apache.hadoop.mapred.YARNRunner.<init>(YARNRunner.java:112)
>       at org.apache.hadoop.mapred.YarnClientProtocolProvider.create(YarnClientProtocolProvider.java:34)
>       at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:95)
>       - locked <0x0000000081b29098> (a java.util.ServiceLoader)
>       at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:82)
>       at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:75)
>       at org.apache.hadoop.mapred.JobClient.init(JobClient.java:475)
>       at org.apache.hadoop.mapred.JobClient.<init>(JobClient.java:454)
>       at org.apache.oozie.service.HadoopAccessorService$3.run(HadoopAccessorService.java:526)
>       at org.apache.oozie.service.HadoopAccessorService$3.run(HadoopAccessorService.java:524)
>       at java.security.AccessController.doPrivileged(Native Method)
>       at javax.security.auth.Subject.doAs(Subject.java:422)
>       at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
>       at org.apache.oozie.service.HadoopAccessorService.createJobClient(HadoopAccessorService.java:524)
>       at org.apache.oozie.action.hadoop.JavaActionExecutor.createJobClient(JavaActionExecutor.java:1416)
>       at org.apache.oozie.action.hadoop.JavaActionExecutor.submitLauncher(JavaActionExecutor.java:1137)
>       at org.apache.oozie.action.hadoop.JavaActionExecutor.start(JavaActionExecutor.java:1373)
>       at org.apache.oozie.command.wf.ActionStartXCommand.execute(ActionStartXCommand.java:232)
>       at org.apache.oozie.command.wf.ActionStartXCommand.execute(ActionStartXCommand.java:63)
>       at org.apache.oozie.command.XCommand.call(XCommand.java:287)
>       at org.apache.oozie.service.CallableQueueService$CompositeCallable.call(CallableQueueService.java:331)
>       at org.apache.oozie.service.CallableQueueService$CompositeCallable.call(CallableQueueService.java:260)
>       at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>       at org.apache.oozie.service.CallableQueueService$CallableWrapper.run(CallableQueueService.java:178)
>       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>       at java.lang.Thread.run(Thread.java:745)
> {code}
> The Oozie logs show the job retrying the connection to the wrong NameNode
> address.
> {code}
> 2017-05-10 05:38:23,194 INFO Client:904 - SERVER[prabhu2] Retrying connect to server: prabhu1/172.26.98.45:8020. Already tried 333 time(s); retry policy is RetryPolicy[MultipleLinearRandomRetry[500x2000ms], TryOnceThenFail]
> 2017-05-10 05:38:24,574 INFO Client:904 - SERVER[prabhu2] Retrying connect to server: prabhu1/172.26.98.45:8020. Already tried 334 time(s); retry policy is RetryPolicy[MultipleLinearRandomRetry[500x2000ms], TryOnceThenFail]
> 2017-05-10 05:38:26,918 INFO Client:904 - SERVER[prabhu2] Retrying connect to server: prabhu1/172.26.98.45:8020. Already tried 335 time(s); retry policy is RetryPolicy[MultipleLinearRandomRetry[500x2000ms], TryOnceThenFail]
> 2017-05-10 05:38:28,659 INFO Client:904 - SERVER[prabhu2] Retrying connect to server: prabhu1/172.26.98.45:8020. Already tried 336 time(s); retry policy is RetryPolicy[MultipleLinearRandomRetry[500x2000ms], TryOnceThenFail]
> {code}
> There should be some way to prevent the Oozie Server from falling into this
> trap when a user supplies wrong details.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
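
For readers who want to try the whitelist approach suggested in the comment, a minimal oozie-site.xml sketch follows. Only the two property names come from the comment itself. The host names and ports are hypothetical placeholders, and the value format (a comma-separated list of host:port entries) is an assumption that should be checked against the oozie-default.xml shipped with your Oozie version. The intent is that a job naming an address outside the list is rejected up front rather than left to retry the connection.

```xml
<!-- Sketch only: host names and ports below are hypothetical placeholders. -->
<!-- Assumed value format: comma-separated host:port entries. -->
<property>
    <name>oozie.service.HadoopAccessorService.nameNode.whitelist</name>
    <value>namenode1.example.com:8020,namenode2.example.com:8020</value>
</property>
<property>
    <name>oozie.service.HadoopAccessorService.jobTracker.whitelist</name>
    <value>resourcemanager.example.com:8050</value>
</property>
```

The Oozie server most likely needs a restart to pick up changes to oozie-site.xml.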