[jira] [Commented] (OOZIE-2887) Oozie Server hangs when there is a user job has wrong namenode address
[ https://issues.apache.org/jira/browse/OOZIE-2887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16741422#comment-16741422 ]

duan xiong commented on OOZIE-2887:
---

[~Prabhu Joseph] Hi. When Hadoop HA is in use, Hadoop can expose a single logical name for the NameNode pair, for example hdfs://hadoop.com. Submitting the job with this value in place of nn1 or nn2 avoids the problem.

> Oozie Server hangs when there is a user job has wrong namenode address
> ---
>
>                 Key: OOZIE-2887
>                 URL: https://issues.apache.org/jira/browse/OOZIE-2887
>             Project: Oozie
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 4.3.0
>            Reporter: Prabhu Joseph
>            Priority: Critical
>
> All Oozie jobs go to the PREP state when a user job tries to connect to a
> wrong NameNode address by mistake. In the jstack output, all the threads
> that try to submit a job are waiting to lock a "java.util.ServiceLoader".
> {code}
> "pool-2-thread-19" #47 prio=5 os_prio=0 tid=0x7f8c08734000 nid=0xb468 waiting for monitor entry [0x7f8bf207a000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>         at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:89)
>         - waiting to lock <0x81b29098> (a java.util.ServiceLoader)
>         at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:82)
>         at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:75)
>         at org.apache.hadoop.mapreduce.Job$9.run(Job.java:1260)
>         at org.apache.hadoop.mapreduce.Job$9.run(Job.java:1256)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
>         at org.apache.hadoop.mapreduce.Job.connect(Job.java:1255)
>         - locked <0x82fd6b30> (a org.apache.hadoop.mapreduce.Job)
>         at org.apache.hadoop.mapreduce.Job.submit(Job.java:1284)
>         at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:575)
>         at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:570)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
>         at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:570)
>         at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:561)
>         at org.apache.oozie.action.hadoop.JavaActionExecutor.submitLauncher(JavaActionExecutor.java:1187)
>         at org.apache.oozie.action.hadoop.JavaActionExecutor.start(JavaActionExecutor.java:1373)
>         at org.apache.oozie.command.wf.ActionStartXCommand.execute(ActionStartXCommand.java:232)
>         at org.apache.oozie.command.wf.ActionStartXCommand.execute(ActionStartXCommand.java:63)
>         at org.apache.oozie.command.XCommand.call(XCommand.java:287)
>         at org.apache.oozie.service.CallableQueueService$CompositeCallable.call(CallableQueueService.java:331)
>         at org.apache.oozie.service.CallableQueueService$CompositeCallable.call(CallableQueueService.java:260)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at org.apache.oozie.service.CallableQueueService$CallableWrapper.run(CallableQueueService.java:178)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
> {code}
> The thread that tries to connect to the wrong NameNode address has
> acquired the lock and keeps retrying the connection to the NameNode forever.
> {code}
> "pool-2-thread-20" #48 prio=5 os_prio=0 tid=0x7f8c08736000 nid=0xb469 waiting on condition [0x7f8bf1f78000]
>    java.lang.Thread.State: TIMED_WAITING (sleeping)
>         at java.lang.Thread.sleep(Native Method)
>         at org.apache.hadoop.ipc.Client$Connection.handleConnectionFailure(Client.java:899)
>         at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:666)
>         - locked <0x83b80360> (a org.apache.hadoop.ipc.Client$Connection)
>         at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:745)
>         - locked <0x83b80360> (a org.apache.hadoop.ipc.Client$Connection)
>         at org.apache.hadoop.ipc.Client$Connection.access$3200(Client.java:397)
>         at org.apache.hadoop.ipc.Client.getConnection(Client.java:1618)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1449)
>         at
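
The logical-name suggestion in the comment above depends on the HDFS client-side HA settings. As a rough sketch only, the Java snippet below lists the client properties usually involved, as plain key/value pairs. The nameservice ID `mycluster` (playing the role of `hadoop.com` in the comment), the host names, port 8020 and the class `HaClientConfigSketch` are illustrative assumptions; in a real deployment these keys live in hdfs-site.xml, not in code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class HaClientConfigSketch {

    // Builds the client-side HA properties for one logical nameservice.
    // The NameNode IDs "nn1" and "nn2" follow the naming used in the issue.
    static Map<String, String> haClientProperties(String nameservice, String nn1Addr, String nn2Addr) {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("dfs.nameservices", nameservice);
        props.put("dfs.ha.namenodes." + nameservice, "nn1,nn2");
        props.put("dfs.namenode.rpc-address." + nameservice + ".nn1", nn1Addr);
        props.put("dfs.namenode.rpc-address." + nameservice + ".nn2", nn2Addr);
        props.put("dfs.client.failover.proxy.provider." + nameservice,
                "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");
        return props;
    }

    // The value a job.properties file would use for nameNode: no host, no port.
    static String logicalNameNodeUri(String nameservice) {
        return "hdfs://" + nameservice;
    }

    public static void main(String[] args) {
        // Hypothetical hosts; replace with the real NameNode RPC addresses.
        Map<String, String> props =
                haClientProperties("mycluster", "nn1.example.com:8020", "nn2.example.com:8020");
        props.forEach((k, v) -> System.out.println(k + "=" + v));
        System.out.println("nameNode=" + logicalNameNodeUri("mycluster"));
    }
}
```

With settings like these the job refers to `hdfs://mycluster` and the HDFS client picks the active NameNode, so a job no longer names nn1 or nn2 directly.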
[jira] [Commented] (OOZIE-2887) Oozie Server hangs when there is a user job has wrong namenode address
[ https://issues.apache.org/jira/browse/OOZIE-2887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16112427#comment-16112427 ]

Prabhu Joseph commented on OOZIE-2887:
---

The issue also occurs when job.properties has the correct NameNode address but the NameNode nn1 machine is down. The whitelist configuration does not help in this case.

{code}
Repro:
NameNode HA - nn1, nn2
Shutdown nn1
yarn.timeline.service.enabled true

Now all oozie jobs will go to PREP where one thread will keep on retrying to connect to nn1 node and other threads waiting to lock the object.
{code}
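
The blocking pattern described in this thread (one thread holding a monitor while it retries, the others stuck waiting for the same monitor) can be reproduced in isolation. The following is a minimal sketch using only the JDK, not Oozie or Hadoop code: `SHARED_LOCK` stands in for the `java.util.ServiceLoader` instance from the thread dump, and the class and thread names are made up for illustration.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

class LockContentionDemo {

    // Stand-in for the shared java.util.ServiceLoader monitor in the thread dump.
    private static final Object SHARED_LOCK = new Object();

    // Starts a holder thread that keeps the lock and a waiter thread that needs it,
    // then reports the waiter's state while the lock is still held.
    static Thread.State observeWaiterState() {
        CountDownLatch lockHeld = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);

        Thread holder = new Thread(() -> {
            synchronized (SHARED_LOCK) {
                lockHeld.countDown();
                try {
                    release.await(); // stands in for the endless connect/sleep retry loop
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }, "holder");

        Thread waiter = new Thread(() -> {
            synchronized (SHARED_LOCK) {
                // a job submission would happen here
            }
        }, "waiter");

        try {
            holder.start();
            lockHeld.await();          // make sure the holder owns the lock first
            waiter.start();
            Thread.State observed;
            try {
                long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(5);
                while (waiter.getState() != Thread.State.BLOCKED && System.nanoTime() < deadline) {
                    Thread.sleep(10);
                }
                observed = waiter.getState();
            } finally {
                release.countDown();   // always let the holder finish
            }
            holder.join();
            waiter.join();
            return observed;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while observing threads", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("waiter state while the lock is held: " + observeWaiterState());
    }
}
```

While the holder keeps the monitor, the waiter is reported as BLOCKED, the same state the submitting threads show in the jstack output. Every additional submitter would queue up behind the same monitor, which is why all jobs stay in PREP.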
[jira] [Commented] (OOZIE-2887) Oozie Server hangs when there is a user job has wrong namenode address
[ https://issues.apache.org/jira/browse/OOZIE-2887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16089526#comment-16089526 ]

Daniel Becker commented on OOZIE-2887:
---

I couldn't reproduce it on a pseudo-cluster; the issue might be in the Hadoop retry logic.

> Oozie Server hangs when there is a user job has wrong namenode address
> ---
>
>                 Key: OOZIE-2887
>                 URL: https://issues.apache.org/jira/browse/OOZIE-2887
>             Project: Oozie
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 4.3.0
>            Reporter: Prabhu Joseph
>            Priority: Critical
>
> All Oozie jobs go to the PREP state when a user job tries to connect to a
> wrong NameNode address by mistake. In the jstack output, all the threads
> that try to submit a job are waiting to lock a "java.util.ServiceLoader".
> {code}
> "pool-2-thread-19" #47 prio=5 os_prio=0 tid=0x7f8c08734000 nid=0xb468 waiting for monitor entry [0x7f8bf207a000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>         at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:89)
>         - waiting to lock <0x81b29098> (a java.util.ServiceLoader)
>         at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:82)
>         at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:75)
>         at org.apache.hadoop.mapreduce.Job$9.run(Job.java:1260)
>         at org.apache.hadoop.mapreduce.Job$9.run(Job.java:1256)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
>         at org.apache.hadoop.mapreduce.Job.connect(Job.java:1255)
>         - locked <0x82fd6b30> (a org.apache.hadoop.mapreduce.Job)
>         at org.apache.hadoop.mapreduce.Job.submit(Job.java:1284)
>         at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:575)
>         at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:570)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
>         at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:570)
>         at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:561)
>         at org.apache.oozie.action.hadoop.JavaActionExecutor.submitLauncher(JavaActionExecutor.java:1187)
>         at org.apache.oozie.action.hadoop.JavaActionExecutor.start(JavaActionExecutor.java:1373)
>         at org.apache.oozie.command.wf.ActionStartXCommand.execute(ActionStartXCommand.java:232)
>         at org.apache.oozie.command.wf.ActionStartXCommand.execute(ActionStartXCommand.java:63)
>         at org.apache.oozie.command.XCommand.call(XCommand.java:287)
>         at org.apache.oozie.service.CallableQueueService$CompositeCallable.call(CallableQueueService.java:331)
>         at org.apache.oozie.service.CallableQueueService$CompositeCallable.call(CallableQueueService.java:260)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at org.apache.oozie.service.CallableQueueService$CallableWrapper.run(CallableQueueService.java:178)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
> {code}
> The thread that tries to connect to the wrong NameNode address has
> acquired the lock and keeps retrying the connection to the NameNode forever.
> {code}
> "pool-2-thread-20" #48 prio=5 os_prio=0 tid=0x7f8c08736000 nid=0xb469 waiting on condition [0x7f8bf1f78000]
>    java.lang.Thread.State: TIMED_WAITING (sleeping)
>         at java.lang.Thread.sleep(Native Method)
>         at org.apache.hadoop.ipc.Client$Connection.handleConnectionFailure(Client.java:899)
>         at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:666)
>         - locked <0x83b80360> (a org.apache.hadoop.ipc.Client$Connection)
>         at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:745)
>         - locked <0x83b80360> (a org.apache.hadoop.ipc.Client$Connection)
>         at org.apache.hadoop.ipc.Client$Connection.access$3200(Client.java:397)
>         at org.apache.hadoop.ipc.Client.getConnection(Client.java:1618)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1449)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1396)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
>         at
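
Since the retry behaviour is the suspected cause, it may help to see what a bounded retry looks like next to the "retrying forever" behaviour reported in the issue. The sketch below is a generic JDK-only illustration and is not Hadoop's IPC client code; the class `BoundedRetrySketch`, its method and its parameters are hypothetical.

```java
import java.util.function.Supplier;

class BoundedRetrySketch {

    // Runs the attempt up to maxAttempts times, sleeping between failures,
    // and gives up with an exception instead of retrying indefinitely.
    static <T> T callWithRetries(Supplier<T> attempt, int maxAttempts, long sleepMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        RuntimeException last = null;
        for (int i = 1; i <= maxAttempts; i++) {
            try {
                return attempt.get();
            } catch (RuntimeException e) {
                last = e;
                if (i < maxAttempts) {
                    try {
                        Thread.sleep(sleepMillis);
                    } catch (InterruptedException ie) {
                        Thread.currentThread().interrupt();
                        throw new IllegalStateException("interrupted while retrying", ie);
                    }
                }
            }
        }
        throw new IllegalStateException("gave up after " + maxAttempts + " attempts", last);
    }

    public static void main(String[] args) {
        try {
            // Simulated connection attempt that never succeeds.
            BoundedRetrySketch.<String>callWithRetries(() -> {
                throw new RuntimeException("connection refused");
            }, 3, 100L);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage() + ", cause: " + e.getCause().getMessage());
        }
    }
}
```

A caller that gives up after a fixed number of attempts would release any lock it holds, so other submitting threads could proceed. Whether and how the Hadoop client can be configured to behave this way is not established in this thread.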
[jira] [Commented] (OOZIE-2887) Oozie Server hangs when there is a user job has wrong namenode address
[ https://issues.apache.org/jira/browse/OOZIE-2887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16026911#comment-16026911 ]

Venkat Ranganathan commented on OOZIE-2887:
---

You can use the whitelist configuration parameters to control this. See oozie.service.HadoopAccessorService.jobTracker.whitelist and oozie.service.HadoopAccessorService.nameNode.whitelist.
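
To illustrate the idea behind a NameNode whitelist, here is a small JDK-only sketch that checks a job's nameNode URI against a comma-separated list of allowed host:port values. It is not Oozie's implementation; the class `NameNodeWhitelistSketch`, the host names and the matching rule (exact, case-insensitive authority match) are assumptions made for the example.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

class NameNodeWhitelistSketch {

    // Splits a comma-separated whitelist into normalized host:port entries.
    static Set<String> parseWhitelist(String commaSeparated) {
        return Arrays.stream(commaSeparated.split(","))
                .map(s -> s.trim().toLowerCase(Locale.ROOT))
                .filter(s -> !s.isEmpty())
                .collect(Collectors.toSet());
    }

    // Returns true only if the URI's authority (host:port) is on the whitelist.
    static boolean isAllowed(String nameNodeUri, Set<String> whitelist) {
        String authority = URI.create(nameNodeUri).getAuthority();
        return authority != null && whitelist.contains(authority.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        // Hypothetical whitelist value and job settings.
        Set<String> whitelist = parseWhitelist("nn1.example.com:8020, nn2.example.com:8020");
        System.out.println(isAllowed("hdfs://nn1.example.com:8020", whitelist));
        System.out.println(isAllowed("hdfs://typo-host:8020", whitelist));
    }
}
```

A check of this kind rejects a mistyped address before any connection attempt is made. It cannot help when a whitelisted NameNode is simply down, which is the case reported elsewhere in this thread.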