Re: Slider AM fails to run when RM in HA setup fails over
Hello,

I have uploaded the requested logs, configurations, and my observations on the logs to https://issues.apache.org/jira/browse/SLIDER-1158. Would greatly appreciate it if someone could take a look at the ticket and provide any pointers on what could be leading to the observed behavior.

Thanks in advance,
Manoj

On Thu, Jul 28, 2016 at 7:01 PM, Manoj Samel wrote:
> Hi Gour,
>
> I added the properties below to /etc/hadoop/conf/yarn-site.xml, emptied
> /data/slider/conf/slider-client.xml, and restarted both RMs.
>
> - hadoop.registry.zk.quorum
> - hadoop.registry.zk.root
> - slider.yarn.queue
>
> Now there are no issues in creating or destroying a cluster. This helps
> as it keeps all configs in one location - thanks for the update.
>
> I am still hitting the original issue - starting the application with
> RM1 active and then failing over from RM1 to RM2 leads to the Slider AM
> getting "Client cannot authenticate via:[TOKEN]" errors.
>
> I will upload the config files soon ...
>
> Thanks,
>
> On Thu, Jul 28, 2016 at 5:28 PM, Manoj Samel wrote:
>
>> Thanks. I will test with the updated config and then upload the latest
>> ones ...
>>
>> Thanks,
>>
>> Manoj
>>
>> On Thu, Jul 28, 2016 at 5:21 PM, Gour Saha wrote:
>>
>>> slider.zookeeper.quorum is deprecated and should not be used.
>>> hadoop.registry.zk.quorum is used instead and is typically defined in
>>> yarn-site.xml, as is hadoop.registry.zk.root.
>>>
>>> Specifying slider.yarn.queue at the cluster config level is
>>> discouraged. Ideally the queue is specified at application submission,
>>> so you can use the --queue option with the slider create command. You
>>> can also set it on the command line with -D slider.yarn.queue=<>
>>> during the create call. If indeed all Slider apps should go to one and
>>> only one queue, then this property can be specified in any one of the
>>> existing site xml files under /etc/hadoop/conf.
>>>
>>> -Gour
>>>
>>> On 7/28/16, 4:43 PM, "Manoj Samel" wrote:
>>>
>>> > The following Slider-specific properties are at present added in
>>> > /data/slider/conf/slider-client.xml. If you think they should be
>>> > picked up from HADOOP_CONF_DIR (/etc/hadoop/conf), which file in
>>> > HADOOP_CONF_DIR should these be added to?
>>> >
>>> > - slider.zookeeper.quorum
>>> > - hadoop.registry.zk.quorum
>>> > - hadoop.registry.zk.root
>>> > - slider.yarn.queue
>>> >
>>> > On Thu, Jul 28, 2016 at 4:37 PM, Gour Saha wrote:
>>> >
>>> >> That is strange, since it is indeed not required for
>>> >> slider-client.xml to contain anything (except the root
>>> >> <configuration> element) if HADOOP_CONF_DIR has everything that
>>> >> Slider needs. This probably indicates that there might be some
>>> >> issue with the cluster configuration based solely on files under
>>> >> HADOOP_CONF_DIR to begin with.
>>> >>
>>> >> I suggest you upload all the config files to the jira to help
>>> >> debug this further.
>>> >>
>>> >> -Gour
>>> >>
>>> >> On 7/28/16, 4:27 PM, "Manoj Samel" wrote:
>>> >>
>>> >> > Thanks Gour for the prompt reply.
>>> >> >
>>> >> > BTW - Creating an empty slider-client.xml (with just the root
>>> >> > <configuration> element) does not work.
>>> >> > The AM starts but fails to create any components, and shows
>>> >> > errors like:
>>> >> >
>>> >> > 2016-07-28 23:18:46,018
>>> >> > [AmExecutor-006-SendThread(localhost.localdomain:2181)] WARN
>>> >> > zookeeper.ClientCnxn - Session 0x0 for server null, unexpected
>>> >> > error, closing socket connection and attempting reconnect
>>> >> > java.net.ConnectException: Connection refused
>>> >> >         at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>>> >> >         at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
>>> >> >         at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
>>> >> >         at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
>>> >> >
>>> >> > Also, the command "slider destroy " fails with zookeeper
>>> >> > errors ...
>>> >> >
>>> >> > I had to keep a minimal slider-client.xml. It does not have any
>>> >> > RM info etc., but does contain Slider ZK-related properties like
>>> >> > "slider.zookeeper.quorum", "hadoop.registry.zk.quorum", and
>>> >> > "hadoop.registry.zk.root". I haven't yet distilled the absolute
>>> >> > minimal set of properties required, but this should suffice for
>>> >> > now. All RM / HDFS properties will be read from HADOOP_CONF_DIR
>>> >> > files.
>>> >> >
>>> >> > Let me know if this could cause any issues.
>>> >> >
>>> >> > On Thu, Jul 28, 2016 at 3:36 PM, Gour Saha wrote:
>>> >> >
>>> >> >> No need to copy any files. Pointing HADOOP_CONF_DIR to
>>> >> >> /etc/hadoop/conf is good.
>>> >> >>
>>> >> >> -Gour
>>> >> >>
>>> >> >> On 7/28/16,
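[Editor's note] As a reference for the configuration move discussed in this thread, a yarn-site.xml fragment carrying the registry properties might look like the sketch below. The ZooKeeper hosts and root path are placeholder values, not taken from the thread:

```xml
<!-- Sketch of registry settings in /etc/hadoop/conf/yarn-site.xml.
     Host names and the registry path below are placeholders. -->
<property>
  <name>hadoop.registry.zk.quorum</name>
  <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
</property>
<property>
  <name>hadoop.registry.zk.root</name>
  <value>/registry</value>
</property>
```

Per Gour's advice above, slider.yarn.queue is better passed at submission time (`slider create ... --queue <name>` or `-D slider.yarn.queue=<name>`) rather than baked into a site file, unless every Slider app truly targets a single queue.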
[jira] [Created] (SLIDER-1162) Create a Docker Provider
Billie Rinaldi created SLIDER-1162:
--------------------------------------

             Summary: Create a Docker Provider
                 Key: SLIDER-1162
                 URL: https://issues.apache.org/jira/browse/SLIDER-1162
             Project: Slider
          Issue Type: New Feature
            Reporter: Billie Rinaldi
            Assignee: Billie Rinaldi
             Fix For: Slider 1.0.0

If we create an agent-less Docker Provider, then we will solve the current problem of executing the agent's python code inside the docker container. Other problems will be created, as we will have to move some of the agent's tasks elsewhere. YARN-5430 will be helpful.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Created] (SLIDER-1161) Improve regionserver status check in HBase Slider app package
Sandeep Nemuri created SLIDER-1161:
--------------------------------------

             Summary: Improve regionserver status check in HBase Slider app package
                 Key: SLIDER-1161
                 URL: https://issues.apache.org/jira/browse/SLIDER-1161
             Project: Slider
          Issue Type: Improvement
          Components: app-package
    Affects Versions: Slider 0.80
         Environment: RHEL-6 (64 Bit)
            Reporter: Sandeep Nemuri

*PROBLEM*:
Using Slider for launching HBase containers. Following is the problem statement and details:

1. Assume a region server goes into a long pause and loses its heartbeat with ZooKeeper.
2. HMaster notices this and marks the region server as DEAD.
3. However, the Slider agent continues to 'ps' the region server process every heartbeat.monitor.interval (45000 ms in my case), and because it is just checking that the region server process is alive, it does not consider it dead.
4. After that long delay, the region server finally recovers and reports to HMaster.
5. HMaster informs the region server with YouAreAlreadyDeadException.
6. Now this region server brings itself down, and Slider also notices that the process is no longer running.
7. Slider now launches a new region server.

The issue, as the steps above make clear, is that there can be a huge delay between steps 4 and 6. This means that we are operating with fewer region servers, which puts more and more load on the existing region servers.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Updated] (SLIDER-1161) Improve regionserver status check in HBase Slider app package
[ https://issues.apache.org/jira/browse/SLIDER-1161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sandeep Nemuri updated SLIDER-1161:
-----------------------------------
    Description:

*PROBLEM*:
Using Slider for launching HBase containers. Following is the problem statement and details:

1. Assume a region server goes into a long pause and loses its heartbeat with ZooKeeper.
2. HMaster notices this and marks the region server as DEAD.
3. However, the Slider agent continues to 'ps' the region server process every heartbeat.monitor.interval (45000 ms in my case), and because it is just checking that the region server process is alive, it does not consider it dead.
4. After that long delay, the region server finally recovers and reports to HMaster.
5. HMaster informs the region server with YouAreAlreadyDeadException.
6. Now this region server brings itself down, and Slider also notices that the process is no longer running.
7. Slider now launches a new region server.

The issue, as the steps above make clear, is that there can be a huge delay between steps 4 and 6. This means that we are operating with fewer region servers, which puts more and more load on the existing region servers.

The issue can be solved if Slider would sync up with HMaster to find out whether a region server is alive or not. That way, it would immediately know that HMaster has already marked a region server as dead, and would then bring down the region server and launch a new one.

  was: the same description, without the final paragraph proposing the sync with HMaster.

> Improve regionserver status check in HBase Slider app package
> -------------------------------------------------------------
>
>                 Key: SLIDER-1161
>                 URL: https://issues.apache.org/jira/browse/SLIDER-1161
>             Project: Slider
>          Issue Type: Improvement
>          Components: app-package
>    Affects Versions: Slider 0.80
>         Environment: RHEL-6 (64 Bit)
>            Reporter: Sandeep Nemuri

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
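[Editor's note] The fix proposed in SLIDER-1161 — cross-checking liveness with HMaster instead of only 'ps'-ing the process — can be sketched as the decision below. This is an illustrative sketch only, not the app package's actual check; the function and parameter names are invented, and obtaining the live-server list (e.g. from HMaster's status) is left abstract:

```python
def regionserver_status(process_alive, live_servers, server_name):
    """Decide region server health for the Slider agent.

    A plain 'ps'-style check (process_alive alone) misses the window where
    the process is still running but HMaster has already marked the server
    DEAD; consulting HMaster's live-server list closes that window.
    """
    if not process_alive:
        # Process gone: the existing ps-based check already catches this.
        return "DEAD"
    if server_name not in live_servers:
        # Process still up, but HMaster no longer lists it as live; treat
        # it as dead now instead of waiting for YouAreAlreadyDeadException.
        return "DEAD"
    return "ALIVE"
```

With such a check, step 7 above (relaunching a replacement region server) could happen right after step 2 instead of after step 6.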