Re: Slider AM fails to run when RM in HA setup fails over

2016-08-01 Thread Manoj Samel
Hello,

I have uploaded the requested logs, configurations, and my observations on
the logs to https://issues.apache.org/jira/browse/SLIDER-1158.

I would greatly appreciate it if someone could take a look at the Slider
ticket and provide pointers on what could be leading to the observed
behavior.

Thanks in advance,

Manoj

On Thu, Jul 28, 2016 at 7:01 PM, Manoj Samel wrote:

> Hi Gour,
>
> I added the following properties to /etc/hadoop/conf/yarn-site.xml, emptied
> /data/slider/conf/slider-client.xml, and restarted both RMs:
>
>- hadoop.registry.zk.quorum
>- hadoop.registry.zk.root
>- slider.yarn.queue
>
> Now there are no issues creating or destroying clusters. This helps, as it
> keeps all configs in one location - thanks for the update.
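>
> For reference, here is a sketch of the yarn-site.xml entries (the quorum
> hosts, registry root, and queue name below are placeholders, not our
> actual values):
>
> <property>
>   <name>hadoop.registry.zk.quorum</name>
>   <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
> </property>
> <property>
>   <name>hadoop.registry.zk.root</name>
>   <value>/registry</value>
> </property>
> <property>
>   <name>slider.yarn.queue</name>
>   <value>default</value>
> </property>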
>
> I am still hitting the original issue - starting the application with RM1
> active and then failing over from RM1 to RM2 leads to the Slider AM getting
> "Client cannot authenticate via:[TOKEN]" errors.
>
> I will upload the config files soon ...
>
> Thanks,
>
> On Thu, Jul 28, 2016 at 5:28 PM, Manoj Samel wrote:
>
>> Thanks. I will test with the updated config and then upload the latest
>> ones ...
>>
>> Thanks,
>>
>> Manoj
>>
>> On Thu, Jul 28, 2016 at 5:21 PM, Gour Saha  wrote:
>>
>>> slider.zookeeper.quorum is deprecated and should not be used.
>>> hadoop.registry.zk.quorum is used instead and is typically defined in
>>> yarn-site.xml, as is hadoop.registry.zk.root.
>>>
>>> Specifying slider.yarn.queue at the cluster config level is discouraged.
>>> Ideally, the queue is specified at application submission time, so you
>>> can use the --queue option with the slider create command. You can also
>>> set it on the command line using -D slider.yarn.queue=<> during the
>>> create call. If all Slider apps really should go to one and only one
>>> queue, then this property can be specified in any one of the existing
>>> site XML files under /etc/hadoop/conf.
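>>>
>>> For example (a sketch - the application name, queue name, and elided
>>> arguments below are placeholders, not actual values):
>>>
>>> # submit to a specific queue at create time
>>> slider create hbase1 --queue batch ...
>>> # equivalent, by overriding the property on the command line
>>> slider create hbase1 -D slider.yarn.queue=batch ...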
>>>
>>> -Gour
>>>
>>> On 7/28/16, 4:43 PM, "Manoj Samel"  wrote:
>>>
>>> >The following Slider-specific properties are currently set in
>>> >/data/slider/conf/slider-client.xml. If you think they should be picked
>>> >up from HADOOP_CONF_DIR (/etc/hadoop/conf) instead, which file in
>>> >HADOOP_CONF_DIR should these be added to?
>>> >
>>> >   - slider.zookeeper.quorum
>>> >   - hadoop.registry.zk.quorum
>>> >   - hadoop.registry.zk.root
>>> >   - slider.yarn.queue
>>> >
>>> >
>>> >On Thu, Jul 28, 2016 at 4:37 PM, Gour Saha wrote:
>>> >
>>> >> That is strange, since slider-client.xml is indeed not required to
>>> >> contain anything (except an empty <configuration/> element) if
>>> >> HADOOP_CONF_DIR has everything that Slider needs. This probably
>>> >> indicates that there is some issue with the cluster configuration
>>> >> based solely on files under HADOOP_CONF_DIR to begin with.
>>> >>
>>> >> I suggest you upload all the config files to the jira to help debug
>>> >> this further.
>>> >>
>>> >> -Gour
>>> >>
>>> >> On 7/28/16, 4:27 PM, "Manoj Samel"  wrote:
>>> >>
>>> >> >Thanks Gour for the prompt reply.
>>> >> >
>>> >> >BTW - creating an empty slider-client.xml (with just an empty
>>> >> ><configuration/> element) does not work. The AM starts but fails to
>>> >> >create any components and shows errors like:
>>> >> >
>>> >> >2016-07-28 23:18:46,018 [AmExecutor-006-SendThread(localhost.localdomain:2181)] WARN zookeeper.ClientCnxn - Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
>>> >> >java.net.ConnectException: Connection refused
>>> >> >at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>>> >> >at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
>>> >> >at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
>>> >> >at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
>>> >> >
>>> >> >Also, the command "slider destroy " fails with ZooKeeper errors ...
>>> >> >
>>> >> >I had to keep a minimal slider-client.xml. It does not have any RM
>>> >> >info etc., but does contain Slider ZK-related properties like
>>> >> >"slider.zookeeper.quorum", "hadoop.registry.zk.quorum", and
>>> >> >"hadoop.registry.zk.root". I haven't yet distilled the absolute
>>> >> >minimal set of properties required, but this should suffice for now.
>>> >> >All RM / HDFS properties will be read from HADOOP_CONF_DIR files.
>>> >> >
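>>> >> >A sketch of that interim minimal slider-client.xml (the quorum hosts
>>> >> >and registry root below are placeholders for our actual values):
>>> >> >
>>> >> ><configuration>
>>> >> >  <property>
>>> >> >    <name>slider.zookeeper.quorum</name>
>>> >> >    <value>zk1.example.com:2181,zk2.example.com:2181</value>
>>> >> >  </property>
>>> >> >  <property>
>>> >> >    <name>hadoop.registry.zk.quorum</name>
>>> >> >    <value>zk1.example.com:2181,zk2.example.com:2181</value>
>>> >> >  </property>
>>> >> >  <property>
>>> >> >    <name>hadoop.registry.zk.root</name>
>>> >> >    <value>/registry</value>
>>> >> >  </property>
>>> >> ></configuration>
>>> >> >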
>>> >> >Let me know if this could cause any issues.
>>> >> >
>>> >> >On Thu, Jul 28, 2016 at 3:36 PM, Gour Saha wrote:
>>> >> >
>>> >> >> No need to copy any files. Pointing HADOOP_CONF_DIR to
>>> >> >> /etc/hadoop/conf is good.
>>> >> >>
>>> >> >> -Gour
>>> >> >>
>>> >> >> On 7/28/16, 

[jira] [Created] (SLIDER-1162) Create a Docker Provider

2016-08-01 Thread Billie Rinaldi (JIRA)
Billie Rinaldi created SLIDER-1162:
--

 Summary: Create a Docker Provider
 Key: SLIDER-1162
 URL: https://issues.apache.org/jira/browse/SLIDER-1162
 Project: Slider
  Issue Type: New Feature
Reporter: Billie Rinaldi
Assignee: Billie Rinaldi
 Fix For: Slider 1.0.0


If we create an agent-less Docker Provider, we will solve the current problem 
of executing the agent's Python code inside the Docker container. Other 
problems will be created, as we will have to move some of the agent's tasks 
elsewhere. YARN-5430 will be helpful.





[jira] [Created] (SLIDER-1161) Improve regionserver status check in HBase Slider app package

2016-08-01 Thread Sandeep Nemuri (JIRA)
Sandeep Nemuri created SLIDER-1161:
--

 Summary: Improve regionserver status check in HBase Slider app package
 Key: SLIDER-1161
 URL: https://issues.apache.org/jira/browse/SLIDER-1161
 Project: Slider
  Issue Type: Improvement
  Components: app-package
Affects Versions: Slider 0.80
 Environment: RHEL-6 (64 Bit)
Reporter: Sandeep Nemuri


*PROBLEM*:

We are using Slider to launch HBase containers. The problem statement and 
details are as follows:
1. Assume a region server goes into a long pause and loses its heartbeat with ZooKeeper.
2. The HMaster notices this and marks the region server as DEAD.
3. However, the Slider agent just continues to 'ps' the region server process every heartbeat.monitor.interval (45000 ms in my case), and because it only checks that the process is alive, it does not consider the region server dead.
4. After that long pause, the region server finally recovers and reports to the HMaster.
5. The HMaster responds with a YouAreDeadException.
6. The region server now brings itself down, and Slider notices that the process is no longer running.
7. Slider launches a new region server.

The issue, as the steps above show, is that there can be a huge delay between 
steps 4 and 6. During that window we are operating with fewer region servers, 
which puts more and more load on the remaining ones.






[jira] [Updated] (SLIDER-1161) Improve regionserver status check in HBase Slider app package

2016-08-01 Thread Sandeep Nemuri (JIRA)

 [ 
https://issues.apache.org/jira/browse/SLIDER-1161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Nemuri updated SLIDER-1161:
---
Description: 
*PROBLEM*:

We are using Slider to launch HBase containers. The problem statement and 
details are as follows:
1. Assume a region server goes into a long pause and loses its heartbeat with ZooKeeper.
2. The HMaster notices this and marks the region server as DEAD.
3. However, the Slider agent just continues to 'ps' the region server process every heartbeat.monitor.interval (45000 ms in my case), and because it only checks that the process is alive, it does not consider the region server dead.
4. After that long pause, the region server finally recovers and reports to the HMaster.
5. The HMaster responds with a YouAreDeadException.
6. The region server now brings itself down, and Slider notices that the process is no longer running.
7. Slider launches a new region server.

The issue, as the steps above show, is that there can be a huge delay between 
steps 4 and 6. During that window we are operating with fewer region servers, 
which puts more and more load on the remaining ones.

The issue could be solved if Slider synced up with the HMaster to find out 
whether a region server is alive. That way, Slider would immediately know that 
the HMaster has already marked a region server as dead, and could then bring 
down that region server and launch a new one.
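
A rough sketch of the kind of master-side liveness check this would need, 
assuming the HBase 1.x Java client API is available to the checking code (the 
class RegionServerLivenessCheck and its wiring into the Slider agent are 
hypothetical; getClusterStatus()/getServers() are the standard Admin API of 
that era):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.ClusterStatus;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.ServerName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class RegionServerLivenessCheck {

  // Returns true only if the HMaster currently lists a region server on
  // the given host among its live servers. A process that is 'ps'-alive
  // but absent from this list has already been marked dead by the master,
  // so the agent could replace it without waiting for step 6.
  public static boolean isLiveOnMaster(String hostname) throws IOException {
    Configuration conf = HBaseConfiguration.create();  // reads hbase-site.xml
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Admin admin = conn.getAdmin()) {
      ClusterStatus status = admin.getClusterStatus();
      for (ServerName sn : status.getServers()) {      // live region servers
        if (sn.getHostname().equals(hostname)) {
          return true;
        }
      }
      return false;
    }
  }
}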



