Re: Slider 0.92 on CDH 5.5.1 (Hadoop 2.6) - AM log shows NPE at component heartbeat URI

2017-05-09 Thread Manoj Samel
I think I found what causes the NPE above in 0.92 and why the same setup
works in version 0.80.

The component name (a.k.a. role name) is "solo___super", i.e. it contains
three consecutive underscores ("___").

0.92 seems to introduce a new concept, "Role Group", which was not present
in 0.80.

In 0.92 - AgentProviderService.java

  private static final String LABEL_MAKER = "___";
  ...
  private String getRoleName(String label) {
    int index1 = label.indexOf(LABEL_MAKER);
    int index2 = label.lastIndexOf(LABEL_MAKER);
    if (index1 == index2) {
      return label.substring(index1 + LABEL_MAKER.length());
    } else {
      return label.substring(index1 + LABEL_MAKER.length(), index2);
    }
  }

  private String getRoleGroup(String label) {
    return label.substring(label.lastIndexOf(LABEL_MAKER) +
        LABEL_MAKER.length());
  }
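
As a quick check of how these two helpers split a label, here is a minimal
standalone sketch. The class name LabelSplit092 and the main method are mine
and purely illustrative; the two method bodies are copied from the 0.92
snippet above (only made static), and the label is the one from the AM log
in my earlier mail.

```java
// Illustrative only: standalone copy of the two 0.92 helpers quoted above.
// LabelSplit092 is a hypothetical class name, not part of Slider.
class LabelSplit092 {
  private static final String LABEL_MAKER = "___";

  // Text between the first and the last "___" (or after it, if there is only one).
  static String getRoleName(String label) {
    int index1 = label.indexOf(LABEL_MAKER);
    int index2 = label.lastIndexOf(LABEL_MAKER);
    if (index1 == index2) {
      return label.substring(index1 + LABEL_MAKER.length());
    } else {
      return label.substring(index1 + LABEL_MAKER.length(), index2);
    }
  }

  // Text after the last "___".
  static String getRoleGroup(String label) {
    return label.substring(label.lastIndexOf(LABEL_MAKER) + LABEL_MAKER.length());
  }

  public static void main(String[] args) {
    // Label taken from the registration line in the AM log.
    String label = "container_e95_1476898378926_91401_01_03___solo___super";
    System.out.println("roleName  = " + getRoleName(label));
    System.out.println("roleGroup = " + getRoleGroup(label));
  }
}
```

For that label, getRoleName returns "solo" and getRoleGroup returns "super".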

So when the real role name itself contains "___", e.g. "solo___super",
getRoleName on the container label returns just "solo" instead of
"solo___super" (and getRoleGroup returns "super"). That wrong role name is
what seems to cause the NPE.

The same role name works in 0.80 because 0.80 has no concept of roleGroup.

In 0.80 - AgentProviderService.java

  private String getRoleName(String label) {
    return label.substring(label.indexOf(LABEL_MAKER) +
        LABEL_MAKER.length());
  }

So in 0.80, getRoleName returns the correct role name "solo___super" from
the container label.
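
For comparison, a minimal sketch of the 0.80 behaviour. The class name
LabelSplit080 and the main method are hypothetical; the method body is the
0.80 snippet above (only made static), and I am assuming LABEL_MAKER is the
same "___" constant as in 0.92.

```java
// Illustrative only: standalone copy of the 0.80 helper quoted above.
// LabelSplit080 is a hypothetical class name, not part of Slider.
class LabelSplit080 {
  // Assumed to be the same marker as in 0.92.
  private static final String LABEL_MAKER = "___";

  // Everything after the FIRST "___" is treated as the role name.
  static String getRoleName(String label) {
    return label.substring(label.indexOf(LABEL_MAKER) + LABEL_MAKER.length());
  }

  public static void main(String[] args) {
    String label = "container_e95_1476898378926_91401_01_03___solo___super";
    System.out.println("roleName = " + getRoleName(label));
  }
}
```

Here the same label yields "solo___super", because only the first marker is
used as the split point.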


1) I tried to understand what a roleGroup is and how it is used, but could
not locate any documentation. Can someone give a few lines of explanation?
2) Should this be considered a bug in 0.92? If not, and LABEL_MAKER must not
appear in any role name, then at least clear documentation AND a clear check
when accepting config files would help. I.e. Slider 0.92 should report an
error such as "invalid role name" when creating the cluster or when
accepting configs during any other operation.
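
To make point 2 concrete, this is roughly the kind of check I have in mind.
It is only a sketch: RoleNameCheck and validateRoleName are hypothetical
names, not existing Slider classes or methods, and where such a check would
be wired into Slider is left open.

```java
// Hypothetical sketch of an up-front role name check; not existing Slider code.
class RoleNameCheck {
  // Same marker string that AgentProviderService uses to build container labels.
  static final String LABEL_MAKER = "___";

  // Rejects empty names and names that contain the label marker.
  static void validateRoleName(String roleName) {
    if (roleName == null || roleName.isEmpty()) {
      throw new IllegalArgumentException("Invalid role name: name is empty");
    }
    if (roleName.contains(LABEL_MAKER)) {
      throw new IllegalArgumentException(
          "Invalid role name '" + roleName + "': must not contain '" + LABEL_MAKER + "'");
    }
  }

  public static void main(String[] args) {
    validateRoleName("tenant1"); // accepted: no marker in the name
    try {
      validateRoleName("solo___super"); // rejected: contains "___"
    } catch (IllegalArgumentException e) {
      System.out.println(e.getMessage());
    }
  }
}
```

With something like this in place, a name such as "solo___super" would be
rejected when the config is accepted instead of failing later at heartbeat
time.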

Thanks in advance,


On Tue, Apr 11, 2017 at 6:09 PM, Manoj Samel 
wrote:

> Hi
>
> Running slider 0.92 on CDH 5.5.1 (which is Hadoop 2.6), with Kerberos
>
> I am deploying an application with multiple components. The components
> start but fail to heartbeat to the Slider AM. The Slider AM log shows an
> NPE at the container heartbeat URLs, as below.
>
> I have attached the complete slider AM log
>
> 2017-04-12 00:44:05,741 [2011871076@qtp-814377348-5] INFO
>  agent.AgentProviderService - Handling registration: responseId=-1
> timestamp=1491957845550
> label=container_e95_1476898378926_91401_01_03___solo___super
> hostname=node1078
> expectedState=INIT
> actualState=INIT
> appVersion=null
>
> 2017-04-12 00:44:05,741 [2011871076@qtp-814377348-5] INFO
>  agent.AgentProviderService - label: 
> container_e95_1476898378926_91401_01_03___solo___super
> pkg: null
> 2017-04-12 00:44:05,741 [2011871076@qtp-814377348-5] INFO
>  agent.AgentProviderService - Registration response:
> RegistrationResponse{response=OK, responseId=0, statusCommands=null}
> 2017-04-12 00:44:05,871 [Socket Reader #1 for port 32120] INFO  ipc.Server
> - Auth successful for slideradmin@BIGDATA (auth:SIMPLE)
> 2017-04-12 00:44:05,873 [Socket Reader #1 for port 32120] INFO  
> authorize.ServiceAuthorizationManager
> - Authorization successful for slideradmin@BIGDATA (auth:TOKEN) for
> protocol=interface org.apache.slider.server.appmaster.rpc.
> SliderClusterProtocolPB
> 2017-04-12 00:44:15,749 [100585@qtp-814377348-7] ERROR mortbay.log -
> /ws/v1/slider/agents/container_e95_1476898378926_
> 91401_01_02___pdx__svt___ten85/heartbeat
> java.lang.NullPointerException
> at org.apache.slider.providers.agent.AgentProviderService.
> handleHeartBeat(AgentProviderService.java:1090)
> at org.apache.slider.server.appmaster.web.rest.agent.
> AgentResource.heartbeat(AgentResource.java:98)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(
> NativeMethodAccessorImpl.java:62)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(
> DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(
> JavaMethodInvokerFactory.java:60)
> at com.sun.jersey.server.impl.model.method.dispatch.
> AbstractResourceMethodDispatchProvider$TypeOutInvoker._dispatch(
> AbstractResourceMethodDispatchProvider.java:185)
> at com.sun.jersey.server.impl.model.method.dispatch.
> ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.
> java:75)
> at com.sun.jersey.server.impl.uri.rules.HttpMethodRule.
> accept(HttpMethodRule.java:288)
> at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.
> accept(RightHandPathRule.java:147)
> at com.sun.jersey.server.impl.uri.rules.SubLocatorRule.
> accept(SubLocatorRule.java:134)
> at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.
> accept(RightHandPathRule.java:147)
> at 

slider 0.92 - After upgrade, existing component keeps showing "awaiting heartbeat..."

2017-05-09 Thread Manoj Samel
Slider 0.92 on secured hadoop 2.6 cluster

I have an app with component "tenant1" running. I add another component
"tenant2" and do an upgrade. After that, the original component keeps
showing "awaiting heartbeat..." in the output of "slider list
--containers". The component and AM logs do not seem to contain any errors.

Is this an issue in Slider 0.92? In Slider 0.80, the existing components
eventually show the app version.

[root@node81 foo]# slider list foo --containers
2017-05-10 01:25:51,767 [main] INFO  tools.SliderUtils - JVM initialized
into secure mode with kerberos realm BIGDATA
foo   RUNNING  application_1493349422099_7053
   http://node80:23188/proxy/application_1493349422099_7053/
2017-05-10 01:25:52,846 [main] INFO  util.ExitUtil - Exiting with status 0
Containers:
  Component Name    App Version
  Container Id  Container Info/Logs
  slider-appmaster
    container_e96_1493349422099_7053_02_01
    http://node78:32123/node/container/container_e96_1493349422099_7053_02_01
  tenant2           1.0.0
    container_e96_1493349422099_7053_02_02
    http://node78:23999/node/container/container_e96_1493349422099_7053_02_02
  tenant1           awaiting heartbeat...
    container_e96_1493349422099_7053_01_02
    http://node77:23999/node/container/container_e96_1493349422099_7053_01_02