[ 
https://issues.apache.org/jira/browse/SAMZA-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16609490#comment-16609490
 ] 

ASF GitHub Bot commented on SAMZA-1824:
---------------------------------------

Github user asfgit closed the pull request at:

    https://github.com/apache/samza/pull/615


> Samza AM does not handle some failures during container launch
> --------------------------------------------------------------
>
>                 Key: SAMZA-1824
>                 URL: https://issues.apache.org/jira/browse/SAMZA-1824
>             Project: Samza
>          Issue Type: Bug
>            Reporter: Jagadish
>            Priority: Major
>
> I noticed this behavior in the AM today where it fails to allocate a new 
> container on failure to start.  The AM seems to get a callback for a 
> container failure during container startup and is never rescheduled again by 
> the AM. The logs seem to suggest that the request was made by the AM to start 
> container `container_e20_1528615592911_1987_02_000062` as seen below:
>  {code}
> 2018-08-06 15:39:46.369 [Container Allocator Thread] 
> YarnClusterResourceManager [INFO] Received launch request for 12 on hostname 
> lca1-app0596.stg.linkedin.com
> 2018-08-06 15:39:46.974 [Container Allocator Thread] 
> YarnClusterResourceManager [INFO] Got available container ID (12) for 
> container: Container: [ContainerId: 
> container_e20_1528615592911_1987_02_000062, NodeId: 
> lca1-app0596.stg.linkedin.com:1158, NodeHttpAddress: 
> lca1-app0596.stg.linkedin.com:8042, Resource: <memory:4096, vCores:1>, 
> Priority: 1, Token: Token { kind: ContainerToken, service: 
> 10.251.166.210:1158 }, ]
> 2018-08-06 15:39:46.974 [Container Allocator Thread] 
> YarnClusterResourceManager [INFO] In runContainer in util: fwkPath= 
> ;cmdPath=./__package/;jobLib=
> 2018-08-06 15:39:46.974 [Container Allocator Thread] 
> YarnClusterResourceManager [INFO] Container ID 12 using command 
> ./__package//bin/run-container.sh
> 2018-08-06 15:39:46.975 [Container Allocator Thread] 
> YarnClusterResourceManager [INFO] Container ID 12 using environment variables:
> SAMZA_CONTAINER_ID=12
> EXECUTION_ENV_CONTAINER_ID=container_e20_1528615592911_1987_02_000062
> SAMZA_COORDINATOR_URL=http://lca1-app1576.stg.linkedin.com:44222/
> JAVA_OPTS=-Xmx3072m -Dcom.linkedin.app.name=ad-web-analytics-event-aggregator 
> -Dcom.linkedin.app.env=ei-lca1
>  {code}
> But due to the issue where the localization phase fails (currently under 
> investigation) the container fails to start, throwing the following exception 
> to the `onStartContainerError` callback in `YarnClusterResourceManager`.
>  {code}
> 2018-08-06 15:46:15.257 
> [org.apache.hadoop.yarn.client.api.async.impl.NMClientAsyncImpl #6] 
> YarnClusterResourceManager [ERROR] Container: 
> container_e20_1528615592911_1987_02_000062 could not start.
> java.net.ConnectException: Call From 
> lca1-app1576.stg.linkedin.com/10.251.174.104 to 
> lca1-app0596.stg.linkedin.com:1158 failed on connection exception: 
> java.net.ConnectException: Connection refused; For more details see:  
> http://wiki.apache.org/hadoop/ConnectionRefused
>     at sun.reflect.GeneratedConstructorAccessor374.newInstance(Unknown Source)
>     at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>     at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>     at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:791)
>     at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:731)
>     at org.apache.hadoop.ipc.Client.call(Client.java:1473)
>     at org.apache.hadoop.ipc.Client.call(Client.java:1400)
>     at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
>     at com.sun.proxy.$Proxy34.startContainers(Unknown Source)
>     at 
> org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:96)
>     at sun.reflect.GeneratedMethodAccessor19.invoke(Unknown Source)
>     at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
>     at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>     at com.sun.proxy.$Proxy35.startContainers(Unknown Source)
>     at 
> org.apache.hadoop.yarn.client.api.impl.NMClientImpl.startContainer(NMClientImpl.java:201)
>     at 
> org.apache.hadoop.yarn.client.api.async.impl.NMClientAsyncImpl$StatefulContainer$StartContainerTransition.transition(NMClientAsyncImpl.java:377)
>     at 
> org.apache.hadoop.yarn.client.api.async.impl.NMClientAsyncImpl$StatefulContainer$StartContainerTransition.transition(NMClientAsyncImpl.java:363)
>     at 
> org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
>     at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>     at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>     at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
>     at 
> org.apache.hadoop.yarn.client.api.async.impl.NMClientAsyncImpl$StatefulContainer.handle(NMClientAsyncImpl.java:498)
>     at 
> org.apache.hadoop.yarn.client.api.async.impl.NMClientAsyncImpl$ContainerEventProcessor.run(NMClientAsyncImpl.java:557)
>     at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
> Caused by: java.net.ConnectException: Connection refused
>     at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>     at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
>     at 
> org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
>     at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530)
>     at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:494)
>     at 
> org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:608)
>     at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:706)
>     at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:369)
>     at org.apache.hadoop.ipc.Client.getConnection(Client.java:1522)
>     at org.apache.hadoop.ipc.Client.call(Client.java:1439)
>     ... 22 more
> 2018-08-06 15:46:15.259 
> [org.apache.hadoop.yarn.client.api.async.impl.NMClientAsyncImpl #6] 
> YarnClusterResourceManager [INFO] Got an invalid notification for container: 
> container_e20_1528615592911_1987_02_000062
>  {code}
> Looking at the code in `onStartContainerError`: it seems like we are picking 
> a container from a Map called ` containersPendingStartup` which doesn’t seem 
> to have valid containers (quick code search did not show when this map is 
> populated). This causes these failure callbacks to go unaddressed leaving the 
> job to have containers that are always pending and the only resolution is to 
> bounce the job.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to