[ https://issues.apache.org/jira/browse/SAMZA-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16591946#comment-16591946 ]
ASF GitHub Bot commented on SAMZA-1824:
---------------------------------------

GitHub user vjagadish1989 opened a pull request:

    https://github.com/apache/samza/pull/615

    SAMZA-1824: Handle errors from the async-NMClient when launching containers

    - Updated internal state that tracks "pending" containers correctly
    - Refactored `YarnClusterResourceManager` for testability. Added a unit test

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/vjagadish1989/samza container-launch-error

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/samza/pull/615.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #615

----
commit ad5436a4bdce69fb4ed072ddb8882aa631d2ddb6
Author: Jagadish <jvenkatraman@...>
Date:   2018-05-30T22:06:03Z

    Add logging for EventHubs configs

commit 4976d2157bdd9146c7df8950eed46dce26bb95de
Author: Jagadish <jvenkatraman@...>
Date:   2018-05-30T22:49:46Z

    Fix a checkstyle failure

commit 85de1b36752597bb4e147099c583dc6aeaef8eb6
Author: Jagadish <jvenkatraman@...>
Date:   2018-06-05T19:17:48Z

    Merge branch 'master' of https://github.com/apache/samza

commit 2124d4ed1d9f6fcca753ce3fb3c74ceae097f616
Author: Jagadish <jvenkatraman@...>
Date:   2018-06-08T18:00:16Z

    Merge branch 'master' of https://github.com/apache/samza

commit 3d03e3b9af7094bc7026922d22b3104943dd4343
Author: Jagadish <jvenkatraman@...>
Date:   2018-06-11T17:49:01Z

    Merge branch 'master' of https://github.com/apache/samza

commit 7db95735d9e517193edfc594ba11d05eefeefc3f
Author: Jagadish <jvenkatraman@...>
Date:   2018-06-14T03:45:53Z

    Add metric for effectiveness of host-affinity

commit 86332640e380817d4515f82afa9f893c0bb82976
Author: Jagadish <jvenkatraman@...>
Date:   2018-06-19T00:01:14Z

    Merge branch 'master' of https://github.com/apache/samza

commit ba7861b1182fe9442b10a7c1f4993da83efda9d9
Author: Jagadish <jvenkatraman@...>
Date:   2018-07-24T01:28:43Z

    Merge branch 'master' of https://github.com/apache/samza

commit de604d5c1f7a087be40051fd1615089194c22945
Author: Jagadish <jvenkatraman@...>
Date:   2018-07-24T01:33:38Z

    Minor: Disable flaky samza-yarn test. Tracked in SAMZA-1781

commit baf51c08b45b7c22a0d4cfaea57f0ab2efffc84a
Author: Jagadish <jvenkatraman@...>
Date:   2018-07-30T16:26:45Z

    Merge branch 'master' of https://github.com/apache/samza

commit 969d69ac33566b265fd7e5a64e38ffd39eb95510
Author: Jagadish <jvenkatraman@...>
Date:   2018-08-08T00:39:28Z

    Merge branch 'master' of https://github.com/apache/samza

commit 1beb34d9c2fd9c0aed9b4d23c0963f101b76322e
Author: Jagadish <jvenkatraman@...>
Date:   2018-08-16T00:42:05Z

    Merge branch 'master' of https://github.com/apache/samza

commit c95ecbadd0d5acfc1b4457566df3141606d1e012
Author: Jagadish <jvenkatraman@...>
Date:   2018-08-24T02:05:26Z

    Merge branch 'master' of https://github.com/apache/samza

commit cffe2d9d644d3a2bb4ed4d0e87e9f46974685c5f
Author: Jagadish <jvenkatraman@...>
Date:   2018-08-24T03:51:17Z

    Handle errors during container launch

commit d11837509357212cc1bf85a1a9c670d5a8a75afc
Author: Jagadish <jvenkatraman@...>
Date:   2018-08-24T03:53:12Z

    Add a unit test for verifying launch failures

----

> Samza AM does not handle some failures during container launch
> --------------------------------------------------------------
>
>                 Key: SAMZA-1824
>                 URL: https://issues.apache.org/jira/browse/SAMZA-1824
>             Project: Samza
>          Issue Type: Bug
>            Reporter: Jagadish
>            Priority: Major
>
> I noticed this behavior in the AM today: when a container fails to start,
> the AM does not allocate a replacement. The AM appears to receive a callback
> for the failure during container startup, but the container is never
> rescheduled.
> The logs suggest that the AM did request the start of container
> `container_e20_1528615592911_1987_02_000062`, as seen below:
> {code}
> 2018-08-06 15:39:46.369 [Container Allocator Thread] YarnClusterResourceManager [INFO] Received launch request for 12 on hostname lca1-app0596.stg.linkedin.com
> 2018-08-06 15:39:46.974 [Container Allocator Thread] YarnClusterResourceManager [INFO] Got available container ID (12) for container: Container: [ContainerId: container_e20_1528615592911_1987_02_000062, NodeId: lca1-app0596.stg.linkedin.com:1158, NodeHttpAddress: lca1-app0596.stg.linkedin.com:8042, Resource: <memory:4096, vCores:1>, Priority: 1, Token: Token { kind: ContainerToken, service: 10.251.166.210:1158 }, ]
> 2018-08-06 15:39:46.974 [Container Allocator Thread] YarnClusterResourceManager [INFO] In runContainer in util: fwkPath= ;cmdPath=./__package/;jobLib=
> 2018-08-06 15:39:46.974 [Container Allocator Thread] YarnClusterResourceManager [INFO] Container ID 12 using command ./__package//bin/run-container.sh
> 2018-08-06 15:39:46.975 [Container Allocator Thread] YarnClusterResourceManager [INFO] Container ID 12 using environment variables:
> SAMZA_CONTAINER_ID=12
> EXECUTION_ENV_CONTAINER_ID=container_e20_1528615592911_1987_02_000062
> SAMZA_COORDINATOR_URL=http://lca1-app1576.stg.linkedin.com:44222/
> JAVA_OPTS=-Xmx3072m -Dcom.linkedin.app.name=ad-web-analytics-event-aggregator -Dcom.linkedin.app.env=ei-lca1
> {code}
> But because the localization phase fails (an issue currently under
> investigation), the container fails to start, and the following exception is
> passed to the `onStartContainerError` callback in `YarnClusterResourceManager`.
> {code}
> 2018-08-06 15:46:15.257 [org.apache.hadoop.yarn.client.api.async.impl.NMClientAsyncImpl #6] YarnClusterResourceManager [ERROR] Container: container_e20_1528615592911_1987_02_000062 could not start.
> java.net.ConnectException: Call From lca1-app1576.stg.linkedin.com/10.251.174.104 to lca1-app0596.stg.linkedin.com:1158 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
>     at sun.reflect.GeneratedConstructorAccessor374.newInstance(Unknown Source)
>     at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>     at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>     at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:791)
>     at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:731)
>     at org.apache.hadoop.ipc.Client.call(Client.java:1473)
>     at org.apache.hadoop.ipc.Client.call(Client.java:1400)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
>     at com.sun.proxy.$Proxy34.startContainers(Unknown Source)
>     at org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:96)
>     at sun.reflect.GeneratedMethodAccessor19.invoke(Unknown Source)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
>     at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>     at com.sun.proxy.$Proxy35.startContainers(Unknown Source)
>     at org.apache.hadoop.yarn.client.api.impl.NMClientImpl.startContainer(NMClientImpl.java:201)
>     at org.apache.hadoop.yarn.client.api.async.impl.NMClientAsyncImpl$StatefulContainer$StartContainerTransition.transition(NMClientAsyncImpl.java:377)
>     at org.apache.hadoop.yarn.client.api.async.impl.NMClientAsyncImpl$StatefulContainer$StartContainerTransition.transition(NMClientAsyncImpl.java:363)
>     at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
>     at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>     at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>     at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
>     at org.apache.hadoop.yarn.client.api.async.impl.NMClientAsyncImpl$StatefulContainer.handle(NMClientAsyncImpl.java:498)
>     at org.apache.hadoop.yarn.client.api.async.impl.NMClientAsyncImpl$ContainerEventProcessor.run(NMClientAsyncImpl.java:557)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
> Caused by: java.net.ConnectException: Connection refused
>     at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>     at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
>     at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
>     at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530)
>     at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:494)
>     at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:608)
>     at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:706)
>     at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:369)
>     at org.apache.hadoop.ipc.Client.getConnection(Client.java:1522)
>     at org.apache.hadoop.ipc.Client.call(Client.java:1439)
>     ... 22 more
> 2018-08-06 15:46:15.259 [org.apache.hadoop.yarn.client.api.async.impl.NMClientAsyncImpl #6] YarnClusterResourceManager [INFO] Got an invalid notification for container: container_e20_1528615592911_1987_02_000062
> {code}
> Looking at the code in `onStartContainerError`, it seems that we look up the
> container in a map called `containersPendingStartup`, which does not appear
> to hold valid containers (a quick code search did not show where this map is
> populated). As a result, these failure callbacks go unhandled, leaving the
> job with containers that stay pending forever; the only resolution is to
> bounce the job.
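
The failure mode described in the issue comes down to bookkeeping: an error callback can only reschedule a container that it finds in the pending map, so the map has to be populated before the asynchronous start request is issued. The following is a minimal, hypothetical Java sketch of that bookkeeping. It does not use the YARN or Samza APIs; `PendingLaunchTracker` and its method signatures are illustrative assumptions and are not the code in `YarnClusterResourceManager` or in the pull request. Only the names `containersPendingStartup` and `onStartContainerError` are borrowed from the issue text.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Hypothetical sketch (not Samza's actual code): tracks containers whose
 * launch has been requested but not yet confirmed by the node manager.
 */
class PendingLaunchTracker {
  // YARN container ID -> Samza processor ID awaiting start confirmation.
  // Callbacks may arrive on a different thread, hence the concurrent map.
  private final Map<String, String> containersPendingStartup = new ConcurrentHashMap<>();

  /** Record the container BEFORE the asynchronous start request is sent. */
  void onLaunchRequested(String yarnContainerId, String processorId) {
    containersPendingStartup.put(yarnContainerId, processorId);
  }

  /** Start succeeded: the container is no longer pending. */
  void onContainerStarted(String yarnContainerId) {
    containersPendingStartup.remove(yarnContainerId);
  }

  /**
   * Start failed: returns the processor ID that should be rescheduled, or
   * empty if the container was never recorded (the "invalid notification"
   * case seen in the logs).
   */
  Optional<String> onStartContainerError(String yarnContainerId) {
    return Optional.ofNullable(containersPendingStartup.remove(yarnContainerId));
  }

  /** Number of containers still waiting for a start confirmation. */
  int pendingCount() {
    return containersPendingStartup.size();
  }
}
```

With this shape, a caller would invoke `onLaunchRequested` right before handing the container to the async client, and on `onStartContainerError` would request a new resource for the returned processor ID. An empty result signals a callback for a container that was never tracked.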