[jira] [Commented] (SAMZA-1824) Samza AM does not handle some failures during container launch

2018-09-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SAMZA-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16609490#comment-16609490
 ] 

ASF GitHub Bot commented on SAMZA-1824:
---

GitHub user asfgit closed the pull request at:

https://github.com/apache/samza/pull/615


> Samza AM does not handle some failures during container launch
> --
>
> Key: SAMZA-1824
> URL: https://issues.apache.org/jira/browse/SAMZA-1824
> Project: Samza
>  Issue Type: Bug
>Reporter: Jagadish
>Priority: Major
>
> I noticed today that the AM fails to allocate a new container when a 
> container fails to start. The AM gets a callback for the failure during 
> container startup, but the container is never rescheduled by the AM. The 
> logs suggest that the AM did request the start of container 
> `container_e20_1528615592911_1987_02_62`, as seen below:
>  {code}
> 2018-08-06 15:39:46.369 [Container Allocator Thread] 
> YarnClusterResourceManager [INFO] Received launch request for 12 on hostname 
> lca1-app0596.stg.linkedin.com
> 2018-08-06 15:39:46.974 [Container Allocator Thread] 
> YarnClusterResourceManager [INFO] Got available container ID (12) for 
> container: Container: [ContainerId: 
> container_e20_1528615592911_1987_02_62, NodeId: 
> lca1-app0596.stg.linkedin.com:1158, NodeHttpAddress: 
> lca1-app0596.stg.linkedin.com:8042, Resource: , 
> Priority: 1, Token: Token { kind: ContainerToken, service: 
> 10.251.166.210:1158 }, ]
> 2018-08-06 15:39:46.974 [Container Allocator Thread] 
> YarnClusterResourceManager [INFO] In runContainer in util: fwkPath= 
> ;cmdPath=./__package/;jobLib=
> 2018-08-06 15:39:46.974 [Container Allocator Thread] 
> YarnClusterResourceManager [INFO] Container ID 12 using command 
> ./__package//bin/run-container.sh
> 2018-08-06 15:39:46.975 [Container Allocator Thread] 
> YarnClusterResourceManager [INFO] Container ID 12 using environment variables:
> SAMZA_CONTAINER_ID=12
> EXECUTION_ENV_CONTAINER_ID=container_e20_1528615592911_1987_02_62
> SAMZA_COORDINATOR_URL=http://lca1-app1576.stg.linkedin.com:44222/
> JAVA_OPTS=-Xmx3072m -Dcom.linkedin.app.name=ad-web-analytics-event-aggregator 
> -Dcom.linkedin.app.env=ei-lca1
>  {code}
> However, because of an issue in which the localization phase fails 
> (currently under investigation), the container fails to start, and the 
> following exception is passed to the `onStartContainerError` callback in 
> `YarnClusterResourceManager`.
>  {code}
> 2018-08-06 15:46:15.257 
> [org.apache.hadoop.yarn.client.api.async.impl.NMClientAsyncImpl #6] 
> YarnClusterResourceManager [ERROR] Container: 
> container_e20_1528615592911_1987_02_62 could not start.
> java.net.ConnectException: Call From 
> lca1-app1576.stg.linkedin.com/10.251.174.104 to 
> lca1-app0596.stg.linkedin.com:1158 failed on connection exception: 
> java.net.ConnectException: Connection refused; For more details see:  
> http://wiki.apache.org/hadoop/ConnectionRefused
> at sun.reflect.GeneratedConstructorAccessor374.newInstance(Unknown Source)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:791)
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:731)
> at org.apache.hadoop.ipc.Client.call(Client.java:1473)
> at org.apache.hadoop.ipc.Client.call(Client.java:1400)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
> at com.sun.proxy.$Proxy34.startContainers(Unknown Source)
> at 
> org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:96)
> at sun.reflect.GeneratedMethodAccessor19.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> at com.sun.proxy.$Proxy35.startContainers(Unknown Source)
> at 
> org.apache.hadoop.yarn.client.api.impl.NMClientImpl.startContainer(NMClientImpl.java:201)
> at 
> org.apache.hadoop.yarn.client.api.async.impl.NMClientAsyncImpl$StatefulContainer$StartContainerTransition.transition(NMClientAsyncImpl.java:377)
> at 
> org.apache.hadoop.yarn.client.api.async.impl.NMClientAsyncImpl$StatefulContainer$StartContainerTransition.transition(NMClientAsyncImpl.java:363)
> at 
> 

[jira] [Commented] (SAMZA-1824) Samza AM does not handle some failures during container launch

2018-08-24 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SAMZA-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16591946#comment-16591946
 ] 

ASF GitHub Bot commented on SAMZA-1824:
---

GitHub user vjagadish1989 opened a pull request:

https://github.com/apache/samza/pull/615

SAMZA-1824: Handle errors from the async-NMClient when launching containers

- Updated the internal state that tracks "pending" containers so that it is maintained correctly when a launch fails
- Refactored `YarnClusterResourceManager` for testability and added a unit test
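
The first bullet above can be pictured with a minimal sketch. This is not the code from the pull request: the class `PendingLaunchTracker`, its method names, and the `onLaunchFailed` callback are hypothetical and only illustrate the idea. A container that fails to start is dropped from the "pending" state, and the failure is reported so that a replacement can be requested.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

// Hypothetical sketch; names here are assumptions, not Samza or YARN APIs.
class PendingLaunchTracker {
  // YARN container ID -> Samza container ID, for launches requested but not yet confirmed.
  private final Map<String, String> pendingLaunches = new ConcurrentHashMap<>();
  // Called with the Samza container ID whose launch failed, so the caller can re-request it.
  private final Consumer<String> onLaunchFailed;

  PendingLaunchTracker(Consumer<String> onLaunchFailed) {
    this.onLaunchFailed = onLaunchFailed;
  }

  // Record a launch request before handing it to the asynchronous client.
  void launchRequested(String yarnContainerId, String samzaContainerId) {
    pendingLaunches.put(yarnContainerId, samzaContainerId);
  }

  // The container started: it is no longer pending.
  void startSucceeded(String yarnContainerId) {
    pendingLaunches.remove(yarnContainerId);
  }

  // The container could not start: clear the pending entry and report the failure.
  // A second error for the same container is ignored because the entry is already gone.
  void startFailed(String yarnContainerId, Throwable cause) {
    String samzaContainerId = pendingLaunches.remove(yarnContainerId);
    if (samzaContainerId != null) {
      onLaunchFailed.accept(samzaContainerId);
    }
  }

  int pendingCount() {
    return pendingLaunches.size();
  }
}
```

In such a design, `startFailed` would be driven from an error callback like `onStartContainerError`, and the consumer would hand the failed Samza container ID back to whatever component requests new resources.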

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/vjagadish1989/samza container-launch-error

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/samza/pull/615.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #615


commit ad5436a4bdce69fb4ed072ddb8882aa631d2ddb6
Author: Jagadish 
Date:   2018-05-30T22:06:03Z

Add logging for EventHubs configs

commit 4976d2157bdd9146c7df8950eed46dce26bb95de
Author: Jagadish 
Date:   2018-05-30T22:49:46Z

Fix a checkstyle failure

commit 85de1b36752597bb4e147099c583dc6aeaef8eb6
Author: Jagadish 
Date:   2018-06-05T19:17:48Z

Merge branch 'master' of https://github.com/apache/samza

commit 2124d4ed1d9f6fcca753ce3fb3c74ceae097f616
Author: Jagadish 
Date:   2018-06-08T18:00:16Z

Merge branch 'master' of https://github.com/apache/samza

commit 3d03e3b9af7094bc7026922d22b3104943dd4343
Author: Jagadish 
Date:   2018-06-11T17:49:01Z

Merge branch 'master' of https://github.com/apache/samza

commit 7db95735d9e517193edfc594ba11d05eefeefc3f
Author: Jagadish 
Date:   2018-06-14T03:45:53Z

Add metric for effectiveness of host-affinity

commit 86332640e380817d4515f82afa9f893c0bb82976
Author: Jagadish 
Date:   2018-06-19T00:01:14Z

Merge branch 'master' of https://github.com/apache/samza

commit ba7861b1182fe9442b10a7c1f4993da83efda9d9
Author: Jagadish 
Date:   2018-07-24T01:28:43Z

Merge branch 'master' of https://github.com/apache/samza

commit de604d5c1f7a087be40051fd1615089194c22945
Author: Jagadish 
Date:   2018-07-24T01:33:38Z

Minor: Disable flaky samza-yarn test. Tracked in SAMZA-1781

commit baf51c08b45b7c22a0d4cfaea57f0ab2efffc84a
Author: Jagadish 
Date:   2018-07-30T16:26:45Z

Merge branch 'master' of https://github.com/apache/samza

commit 969d69ac33566b265fd7e5a64e38ffd39eb95510
Author: Jagadish 
Date:   2018-08-08T00:39:28Z

Merge branch 'master' of https://github.com/apache/samza

commit 1beb34d9c2fd9c0aed9b4d23c0963f101b76322e
Author: Jagadish 
Date:   2018-08-16T00:42:05Z

Merge branch 'master' of https://github.com/apache/samza

commit c95ecbadd0d5acfc1b4457566df3141606d1e012
Author: Jagadish 
Date:   2018-08-24T02:05:26Z

Merge branch 'master' of https://github.com/apache/samza

commit cffe2d9d644d3a2bb4ed4d0e87e9f46974685c5f
Author: Jagadish 
Date:   2018-08-24T03:51:17Z

Handle errors during container launch

commit d11837509357212cc1bf85a1a9c670d5a8a75afc
Author: Jagadish 
Date:   2018-08-24T03:53:12Z

Add a unit test for verifying launch failures



