[ https://issues.apache.org/jira/browse/SAMZA-1079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jagadish updated SAMZA-1079:
----------------------------
       Assignee: Jagadish
    Description: 
Localization refers to downloading the resources that a container needs in 
order to execute. These can include executables (binaries, jar files, etc.) or 
other resource files the container needs at runtime. The NM interacts with the 
HttpFileSystem to fetch these resources.
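
For context, a job typically wires in the HttpFileSystem via configuration 
along these lines (a hedged sketch; the host and path are placeholders):
{code}
# Register Samza's HttpFileSystem for the http:// scheme so the NM can
# localize packages served over HTTP (host/path below are placeholders).
fs.http.impl=org.apache.samza.util.hadoop.HttpFileSystem
yarn.package.path=http://artifact-host.example.com/path/to/samza-job-package.tar.gz
{code}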

When the connection to the HttpFileSystem is flaky, localization should fail 
gracefully with a timeout instead of hanging forever. At LinkedIn, we have seen 
several jobs in our cluster hang indefinitely in this phase. The error is very 
subtle because YARN localization happens in a separate process called the 
"ContainerLocalizer".

Based on our investigation, here are the relevant stack traces:
{code}
"ContainerLocalizer Downloader" #27 prio=5 os_prio=0 tid=0x00007fa8252f6000 
nid=0x49b6 runnable [0x00007fa7b959d000]
   java.lang.Thread.State: RUNNABLE
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)...
    - locked <0x000000008022ca40> (a java.io.BufferedInputStream)
    at 
org.apache.commons.httpclient.ContentLengthInputStream.read(ContentLengthInputStream.java:143)
    at java.io.FilterInputStream.read(FilterInputStream.java:83)...
    at 
org.apache.commons.httpclient.AutoCloseInputStream.read(AutoCloseInputStream.java:88)
    at 
org.apache.samza.util.hadoop.HttpInputStream.read(HttpInputStream.scala:39)
    - locked <0x000000008022db10> (a java.lang.Object)...
    at java.lang.Thread.run(Thread.java:745)

{code}

Investigating heap dumps of the NM and the state of its data structures 
revealed a hung socket; the lsof output shows the connection to the artifact 
server stuck in the ESTABLISHED state:
{code}
java      18781  app  206r  IPv6 2460015845      0t0  TCP ltx1-app0228.stg.linkedin.com:60981->ltx1-artifactory-vip-2.stg.linkedin.com:8081 (ESTABLISHED)
{code}

The NM threads that consume the STDOUT and STDERR of the ContainerLocalizer are 
blocked waiting for the ContainerLocalizer to finish downloading. (This is not 
surprising, since the pipe to the child process has not yet been closed and 
there is no new data to read.)
{code}
         "LocalizerRunner for container_e03_1481261762048_0541_02_000060" 
#2335967 prio=5 os_prio=0 tid=0x00007f993c913800 nid=0x4fa4 runnable 
[0x00007f9929d6f000]
java.lang.Thread.State: RUNNABLE
     at java.io.FileInputStream.readBytes(Native Method)
     at java.io.FileInputStream.read(FileInputStream.java:255)..
     - locked <0x00000000c7185be0> (a 
java.lang.UNIXProcess$ProcessPipeInputStream)
     at 
org.apache.hadoop.util.Shell$ShellCommandExecutor.parseExecResult(Shell.java:786)
     at org.apache.hadoop.util.Shell.runCommand(Shell.java:568)..
     at 
org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.startLocalizer(LinuxContainerExecutor.java:237)
     at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:1113)
{code}

The fix is as follows:

Fix the HttpFileSystem to set timeouts for read calls. When a read times out, 
the ContainerLocalizer exits and its end of the pipe closes, which unblocks the 
NM thread stuck reading the localizer's STDOUT (the read sees EOF once the 
other end of the pipe closes). The NM will then notify the AM of the killed 
container, and the AM can make a new request to the RM for that container.
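
A minimal sketch of the intended change, assuming the Commons HttpClient 3.x 
API visible in the stack trace above (the object name, constant names, and 
values here are illustrative, not the committed patch):
{code}
import org.apache.commons.httpclient.HttpClient

object HttpFileSystemTimeouts {
  // Illustrative defaults; the real values should come from configuration.
  val ConnectionTimeoutMs = 30 * 1000
  val ReadTimeoutMs = 30 * 1000

  def newClient(): HttpClient = {
    val client = new HttpClient()
    val params = client.getHttpConnectionManager.getParams
    // Bound how long establishing the TCP connection may take.
    params.setConnectionTimeout(ConnectionTimeoutMs)
    // SO_TIMEOUT bounds each blocking read(); a stalled server now raises
    // java.net.SocketTimeoutException instead of hanging socketRead0 forever.
    params.setSoTimeout(ReadTimeoutMs)
    client
  }
}
{code}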

The fix must be tested carefully since this is on the critical path of every 
single container request.
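
As a starting point, a regression check for the timeout could look roughly 
like this (a sketch, not a committed test; it relies on a server socket that 
accepts connections but never responds):
{code}
import java.net.{ServerSocket, SocketTimeoutException}
import org.apache.commons.httpclient.HttpClient
import org.apache.commons.httpclient.methods.GetMethod

object ReadTimeoutCheck extends App {
  // The OS completes the TCP handshake into the listen backlog, but no bytes
  // are ever written back, so an untimed read would block forever.
  val server = new ServerSocket(0)
  val client = new HttpClient()
  client.getHttpConnectionManager.getParams.setConnectionTimeout(1000)
  client.getHttpConnectionManager.getParams.setSoTimeout(1000)

  val get = new GetMethod(s"http://127.0.0.1:${server.getLocalPort}/resource")
  try {
    client.executeMethod(get) // should fail within ~1s rather than hang
    println("unexpected: request completed")
  } catch {
    case _: SocketTimeoutException => println("read timed out as expected")
  } finally {
    get.releaseConnection()
    server.close()
  }
}
{code}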

> HttpFileSystem should timeout for blocking reads when localizing containers.
> ----------------------------------------------------------------------------
>
>                 Key: SAMZA-1079
>                 URL: https://issues.apache.org/jira/browse/SAMZA-1079
>             Project: Samza
>          Issue Type: Bug
>            Reporter: Jagadish
>            Assignee: Jagadish
>


