[ 
https://issues.apache.org/jira/browse/SAMZA-1079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15829413#comment-15829413
 ] 

ASF GitHub Bot commented on SAMZA-1079:
---------------------------------------

GitHub user vjagadish opened a pull request:

    https://github.com/apache/samza/pull/42

    SAMZA-1079: Add timeouts for reads from HttpFileSystem. Add tests.

    * Wrote a unit/integration test to simulate a stuck connection when reading 
binaries for the job.
    Other misc. changes:
    - Moved some debug log messages to be info for better debugging.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/vjagadish1989/samza http-fs

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/samza/pull/42.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #42
    
----
commit fcefa7a6cae4bb995b465b4e342f708a335e92b1
Author: vjagadish1989 <[email protected]>
Date:   2017-01-19T05:58:59Z

    SAMZA-1079: Add timeouts for reads from HttpFileSystem. Add unit tests.
    Other misc. changes:
    - Moved some debug log messages to be info for better debugging.

----


> HttpFileSystem should timeout for blocking reads when localizing containers.
> ----------------------------------------------------------------------------
>
>                 Key: SAMZA-1079
>                 URL: https://issues.apache.org/jira/browse/SAMZA-1079
>             Project: Samza
>          Issue Type: Bug
>            Reporter: Jagadish
>            Assignee: Jagadish
>
> Localizing refers to downloading of resources that a container needs to 
> execute. This could include executables (binaries, jar files etc.) or other 
> resource files that a container needs when it runs. The NM interacts with the 
> HttpFileSystem to fetch the resources.
> When there are flaky connection issues to the HttpFileSystem, we should 
> graciously fail localizing with a timeout (instead of hanging the localizing 
> phase forever). At LinkedIn, we have encountered issues with several jobs in 
> our cluster hanging indefinitely. This error is very subtle because Yarn 
> localization happens in a separate process called "ContainerLocalizer".
> Based on investigation here are the relevant stack traces:
> {code}
> "ContainerLocalizer Downloader" #27 prio=5 os_prio=0 tid=0x00007fa8252f6000 
> nid=0x49b6 runnable [0x00007fa7b959d000]
>    java.lang.Thread.State: RUNNABLE
>     at java.net.SocketInputStream.socketRead0(Native Method)
>     at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)...
>     - locked <0x000000008022ca40> (a java.io.BufferedInputStream)
>     at 
> org.apache.commons.httpclient.ContentLengthInputStream.read(ContentLengthInputStream.java:143)
>     at java.io.FilterInputStream.read(FilterInputStream.java:83)...
>     at 
> org.apache.commons.httpclient.AutoCloseInputStream.read(AutoCloseInputStream.java:88)
>     at 
> org.apache.samza.util.hadoop.HttpInputStream.read(HttpInputStream.scala:39)
>     - locked <0x000000008022db10> (a java.lang.Object)...
>     at java.lang.Thread.run(Thread.java:745)
> {code}
> Investigating heap dumps of the NM and the state of its data-structures 
> revealed a hung socket.
> {code}
> java      18781  app  206r  IPv6 zzz      0t0  TCP 
> ltx1-appzzz.stg.linkedin.com:nnnn->ltx1-artifactory.xxx.linkedin.com:nnnn 
> (ESTABLISHED)
> {code}
> The NM threads that consume the STDOUT and STDERR of the ContainerLocalizer 
> are blocked waiting for the ContainerLocalizer to finish download. (This is 
> not surprising since the pipe with the child process has not yet closed and 
> there is no new data to read).
> {code}
>          "LocalizerRunner for container_e03_1481261762048_0541_02_000060" 
> #2335967 prio=5 os_prio=0 tid=0x00007f993c913800 nid=0x4fa4 runnable 
> [0x00007f9929d6f000]
> java.lang.Thread.State: RUNNABLE
>      at java.io.FileInputStream.readBytes(Native Method)
>      at java.io.FileInputStream.read(FileInputStream.java:255)..
>      - locked <0x00000000c7185be0> (a 
> java.lang.UNIXProcess$ProcessPipeInputStream)
>      at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.parseExecResult(Shell.java:786)
>      at org.apache.hadoop.util.Shell.runCommand(Shell.java:568)..
>      at 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.startLocalizer(LinuxContainerExecutor.java:237)
>      at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:1113)
> {code}
> The fix is as follows:
> Fix the HttpFileSystem to provide timeouts for read calls. The socket time 
> out will cause the NM to shutdown the ContainerLocalizer. This will cause the 
> NM thread stuck on reading from the STDOUT of ContainerLocalizer to be 
> interrupted (since the other end of the pipe is now closed). It will later 
> trigger an AM notification for a killed container and the AM can make a new 
> request to the RM for that container.   
> The fix must be tested carefully since this is on the critical path of every 
> single container request.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to