[
https://issues.apache.org/jira/browse/SAMZA-1079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jagadish updated SAMZA-1079:
----------------------------
Fix Version/s: 0.12.0
> HttpFileSystem should timeout for blocking reads when localizing containers.
> ----------------------------------------------------------------------------
>
> Key: SAMZA-1079
> URL: https://issues.apache.org/jira/browse/SAMZA-1079
> Project: Samza
> Issue Type: Bug
> Reporter: Jagadish
> Assignee: Jagadish
> Fix For: 0.12.0
>
>
> Localizing refers to downloading of resources that a container needs to
> execute. This could include executables (binaries, jar files etc.) or other
> resource files that a container needs when it runs. The NM interacts with the
> HttpFileSystem to fetch the resources.
> When there are flaky connection issues to the HttpFileSystem, we should
> graciously fail localizing with a timeout (instead of hanging the localizing
> phase forever). At LinkedIn, we have encountered issues with several jobs in
> our cluster hanging indefinitely. This error is very subtle because Yarn
> localization happens in a separate process called "ContainerLocalizer".
> Based on investigation here are the relevant stack traces:
> {code}
> "ContainerLocalizer Downloader" #27 prio=5 os_prio=0 tid=0x00007fa8252f6000
> nid=0x49b6 runnable [0x00007fa7b959d000]
> java.lang.Thread.State: RUNNABLE
> at java.net.SocketInputStream.socketRead0(Native Method)
> at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)...
> - locked <0x000000008022ca40> (a java.io.BufferedInputStream)
> at
> org.apache.commons.httpclient.ContentLengthInputStream.read(ContentLengthInputStream.java:143)
> at java.io.FilterInputStream.read(FilterInputStream.java:83)...
> at
> org.apache.commons.httpclient.AutoCloseInputStream.read(AutoCloseInputStream.java:88)
> at
> org.apache.samza.util.hadoop.HttpInputStream.read(HttpInputStream.scala:39)
> - locked <0x000000008022db10> (a java.lang.Object)...
> at java.lang.Thread.run(Thread.java:745)
> {code}
> Investigating heap dumps of the NM and the state of its data-structures
> revealed a hung socket.
> {code}
> java 18781 app 206r IPv6 zzz 0t0 TCP
> ltx1-appzzz.stg.linkedin.com:nnnn->ltx1-artifactory.xxx.linkedin.com:nnnn
> (ESTABLISHED)
> {code}
> The NM threads that consume the STDOUT and STDERR of the ContainerLocalizer
> are blocked waiting for the ContainerLocalizer to finish download. (This is
> not surprising since the pipe with the child process has not yet closed and
> there is no new data to read).
> {code}
> "LocalizerRunner for container_e03_1481261762048_0541_02_000060"
> #2335967 prio=5 os_prio=0 tid=0x00007f993c913800 nid=0x4fa4 runnable
> [0x00007f9929d6f000]
> java.lang.Thread.State: RUNNABLE
> at java.io.FileInputStream.readBytes(Native Method)
> at java.io.FileInputStream.read(FileInputStream.java:255)..
> - locked <0x00000000c7185be0> (a
> java.lang.UNIXProcess$ProcessPipeInputStream)
> at
> org.apache.hadoop.util.Shell$ShellCommandExecutor.parseExecResult(Shell.java:786)
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:568)..
> at
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.startLocalizer(LinuxContainerExecutor.java:237)
> at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:1113)
> {code}
> The fix is as follows:
> Fix the HttpFileSystem to provide timeouts for read calls. The socket time
> out will cause the NM to shutdown the ContainerLocalizer. This will cause the
> NM thread stuck on reading from the STDOUT of ContainerLocalizer to be
> interrupted (since the other end of the pipe is now closed). It will later
> trigger an AM notification for a killed container and the AM can make a new
> request to the RM for that container.
> The fix must be tested carefully since this is on the critical path of every
> single container request.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)