[
https://issues.apache.org/jira/browse/HADOOP-10146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13845557#comment-13845557
]
Daryn Sharp commented on HADOOP-10146:
--------------------------------------
Anyone want to review? After moving to JDK7 in production, we had many NMs
under load going OOM and crashing due to this bug. Task retries masked the
fact that the cluster was slowly shrinking. As noted above, we've been running
production clusters for 8 months with this patch.
> Workaround JDK7 Process fd close bug
> ------------------------------------
>
> Key: HADOOP-10146
> URL: https://issues.apache.org/jira/browse/HADOOP-10146
> Project: Hadoop Common
> Issue Type: Bug
> Components: util
> Affects Versions: 0.23.0, 2.0.0-alpha, 3.0.0
> Reporter: Daryn Sharp
> Assignee: Daryn Sharp
> Priority: Critical
> Attachments: HADOOP-10129.branch-23.patch, HADOOP-10129.patch
>
>
> JDK7's {{Process}} output streams have an async fd-close race bug. This
> manifests as commands run via o.a.h.u.Shell causing threads to hang, OOM, or
> exhibit other bizarre behavior. The NM is likely to encounter the bug under
> heavy load.
> Specifically, {{ProcessBuilder}}'s {{UNIXProcess}} starts a thread to reap
> the process and drain stdout/stderr to avoid a lingering zombie process. A
> race occurs if the thread using the stream closes it and the underlying fd is
> recycled/reopened while the reaper is still draining it.
> {{ProcessPipeInputStream.drainInputStream}} will OOM allocating an array if
> {{in.available()}} returns a huge number, or may wreak havoc by incorrectly
> draining the recycled fd.
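
To illustrate the allocation hazard described above, here is a minimal sketch (not the attached patch, and not the JDK's own code) of draining a stream through a fixed-size buffer, so that a bogus {{available()}} value can never determine the size of an allocation. The class name {{BoundedDrain}}, the {{drain}} helper, and the 8192-byte chunk size are hypothetical names and values chosen for this example only.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class BoundedDrain {
    // Hypothetical helper: reads until EOF using a fixed-size buffer.
    // It never consults available(), so a wrong value reported by a
    // recycled fd cannot trigger a huge array allocation.
    public static byte[] drain(InputStream in, int chunkSize) throws IOException {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[chunkSize];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Simulate a stream whose available() reports a huge number,
        // standing in for the misbehaving fd from the description.
        InputStream lying = new ByteArrayInputStream(
                "hello".getBytes(StandardCharsets.UTF_8)) {
            @Override
            public int available() {
                return Integer.MAX_VALUE;
            }
        };
        byte[] data = drain(lying, 8192);
        System.out.println(new String(data, StandardCharsets.UTF_8));
    }
}
```

This only addresses the OOM symptom. It does not prevent the reaper from reading a recycled fd, which is the underlying race, so it should be read as an illustration of the failure mode rather than a fix.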