[
https://issues.apache.org/jira/browse/MAPREDUCE-5584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jason Lowe updated MAPREDUCE-5584:
----------------------------------
Priority: Blocker (was: Major)
The reducers were timing out attempting to contact certain nodes for their map
inputs. Simple GET probes to the shuffle port on these nodes showed that they
were indeed totally unresponsive. Examination of the nodes showed that they
had leaked a significant number of file descriptors with sockets in the
CLOSE_WAIT state.
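
A probe along these lines can be sketched in Java. This is a minimal sketch and not the tool used in the investigation: the class name ShuffleProbe, the "/" request path, and the 5 second timeout are hypothetical, and 13562 is assumed to be the configured value of mapreduce.shuffle.port. Any HTTP status counts as responsive; a connect or read timeout counts as unresponsive.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

final class ShuffleProbe {
    /**
     * Sends a plain GET to host:port and reports whether any HTTP response
     * arrives within timeoutMs. The "/" path is an assumption; the point is
     * only to see whether the port answers at all.
     */
    static boolean isResponsive(String host, int port, int timeoutMs) {
        HttpURLConnection conn = null;
        try {
            URL url = new URL("http", host, port, "/");
            conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("GET");
            conn.setConnectTimeout(timeoutMs);
            conn.setReadTimeout(timeoutMs);
            conn.getResponseCode(); // any status code means the handler answered
            return true;
        } catch (IOException e) {
            // Covers refused connections as well as connect/read timeouts.
            return false;
        } finally {
            if (conn != null) {
                conn.disconnect();
            }
        }
    }

    public static void main(String[] args) {
        String host = args.length > 0 ? args[0] : "localhost";
        // Assumed shuffle port; pass the real one as the second argument.
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 13562;
        System.out.println(host + ":" + port + " responsive="
                + isResponsive(host, port, 5000));
    }
}
```

A node in the state described above would accept the TCP connection but never send a response, so the read timeout fires and the probe reports false.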
The jstacks of the NodeManager processes on these nodes also showed that all of
the Netty handlers were stuck somewhere in
LocalDirAllocator.getLocalPathToRead. Each handler was either blocked on the
synchronized lock or waiting for fs.exists() to return, a call that forks and
execs {{stat}} since HADOOP-9652.
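
The effect of a slow call made while holding a shared lock can be illustrated with a small, self-contained sketch. This is not the LocalDirAllocator code: SlowAllocator, pathExists, and the sleep that stands in for a fork/exec of stat are all hypothetical. It only shows that when every handler thread must pass through one synchronized method, the slow calls run one after another rather than in parallel.

```java
import java.util.ArrayList;
import java.util.List;

final class LockContentionDemo {
    /** Hypothetical stand-in for an allocator whose lookup holds one lock. */
    static final class SlowAllocator {
        private final long slowMs;

        SlowAllocator(long slowMs) {
            this.slowMs = slowMs;
        }

        /** The sleep simulates a slow existence check made under the lock. */
        synchronized boolean pathExists(String path) throws InterruptedException {
            Thread.sleep(slowMs); // sleeping does not release the monitor
            return true;
        }
    }

    /** Starts the given number of handler threads and returns elapsed millis. */
    static long runHandlers(int handlers, long slowMs) throws InterruptedException {
        SlowAllocator allocator = new SlowAllocator(slowMs);
        List<Thread> threads = new ArrayList<>();
        long start = System.nanoTime();
        for (int i = 0; i < handlers; i++) {
            final String path = "map_" + i + ".out";
            Thread t = new Thread(() -> {
                try {
                    allocator.pathExists(path);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            threads.add(t);
            t.start();
        }
        for (Thread t : threads) {
            t.join();
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("elapsed ms: " + runHandlers(4, 50));
    }
}
```

With four handlers and a 50 ms check, the total is roughly four times the single-call latency, which is how a handful of slow existence checks can stall every handler in the pool.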
> ShuffleHandler becomes unresponsive during gridmix runs and can leak file
> descriptors
> -------------------------------------------------------------------------------------
>
> Key: MAPREDUCE-5584
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5584
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Affects Versions: 2.3.0
> Reporter: Jason Lowe
> Priority: Blocker
>
> While running gridmix on 2.3 we noticed that jobs were running much slower
> than normal. We tracked this down to reducers having difficulty shuffling
> data from maps. Details to follow.