[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe updated MAPREDUCE-5584:
----------------------------------

    Priority: Blocker  (was: Major)

The reducers were timing out attempting to contact certain nodes for their map 
inputs.  Simple GET probes to the shuffle port on these nodes showed that they 
were indeed totally unresponsive.  Examination of the nodes showed that they 
had leaked a significant number of file descriptors with sockets in the 
CLOSE_WAIT state.
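
For anyone wanting to repeat the probe, here is a minimal sketch of the kind of check described above. It is not the tool that was actually used; the class name, method name, and timeout are hypothetical, and the host and shuffle port must be supplied by the caller.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Hadoop: sends a bare GET to a port and
// reports whether any byte of a response arrives before the timeout.
public class ShuffleProbe {

    static boolean responds(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            // Bound both the connect and the read so a hung server cannot hang the probe.
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            socket.setSoTimeout(timeoutMs);

            OutputStream out = socket.getOutputStream();
            out.write("GET / HTTP/1.0\r\n\r\n".getBytes(StandardCharsets.US_ASCII));
            out.flush();

            // Any response byte, even from an HTTP error reply, means a handler is alive.
            return socket.getInputStream().read() != -1;
        } catch (IOException e) {
            // Refused connection, connect timeout, or read timeout.
            return false;
        }
    }

    public static void main(String[] args) {
        String host = args[0];
        int port = Integer.parseInt(args[1]);
        System.out.println(host + ":" + port + " responds=" + responds(host, port, 5000));
    }
}
```

A node in the state described here would be expected to accept the connection and then never answer, so the probe returns false only after the read timeout expires.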

The jstacks of the NodeManager processes on these nodes also showed that all of 
the Netty handlers were stuck in LocalDirAllocator.getLocalPathToRead.  Each 
handler was either blocked on the synchronized lock or waiting for fs.exists() 
to return, which forks and execs {{stat}} as of HADOOP-9652.
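
The effect of a slow check performed under a single lock can be illustrated with a small standalone sketch. This is not Hadoop code: the class below is a hypothetical stand-in for the allocator, and the sleep is only an assumed placeholder for the cost of forking and exec'ing a process.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration: several handler threads calling one
// synchronized method whose body is slow.
public class LockContentionSketch {
    private final long checkMillis;

    LockContentionSketch(long checkMillis) {
        this.checkMillis = checkMillis;
    }

    // Only one caller at a time can be inside; the sleep stands in for an
    // expensive existence check such as forking an external command.
    synchronized boolean lookup(String path) throws InterruptedException {
        Thread.sleep(checkMillis);
        return true;
    }

    // Runs one lookup per handler thread and returns the wall-clock time taken.
    static long elapsedMillis(int handlers, long checkMillis) {
        LockContentionSketch allocator = new LockContentionSketch(checkMillis);
        ExecutorService pool = Executors.newFixedThreadPool(handlers);
        try {
            long start = System.nanoTime();
            List<Future<Boolean>> results = new ArrayList<>();
            for (int i = 0; i < handlers; i++) {
                String path = "output/" + i;  // made-up path, for illustration only
                results.add(pool.submit(() -> allocator.lookup(path)));
            }
            for (Future<Boolean> result : results) {
                result.get();
            }
            return (System.nanoTime() - start) / 1_000_000;
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("elapsed ms: " + elapsedMillis(8, 50));
    }
}
```

Although every handler has its own thread, the lookups complete one after another, so total time grows with the number of handlers multiplied by the cost of the check. With a fixed-size handler pool, that is enough to leave no thread free to serve new requests.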

> ShuffleHandler becomes unresponsive during gridmix runs and can leak file 
> descriptors
> -------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-5584
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5584
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 2.3.0
>            Reporter: Jason Lowe
>            Priority: Blocker
>
> While running gridmix on 2.3 we noticed that jobs were running much slower 
> than normal.  We tracked this down to reducers having difficulty shuffling 
> data from maps.  Details to follow.



--
This message was sent by Atlassian JIRA
(v6.1#6144)
