[ http://issues.apache.org/jira/browse/HADOOP-141?page=comments#action_12375204 ]
paul sutter commented on HADOOP-141:
------------------------------------

As it turns out, my changes did not fix the problem, just changed the timing. The thrashing was occurring because one reducer was in the merge phase while the other reducer was in the file copy phase. The particular file that was failing was being copied from the local system. I have the concurrent merges set to 24 and the task count set to 4.

I added logging statements, and the file was clearly being received in full by MapOutputFile, yet ReduceTaskRunner was getting a timeout on that file about 1 minute and 20 seconds later. It would then request the file again and again, and each time it received the file in full yet got a timeout just over a minute later.

I did find two interesting bugs in RPC.java while trying to track this down (which I'm filing separately), but for now I am completely stumped. At the moment the cluster is otherwise busy, so I can't do any more experiments until perhaps tomorrow. Any suggestions would be very welcome.

We are using Linux, and I'll try the commands you suggested when I'm able to recreate it, but for now this does not look like a disk or TCP problem; it really looks like an RPC scheduling problem.

> Disk thrashing / task timeouts during map output copy phase
> -----------------------------------------------------------
>
>          Key: HADOOP-141
>          URL: http://issues.apache.org/jira/browse/HADOOP-141
>      Project: Hadoop
>         Type: Bug
>   Components: mapred
>  Environment: linux
>     Reporter: paul sutter
>
> MapOutputProtocol connections cause timeouts because of system thrashing and
> transferring the same file over and over again, ultimately leading to making
> no forward progress (medium-sized job, 500GB input file, map output about as
> large as the input, 10-node cluster).
>
> There are several bugs behind this, but the following two changes improved
> matters considerably.
>
> (1)
> The buffer size in MapOutputFile is currently hardcoded to 8192 bytes (for
> both reads and writes). By changing this buffer size to 256KB, the number of
> disk seeks is reduced and the problem went away.
> Ideally there would be a buffer size parameter for this that is separate from
> the DFS io buffer size.
>
> (2)
> I also added the following code to the socket configuration in both
> Server.java and Client.java. Disabling linger is a modestly good idea in an
> environment with some packet loss (and you will have that when all the nodes
> get busy at once), but 256KB buffers are probably excessive, especially on a
> LAN; it takes me two hours to test changes, so I haven't experimented.
>
>     socket.setSendBufferSize(256*1024);
>     socket.setReceiveBufferSize(256*1024);
>     socket.setSoLinger(false, 0);
>     socket.setKeepAlive(true);
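
Regarding point (1) in the quoted description: a minimal sketch of what a dedicated buffer-size parameter might look like, assuming a hypothetical property name "mapred.mapoutput.buffer.size" (not an existing key) and Hadoop's Configuration.getInt accessor. This is illustrative only, not the actual MapOutputFile code.

    // Sketch only: a dedicated buffer size for map output reads/writes,
    // read from the job configuration instead of a hardcoded 8192 bytes.
    // The property name "mapred.mapoutput.buffer.size" is hypothetical.
    import java.io.BufferedInputStream;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;

    import org.apache.hadoop.conf.Configuration;

    public class MapOutputBufferSketch {

      // Keep the copy buffer size independent of the DFS io buffer size.
      static int getCopyBufferSize(Configuration conf) {
        return conf.getInt("mapred.mapoutput.buffer.size", 256 * 1024);
      }

      // Open a map output file with the larger buffer so sequential reads
      // become fewer, bigger disk requests (and so fewer seeks under load).
      static InputStream openMapOutput(Configuration conf, String path) throws IOException {
        return new BufferedInputStream(new FileInputStream(path), getCopyBufferSize(conf));
      }
    }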
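
Regarding point (2): the same four socket options, shown in a self-contained sketch with the imports they need and a comment on what each option does. The class and method names are illustrative and do not reflect the actual Client.java / Server.java structure.

    // Sketch only: the quoted socket options in a compilable context.
    import java.io.IOException;
    import java.net.Socket;

    public class SocketTuningSketch {

      static void tuneSocket(Socket socket) throws IOException {
        // Bigger kernel buffers keep more data in flight per connection;
        // as noted above, 256KB is probably more than a LAN needs.
        socket.setSendBufferSize(256 * 1024);
        socket.setReceiveBufferSize(256 * 1024);
        // SO_LINGER off: close() returns immediately rather than blocking
        // while unacknowledged data is retried under packet loss.
        socket.setSoLinger(false, 0);
        // Keep-alive probes help detect peers that silently vanish from
        // long-lived RPC connections.
        socket.setKeepAlive(true);
      }
    }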
