[ 
http://issues.apache.org/jira/browse/HADOOP-195?page=comments#action_12378307 ] 

paul sutter commented on HADOOP-195:
------------------------------------

Owen,
 
Here are a few things you can try for your sort performance issues:
 
(1) DON'T NAGLE
 
Nagle's algorithm can add 200ms to a TCP transaction, and could be accounting 
for a chunk of that average 385 milliseconds you're seeing. I think Nagle is 
enabled by default. You could try disabling Nagle with setTcpNoDelay on the 
RPC-server-side socket.

With 64,000 files, one Nagle per file could waste hours.

Note that by disabling Nagle, you'll get some runt packets (for example, the 
header and at the boundary of each socket write). This should not be a material 
performance issue, and you can reduce it arbitrarily by increasing the size of 
the buffer (right now it's a hardcoded 8KB buffer).
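
For illustration, a minimal sketch of the setTcpNoDelay call (the class and
helper names here are hypothetical; the real change would go wherever the RPC
server creates its connection sockets):

```java
import java.net.Socket;
import java.net.SocketException;

public class SocketTuning {
    // Hypothetical helper: apply the no-Nagle setting to a socket before
    // any data is written on it.
    static void disableNagle(Socket socket) throws SocketException {
        // TCP_NODELAY sends small segments immediately instead of waiting
        // (up to ~200ms) for an ACK or for a full segment to accumulate.
        socket.setTcpNoDelay(true);
    }

    public static void main(String[] args) throws Exception {
        Socket s = new Socket(); // not yet connected; options are still settable
        disableNagle(s);
        System.out.println(s.getTcpNoDelay()); // prints "true"
        s.close();
    }
}
```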
 
It would be great if Java allowed the use of TCP_CORK and sendfile(), since 
you could avoid runts and avoid a few buffer copies because sendfile does it 
all in the kernel. That's what HTTP servers do, which was your original 
suggestion.
 
Other TCP tuning could help. My intuition is that these wildly varying transfer 
times are related to packet loss, and I expect some level of tuning to help, 
but I haven't had time to look into it. Disabling linger should help when 
packets are lost in the closing handshake. I'm also an advocate of much larger 
buffers, although that isn't an issue for your particular case. Another minor 
issue (and not causing your particular performance problem) is that we should 
turn on keepalives with a short retry: without keepalives TCP does not 
_always_ detect hung sessions, and the current code that ignores socket 
timeouts can leave us with a collection of zombie sessions. But of course, 
that's not an issue for your situation.
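
A sketch of those last two settings, assuming plain java.net.Socket (the class 
name is made up). Note that standard Java only exposes the keepalive on/off 
switch; the short retry interval would have to come from OS configuration 
(e.g. the tcp_keepalive_* sysctls on Linux):

```java
import java.net.Socket;
import java.net.SocketException;

public class LivenessTuning {
    // Hypothetical helper combining the liveness-related options above.
    static void tune(Socket socket) throws SocketException {
        socket.setKeepAlive(true);    // eventually detect hung peers
        socket.setSoLinger(false, 0); // don't block in close() on unacked data
    }

    public static void main(String[] args) throws Exception {
        Socket s = new Socket();
        tune(s);
        System.out.println(s.getKeepAlive()); // prints "true"
        s.close();
    }
}
```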

(2) BIGGER FILES
 
Modern disks transfer at over 50MB/s, yet seek no faster than the drives used 
by the ancient Sumerians.
 
15KB takes less than 300 microseconds to transfer to or from a SATA disk. Since 
a seek probably takes more than 10 milliseconds, you're at least 97% seek-bound 
with a 15KB file. It's actually worse: how many disk seeks does it take to 
create a file and then delete it later? (I don't know.) 
 
Transferring even 1MB whenever you touch disk is 30-50% seek-bound, because the 
20ms it takes to transfer 1MB is in the same ballpark as the seek time of the 
drive (I get 10-20ms when I measure SATA seek). 

You might as well transfer 4MB or 16MB each time you touch disk, then your code 
will still work a few years from now.
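
The arithmetic above, as a toy calculation (the 10ms seek time and 50MB/s 
transfer rate are the assumed figures from this comment):

```java
public class SeekMath {
    static final double SEEK_MS = 10.0; // assumed average SATA seek time
    // Milliseconds to transfer 1KB at an assumed 50MB/s sequential rate.
    static final double XFER_MS_PER_KB = 1000.0 / (50.0 * 1024.0);

    // Fraction of total I/O time spent seeking, for one seek plus one
    // sequential transfer of `kb` kilobytes.
    static double seekFraction(double kb) {
        double transferMs = kb * XFER_MS_PER_KB;
        return SEEK_MS / (SEEK_MS + transferMs);
    }

    public static void main(String[] args) {
        System.out.printf("15KB: %.0f%% seek-bound%n", 100 * seekFraction(15));
        System.out.printf("1MB:  %.0f%% seek-bound%n", 100 * seekFraction(1024));
        System.out.printf("16MB: %.0f%% seek-bound%n", 100 * seekFraction(16 * 1024));
    }
}
```

With these numbers, 15KB comes out ~97% seek-bound, 1MB about a third, and 
16MB only a few percent.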
 
The increase in drive transfer speed comes from increasing data density: the 
more bits per inch, the more bits per second at a given rotational speed. Our 
grandfathers got away with using 4KB and 8KB block sizes in their JCL because 
data densities were just really low back then. 
 
(3) LESS THRASHING
 
We're moving terabytes, yet the sort path writes map output to disk and reads 
it back five times before the data reaches the reducer:
 
1 - mapper output, 
2 - copy individual map output files to reducer, 
3 - concatenate little files into big file, 
4 - write merge file when sort buffer is full, 
5 - write merged data to disk.
 
Steps 3 and 5 are the first opportunity for a speedup because they exist purely 
for programmer convenience. The sorter can read directly from a list of files 
instead of one file, and the reducer could call the merger directly to collect 
the input records instead of storing them to disk as an intermediate step.
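
Reading "directly from a list of files" is just a k-way merge. A toy sketch 
over in-memory sorted runs (the class name is made up, and the real merger 
would of course read records from files, not lists):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class KWayMerge {
    // Merge k sorted runs directly, without concatenating them into one
    // big file first. Heap entry: {value, runIndex, positionInRun}.
    static List<Integer> merge(List<List<Integer>> runs) {
        PriorityQueue<int[]> heap =
            new PriorityQueue<>(Comparator.comparingInt(e -> e[0]));
        for (int r = 0; r < runs.size(); r++)
            if (!runs.get(r).isEmpty())
                heap.add(new int[]{runs.get(r).get(0), r, 0});
        List<Integer> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] e = heap.poll();
            out.add(e[0]);
            List<Integer> run = runs.get(e[1]);
            if (e[2] + 1 < run.size())                  // advance within that run
                heap.add(new int[]{run.get(e[2] + 1), e[1], e[2] + 1});
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<Integer>> runs = Arrays.asList(
            Arrays.asList(1, 4, 7), Arrays.asList(2, 5), Arrays.asList(3, 6, 8));
        System.out.println(merge(runs)); // prints [1, 2, 3, 4, 5, 6, 7, 8]
    }
}
```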
 
Step 2 permits the mapper-to-reducer copy to be restarted if it fails, but 
strictly speaking that data only needs to go to disk when the sort buffer 
doesn't have enough space left to hold the entire file (it would be easy to 
back out of a failed transfer as long as the sort buffer doesn't fill up). 
Also, you could just start the sort early when you encounter a file that would 
fit in the sort buffer but there isn't enough room left for it. 
 
Step 4 only needs to occur when the data is too large to fit into RAM.
 
Whether or not step 1 needs to occur could be a matter of debate.
 
(4) BIG BUFFERS IN MERGE PHASE
 
The default 4KB buffer size is really bad for the merge phase (see suggestion 
2). Since the merger is reading from a bunch of merge files, from the 
perspective of the OS you are performing one read at a time from a 
randomly-chosen merge file, so it's unlikely the OS will figure out that you 
really want sequential access. 

It's a little daredevil to rely on Linux doing 1000-to-1 sequentialization 
anyway; you might as well just use a big buffer and be done with it. I'm not 
sure what the benefit of the small buffer is.
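
A sketch of the big-buffer fix, assuming plain java.io streams (the 4MB 
constant and helper name are made up; the real size would presumably be a 
config knob):

```java
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class MergeBuffers {
    // 4MB per merge input instead of the 4KB default (hypothetical constant).
    static final int MERGE_BUFFER = 4 * 1024 * 1024;

    // Each refill of this buffer is one large sequential read, so the
    // amortized seek cost per byte drops by a factor of ~1000.
    static InputStream openForMerge(File f) throws IOException {
        return new BufferedInputStream(new FileInputStream(f), MERGE_BUFFER);
    }

    public static void main(String[] args) throws IOException {
        File tmp = File.createTempFile("merge-run", ".dat");
        tmp.deleteOnExit();
        try (OutputStream out = new FileOutputStream(tmp)) {
            out.write(new byte[]{1, 2, 3});
        }
        try (InputStream in = openForMerge(tmp)) {
            System.out.println(in.read() + " " + in.read() + " " + in.read());
        }
    }
}
```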
 
(5) OTHER IMPROVEMENTS
 
The DFS write of the data coming out of the reducer currently writes a local 
copy of the data to disk, but this could be eliminated by using a 32MB RAM 
buffer instead of a disk file to recover from connection failures. 
 
In general, we'd be better off using NIO instead of stream IO because of fewer 
buffer copies and better control of buffering and endianness. Stream IO does 
at least two buffer copies of the data: one in the kernel because of the use 
of the disk cache, and at least one in Java. Yes, I realize that this would be 
a major rewrite and less important than the others mentioned. 
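
A small sketch of the NIO style: channels plus a ByteBuffer with an explicit 
byte order, using a direct buffer on the read side so the kernel can fill it 
without the extra Java-heap copy (class and method names are made up):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class NioRecordIo {
    // Write one long with an explicit big-endian byte order through a channel.
    static void writeLong(Path p, long v) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(8).order(ByteOrder.BIG_ENDIAN);
        buf.putLong(v);
        buf.flip();
        try (FileChannel ch = FileChannel.open(p, StandardOpenOption.WRITE)) {
            while (buf.hasRemaining()) ch.write(buf);
        }
    }

    // Read it back into a direct buffer, avoiding one Java-side copy
    // relative to an InputStream read into a byte[].
    static long readLong(Path p) throws IOException {
        ByteBuffer buf = ByteBuffer.allocateDirect(8).order(ByteOrder.BIG_ENDIAN);
        try (FileChannel ch = FileChannel.open(p, StandardOpenOption.READ)) {
            while (buf.hasRemaining() && ch.read(buf) >= 0) { }
        }
        buf.flip();
        return buf.getLong();
    }

    public static void main(String[] args) throws IOException {
        Path p = Files.createTempFile("nio", ".dat");
        writeLong(p, 1234567890L);
        System.out.println(readLong(p)); // prints 1234567890
        Files.delete(p);
    }
}
```

(FileChannel.transferTo is also the closest thing Java has to sendfile(), 
which is relevant to suggestion 1 as well.)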
 
It would be interesting to count the buffer copies in the current sort path, 
considering that you get two buffer copies when you write the disk and two 
when you read it, and you have five extra visits to disk plus the network 
transfer. 


> transfer map output transfer with http instead of rpc
> -----------------------------------------------------
>
>          Key: HADOOP-195
>          URL: http://issues.apache.org/jira/browse/HADOOP-195
>      Project: Hadoop
>         Type: Improvement

>   Components: mapred
>     Versions: 0.2
>     Reporter: Owen O'Malley
>     Assignee: Owen O'Malley
>      Fix For: 0.3

>
> The data transfer of the map output should be transferred via http instead 
> of rpc, because rpc is very slow for this application and the timeout 
> behavior is suboptimal. (The server sends data and the client ignores it 
> because it took more than 10 seconds to be received.)

-- 
This message is automatically generated by JIRA.
