[ 
http://issues.apache.org/jira/browse/HADOOP-195?page=comments#action_12378335 ] 

paul sutter commented on HADOOP-195:
------------------------------------


dominek,

buffer copies are nowhere near a bottleneck in hadoop, yet. right now we have 
lots of wins just from getting our buffering right.

reducing buffer copies only matters when buffer copies are a bottleneck. you 
would have to use a profiler to see how much time was being spent in your 
serialization/deserialization code, for example. if your code is the 
bottleneck, then reducing buffer copies might not matter. how long are your 
requests? if they are small, it's not likely to matter. if they are gigabytes, 
then it could matter a lot.

other questions about your use of NIO:
- did you try using the native endian-ness with NIO? or the default? (the 
default is big-endian, the evil Sun endian-ness, whatever your cpu uses)
- are you using direct buffers, or indirect? (indirect buffers still cost you a 
buffer copy in user space)
- are you using memory mapping, or buffered io? (buffered io costs you a buffer 
copy in kernel space)
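
to make the three questions above concrete, here is a minimal sketch of each choice in plain NIO. the class and method names (NioChoices, directNativeBuffer, mapReadOnly) are made up for illustration and are not anything in hadoop, and it assumes a JDK recent enough to have java.nio.file. it shows where the knobs are, nothing more; whether any of them helps is a question for a profiler.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper class, for illustration only.
final class NioChoices {

    // A freshly allocated ByteBuffer is big-endian no matter what the CPU is.
    // This one is also an indirect (heap) buffer.
    static ByteBuffer heapBufferDefaultOrder(int size) {
        return ByteBuffer.allocate(size);
    }

    // Direct buffer switched to the platform's native byte order, so
    // multi-byte gets and puts need no byte swapping on the local CPU.
    static ByteBuffer directNativeBuffer(int size) {
        return ByteBuffer.allocateDirect(size).order(ByteOrder.nativeOrder());
    }

    // Memory-map a whole file read-only. The mapping stays valid after the
    // channel is closed.
    static MappedByteBuffer mapReadOnly(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            return ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("default order: " + heapBufferDefaultOrder(8).order());
        System.out.println("native order:  " + ByteOrder.nativeOrder());
        System.out.println("direct?        " + directNativeBuffer(8).isDirect());

        Path tmp = Files.createTempFile("nio-sketch", ".bin");
        tmp.toFile().deleteOnExit();
        Files.write(tmp, new byte[] {1, 2, 3, 4});
        System.out.println("mapped bytes:  " + mapReadOnly(tmp).remaining());
    }
}
```

note that a buffer in native order is only safe for data that never leaves machines of the same endian-ness; anything that goes over the wire or to disk for other readers needs one agreed byte order.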

of course, an honest-to-god unbuffered read is so much better than memory 
mapping. someone who is more of a unix guy could help you figure out which 
linux filesystem supports real unbuffered io, and how to make that happen from 
java. when you're memory mapped, it's hard to coerce the system into doing the 
multimegabyte double-buffered reads that you really want to do if you are 
interested in performance. you might have to use JNI to make that io fast. but 
again, it's only worthwhile if you know where the bottlenecks are. windows nt is 
popular among sort people because it's so easy to get honest-to-god 
unbuffered io.
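
for reference, the closest thing you get in pure java is reading in big blocks through a FileChannel into a direct buffer, roughly as sketched below. this is an assumption-laden illustration: BigBlockReader and readInBlocks are hypothetical names, the reads still go through the os page cache (so this is not unbuffered io), and it uses a single buffer rather than true double buffering, which would need a second buffer and a reader thread or async reads.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Hypothetical helper class, for illustration only.
final class BigBlockReader {

    // Read the whole file in blockSize-byte chunks and return the number of
    // bytes read. blockSize must be positive; several megabytes is the
    // kind of size the text has in mind.
    static long readInBlocks(Path file, int blockSize) throws IOException {
        ByteBuffer buf = ByteBuffer.allocateDirect(blockSize);
        long total = 0;
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            while (true) {
                buf.clear();
                int n = ch.read(buf);
                if (n < 0) {
                    break; // end of file
                }
                total += n;
                buf.flip();
                // a real consumer would process buf here
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // 8 MB blocks is an arbitrary choice for the sketch.
        long bytes = readInBlocks(Paths.get(args[0]), 8 * 1024 * 1024);
        System.out.println("read " + bytes + " bytes");
    }
}
```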

but again, none of that matters if you're not moving much data, or if you don't 
have a buffer copy bottleneck.

using the JNI interface you mention sounds interesting. of course, if we're 
going to go non-pure-java, we might as well use owen's idea of an http server 
to serve up the map output data, since that server will already be tuned. we're 
using lighttpd here and getting super good performance (for a different 
application of course).

i'm super glad there's an interest in performance here!

paul

> transfer map output transfer with http instead of rpc
> -----------------------------------------------------
>
>          Key: HADOOP-195
>          URL: http://issues.apache.org/jira/browse/HADOOP-195
>      Project: Hadoop
>         Type: Improvement

>   Components: mapred
>     Versions: 0.2
>     Reporter: Owen O'Malley
>     Assignee: Owen O'Malley
>      Fix For: 0.3

>
> The data transfer of the map output should be transferred via http instead 
> of rpc, because rpc is very slow for this application and the timeout behavior 
> is suboptimal. (server sends data and client ignores it because it took more 
> than 10 seconds to be received.)
