I can confirm this. I routinely read and process 100 Mbytes/second with very modest hardware on fewer than 10 machines. This includes decompression, decryption, and expensive parsing that typically saturate the CPUs. I use gigabit ethernet, but otherwise have very similar hardware.
On 11/13/07 7:41 PM, "Raghu Angadi" <[EMAIL PROTECTED]> wrote:

> Normally, a Hadoop read saturates either disk bandwidth or network
> bandwidth on moderate hardware. So if you have one modern IDE disk and
> 100 Mbps ethernet, you should expect around 10 MB/s for a simple read
> from a client on a different machine.
>
> Raghu.
>
> j2eeiscool wrote:
>> Hi Raghu,
>>
>> Just to give me something to compare with: how long should this file
>> read (68 megs) take on a good set-up (client and data node on the
>> same network, one hop)?
>>
>> Thanx for your help,
>> Taj
>>
>> Raghu Angadi wrote:
>>> Taj,
>>>
>>> Even 4 times faster (400 sec for 68 MB) is not very fast. First try
>>> to scp a similar-sized file between the hosts involved. If this
>>> transfer is slow, fix that issue first. Try to place the test file
>>> on the same partition where the HDFS data is stored.
>>>
>>> With tcpdump, first make sure the amount of data transferred matches
>>> the roughly 68 MB you expect, and check for any large gaps in the
>>> data packets coming to the client. Also, while the client is
>>> reading, check netstat on both the client and the datanode: note the
>>> send buffer on the datanode and the recv buffer on the client. If
>>> the datanode's send buffer is non-zero most of the time, you have a
>>> network issue; if the recv buffer on the client is full, the client
>>> is reading slowly for some reason... etc.
>>>
>>> Hope this helps.
>>>
>>> Raghu.
>>>
>>> j2eeiscool wrote:
>>>> Hi Raghu,
>>>>
>>>> Good catch, thanx. totalBytesRead is not used for any decision etc.
>>>>
>>>> I ran the client from another m/c and the read was about 4 times
>>>> faster.
>>>>
>>>> I have the tcpdump from the original client m/c. This is probably
>>>> asking too much, but is there anything in particular I should be
>>>> looking for in the tcpdump? It is about 16 megs in size.
>>>>
>>>> Thanx,
>>>> Taj
>>>>
>>>> Raghu Angadi wrote:
>>>>> That's too long.. buffer size does not explain it. The only small
>>>>> problem I see in your code:
>>>>>
>>>>>> totalBytesRead += bytesReadThisRead;
>>>>>> fileNotReadFully = (bytesReadThisRead != -1);
>>>>>
>>>>> totalBytesRead is off by 1 (the final read returns -1, which gets
>>>>> added to the total). Not sure where totalBytesRead is used.
>>>>>
>>>>> If you can, check a tcpdump on your client machine (for datanode
>>>>> port 50010).
>>>>>
>>>>> Raghu.
>>>>>
>>>>> j2eeiscool wrote:
>>>>>> Hi Raghu,
>>>>>>
>>>>>> Many thanx for your reply:
>>>>>>
>>>>>> The write takes approximately 11367 millisecs.
>>>>>> The read takes approximately 1610565 millisecs.
>>>>>>
>>>>>> File size is 68573254 bytes and the HDFS block size is 64 megs.
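
For scale: 68573254 bytes in ~1610565 ms works out to roughly 42 KB/s, far below the ~10 MB/s Raghu suggests even 100 Mbps ethernet should deliver, so the bottleneck is almost certainly outside HDFS itself.

For reference, here is a minimal sketch of the read loop under discussion with the off-by-one fixed: test the return value of read() before adding it to the total. It assumes the usual FileSystem/FSDataInputStream client API; the class name, path argument, and buffer size are placeholders, not taken from Taj's actual code.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadTest {
        public static void main(String[] args) throws IOException {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path file = new Path(args[0]);     // e.g. the 68 MB test file

            byte[] buf = new byte[64 * 1024];  // buffer size is a guess
            long totalBytesRead = 0;
            long start = System.currentTimeMillis();

            FSDataInputStream in = fs.open(file);
            try {
                int bytesReadThisRead;
                // read() returns -1 at EOF; check before accumulating so
                // the total is not off by one as in the original snippet.
                while ((bytesReadThisRead = in.read(buf)) != -1) {
                    totalBytesRead += bytesReadThisRead;
                }
            } finally {
                in.close();
            }

            long millis = System.currentTimeMillis() - start;
            System.out.println(totalBytesRead + " bytes in " + millis + " ms");
        }
    }

Timing this same loop from different client machines (as Taj did) is a quick way to separate a slow client host or network path from a problem on the datanode side.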
