I can confirm this. I routinely read and process 100 Mbytes/second with very modest hardware on fewer than 10 machines. This includes decompression, decryption, and expensive parsing that typically saturate the CPUs. I use gigabit ethernet, but otherwise have very similar hardware.
On 11/13/07 7:41 PM, "Raghu Angadi" <[EMAIL PROTECTED]> wrote:

> Normally, a Hadoop read saturates either disk bandwidth or network
> bandwidth on moderate hardware. So if you have one modern IDE disk and
> 100 Mbps ethernet, you should expect around 10 MB/s for a simple read
> from a client on a different machine.
>
> Raghu.
>
> j2eeiscool wrote:
>> Hi Raghu,
>>
>> Just to give me something to compare with: how long should this file
>> read (68 megs) take on a good set-up (client and data node on the
>> same network, one hop)?
>>
>> Thanx for your help,
>> Taj
>>
>> Raghu Angadi wrote:
>>> Taj,
>>>
>>> Even 4 times faster (400 sec for 68 MB) is not very fast. First try
>>> to scp a similar-sized file between the hosts involved. If this
>>> transfer is slow, fix that issue first. Try to place the test file
>>> on the same partition where the HDFS data is stored.
>>>
>>> With tcpdump, first make sure the amount of data transferred matches
>>> the roughly 68 MB you expect, and check for any large gaps in the
>>> data packets coming to the client. Also, while the client is
>>> reading, check netstat on both the client and the datanode: note the
>>> send buffer on the datanode and the recv buffer on the client. If
>>> the datanode's send buffer is non-zero most of the time, you have a
>>> network issue; if the recv buffer on the client is full, the client
>>> is reading slowly for some reason... etc.
>>>
>>> Hope this helps.
>>>
>>> Raghu.
>>>
>>> j2eeiscool wrote:
>>>> Hi Raghu,
>>>>
>>>> Good catch, thanx. totalBytesRead is not used for any decision etc.
>>>>
>>>> I ran the client from another m/c and the read was about 4 times
>>>> faster.
>>>>
>>>> I have the tcpdump from the original client m/c. This is probably
>>>> asking too much, but is there anything in particular I should be
>>>> looking for in the tcpdump? It is about 16 megs in size.
>>>>
>>>> Thanx,
>>>> Taj
>>>>
>>>> Raghu Angadi wrote:
>>>>> That's too long.. buffer size does not explain it. The only small
>>>>> problem I see in your code:
>>>>>
>>>>>> totalBytesRead += bytesReadThisRead;
>>>>>> fileNotReadFully = (bytesReadThisRead != -1);
>>>>>
>>>>> totalBytesRead is off by 1 (the final read returns -1, which gets
>>>>> added to the total). Not sure where totalBytesRead is used.
>>>>>
>>>>> If you can, check a tcpdump on your client machine (for datanode
>>>>> port 50010).
>>>>>
>>>>> Raghu.
>>>>>
>>>>> j2eeiscool wrote:
>>>>>> Hi Raghu,
>>>>>>
>>>>>> Many thanx for your reply:
>>>>>>
>>>>>> The write takes approximately 11367 millisecs.
>>>>>> The read takes approximately 1610565 millisecs.
>>>>>>
>>>>>> File size is 68573254 bytes and the HDFS block size is 64 megs.
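
For scale: 68573254 bytes in ~1610565 ms works out to roughly 42 KB/s, far below the ~10 MB/s Raghu suggests even 100 Mbps ethernet should deliver, so the bottleneck is almost certainly outside HDFS itself.

For reference, here is a minimal sketch of the read loop under discussion with the off-by-one fixed: test the return value of read() before adding it to the total. It assumes the usual FileSystem/FSDataInputStream client API; the class name, path argument, and buffer size are placeholders, not taken from Taj's actual code.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadTest {
        public static void main(String[] args) throws IOException {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path file = new Path(args[0]);     // e.g. the 68 MB test file

            byte[] buf = new byte[64 * 1024];  // buffer size is a guess
            long totalBytesRead = 0;
            long start = System.currentTimeMillis();

            FSDataInputStream in = fs.open(file);
            try {
                int bytesReadThisRead;
                // read() returns -1 at EOF; check before accumulating so
                // the total is not off by one as in the original snippet.
                while ((bytesReadThisRead = in.read(buf)) != -1) {
                    totalBytesRead += bytesReadThisRead;
                }
            } finally {
                in.close();
            }

            long millis = System.currentTimeMillis() - start;
            System.out.println(totalBytesRead + " bytes in " + millis + " ms");
        }
    }

Timing this same loop from different client machines (as Taj did) is a quick way to separate a slow client host or network path from a problem on the datanode side.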
