On Mon, Apr 13, 2009 at 1:07 AM, Todd Lipcon <[email protected]> wrote:

> Hey Brian,
>
> This is really interesting stuff. I'm curious - have you tried these same
> experiments using the Java API? I'm wondering whether this is FUSE-specific
> or inherent to all HDFS reads. I'll try to reproduce this over here as well.
>
> This smells sort of Nagle-related to me... if you get a chance, you may
> want to edit DFSClient.java and change TCP_WINDOW_SIZE to 256 * 1024, and
> see if the magic number jumps up to 256KB. If so, I think it should be a
> pretty easy bugfix.
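>
> Concretely, something like this one-line change (a sketch; the constant
> lives in DFSClient.java, and the exact declaration may differ in your
> tree):
>
>     // DFSClient.java: double the window-size constant for the experiment
>     private static final int TCP_WINDOW_SIZE = 256 * 1024;  // was 128 * 1024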
>

Oops - spoke too fast there... looks like TCP_WINDOW_SIZE isn't actually
used for any socket configuration, so I don't think that will make a
difference... still think networking might be the culprit, though.
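For what it's worth, the only place I see the constant consulted is the
seek path - roughly this (paraphrasing from memory, not the verbatim
source):

    // DFSInputStream.seek(targetPos), approximately: for a short forward
    // seek within the current block, discard bytes on the open connection
    // instead of closing it and reconnecting to the datanode.
    int diff = (int) (targetPos - pos);
    if (diff > 0 && diff <= TCP_WINDOW_SIZE) {
      pos += blockReader.skip(diff);  // cheap path: skip on the existing socket
    } else {
      // expensive path: tear down and open a new block reader at targetPos
    }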

-Todd


>
> On Sun, Apr 12, 2009 at 9:41 PM, Brian Bockelman <[email protected]> wrote:
>
>> Ok, here's something perhaps even more strange.  I removed the "seek" part
>> from my timings, so I was only timing the "read" instead of the "seek +
>> read" as in the first case.  I also turned the read-ahead down to 1 byte
>> (i.e., off).
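>>
>> (Concretely, the inner loop now starts the clock after the seek returns
>> - a sketch in the Java API, though the real test went through FUSE, and
>> the variable names are illustrative:)
>>
>>     in.seek(offset);
>>     long t0 = System.nanoTime();    // clock starts after the seek completes
>>     in.readFully(buf, 0, size);     // only the read itself is timed now
>>     long elapsed = System.nanoTime() - t0;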
>>
>> The jump *always* occurs at 128KB, exactly.
>>
>> I'm a bit befuddled.  I know we say that HDFS is optimized for large,
>> sequential reads, not random reads - but it seems that it's one bug-fix away
>> from being a good general-purpose system.  Heck if I can find what's causing
>> the issues though...
>>
>> Brian
>>
>> On Apr 12, 2009, at 8:53 PM, Brian Bockelman wrote:
>>
>>> Hey all,
>>>
>>> I was doing some research on the I/O patterns of our applications, and I
>>> noticed the attached pattern.  In case the mail server strips out
>>> attachments, I also uploaded the graphs:
>>>
>>> http://t2.unl.edu/store/Hadoop_64KB_ra.png
>>> http://t2.unl.edu/store/Hadoop_1024KB_ra.png
>>>
>>> This was taken using the FUSE mounts of Hadoop; the first graph uses a
>>> 64KB read-ahead and the second a 1MB read-ahead.  The test seeked to
>>> random offsets in a 2GB file and then read; each read size was tried 20
>>> times, advancing through the read sizes in 4KB increments.  Each blue
>>> dot is the read time of one experiment; the red dot is the median read
>>> time for that read size.  The graphs show the absolute read time.
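>>>
>>> (For anyone wanting to reproduce this: the measurement loop is
>>> essentially the following.  This sketch goes against the Java API
>>> rather than the FUSE mount I actually used, and the path and the upper
>>> bound on the read size are made up for illustration.)
>>>
>>>     import java.util.Random;
>>>     import org.apache.hadoop.conf.Configuration;
>>>     import org.apache.hadoop.fs.FSDataInputStream;
>>>     import org.apache.hadoop.fs.FileSystem;
>>>     import org.apache.hadoop.fs.Path;
>>>
>>>     public class RandomReadTimer {
>>>       public static void main(String[] args) throws Exception {
>>>         FileSystem fs = FileSystem.get(new Configuration());
>>>         Path file = new Path("/store/test-2GB.bin");  // hypothetical test file
>>>         long len = fs.getFileStatus(file).getLen();
>>>         Random rng = new Random();
>>>         FSDataInputStream in = fs.open(file);
>>>         // Advance through read sizes in 4KB increments, 20 trials each.
>>>         for (int size = 4 * 1024; size <= 2 * 1024 * 1024; size += 4 * 1024) {
>>>           byte[] buf = new byte[size];
>>>           for (int trial = 0; trial < 20; trial++) {
>>>             long offset = (long) (rng.nextDouble() * (len - size));
>>>             long t0 = System.nanoTime();
>>>             in.seek(offset);             // timed: seek + read, as graphed
>>>             in.readFully(buf, 0, size);
>>>             System.out.println(size + "\t" + (System.nanoTime() - t0));
>>>           }
>>>         }
>>>         in.close();
>>>       }
>>>     }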
>>>
>>> There's very interesting behavior - it seems there is a change in
>>> behavior around reads of size 800KB.  The time for the reads goes down
>>> significantly when you issue *larger* reads.  I thought this was just an
>>> artifact of the 64KB read-ahead I set in FUSE, so I upped the read-ahead
>>> significantly, to 1MB.  In this case, the difference between the small
>>> read sizes and the large read sizes is *very* pronounced.  If it were an
>>> artifact of FUSE, I'd expect the point where the change occurs to be a
>>> function of the read-ahead size.
>>>
>>> Does anyone out there who knows the code have any ideas?  What could I
>>> be doing wrong?
>>>
>>> Brian
>>>
>>> <Hadoop_64KB_ra.png>
>>>
>>> <Hadoop_1024KB_ra.png>
>>>
>>
>>
>>
>
