Not sure if it will affect your findings, but when you read from an FSDataInputStream you should check how many bytes were actually read by inspecting the return value, and re-read if it was fewer than you wanted. See Hadoop's IOUtils.readFully() method.
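For illustration, a minimal sketch of that retry loop (an editorial sketch, not code from the thread); note that FSDataInputStream also provides a positioned readFully(pos, buf, off, len) that does the same thing internally:

import java.io.EOFException;
import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;

public class ReadFullyExample {
    // Issue positioned reads until len bytes have arrived or EOF is hit;
    // a single read() may legally return fewer bytes than requested.
    static void readFully(FSDataInputStream in, long pos, byte[] buf,
                          int off, int len) throws IOException {
        int done = 0;
        while (done < len) {
            int n = in.read(pos + done, buf, off + done, len - done);
            if (n < 0) {
                throw new EOFException("Premature EOF after " + done + " bytes");
            }
            done += n;
        }
    }
}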
Tom

On Mon, Apr 13, 2009 at 4:22 PM, Brian Bockelman <[email protected]> wrote:
>
> Hey Todd,
>
> Been playing more this morning after thinking about it for the night -- I
> think the culprit is not the network, but actually the cache. Here's the
> output of your script, adjusted to do the same calls as I was doing (you
> had left out the random I/O part).
>
> [br...@red tmp]$ java hdfs_tester
> Mean value for reads of size 0: 0.0447
> Mean value for reads of size 16384: 10.4872
> Mean value for reads of size 32768: 10.82925
> Mean value for reads of size 49152: 6.2417
> Mean value for reads of size 65536: 7.0511003
> Mean value for reads of size 81920: 9.411599
> Mean value for reads of size 98304: 9.378799
> Mean value for reads of size 114688: 8.99065
> Mean value for reads of size 131072: 5.1378503
> Mean value for reads of size 147456: 6.1324
> Mean value for reads of size 163840: 17.1187
> Mean value for reads of size 180224: 6.5492
> Mean value for reads of size 196608: 8.45695
> Mean value for reads of size 212992: 7.4292
> Mean value for reads of size 229376: 10.7843
> Mean value for reads of size 245760: 9.29095
> Mean value for reads of size 262144: 6.57865
>
> Copy of the script below.
>
> So, without the FUSE layer, we don't see much (if any) pattern here. The
> overhead of randomly skipping through the file is higher than the overhead
> of reading out the data.
>
> Upon further inspection, the biggest factor affecting the FUSE layer is
> actually the Linux VFS caching -- if you notice, the bandwidth in the
> given graph for larger read sizes is *higher* than 1Gbps, which is the
> limit of the network on that particular node. If I go in the opposite
> direction -- starting with the largest reads first, then going down to
> the smallest reads -- the graph smooths out entirely for the small
> values: everything is read from the filesystem cache in the client's RAM.
> Graph attached.
>
> So, on the upside, mounting through FUSE gives us the opportunity to
> speed up reads for very complex, non-sequential patterns -- for free,
> thanks to the hardworking Linux kernel. On the downside, it's incredibly
> difficult to come up with simple cases to demonstrate performance for an
> application -- the cache performance and size depend on how much activity
> there is on the client, the previous filesystem activity of the
> application, and the amount of concurrent activity on the server. I can
> give you results for performance, but they won't be the performance you
> see in real life. (Gee, if only filesystems were easy...)
>
> Ok, sorry for the list noise -- it seems I'm going to have to think more
> about this problem before I can come up with something coherent.
>
> Brian
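A side note on the script that follows: the dis.read() call ignores the number of bytes actually read, and running the read sizes in fixed ascending order lets cache warming pile up in one direction. Below is a sketch of a variant timing loop -- an editorial illustration assuming the same setup boilerplate as Brian's script, not something run in this thread -- that uses the positioned readFully() and shuffles the size order:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import org.apache.hadoop.fs.FSDataInputStream;

public class shuffled_tester {
    // Time positioned reads in a randomized size order so page-cache
    // warming from earlier iterations cannot favor one end of the range.
    static void timeReads(FSDataInputStream dis, long file_len) throws Exception {
        Random rand = new Random();
        List<Integer> sizes = new ArrayList<Integer>();
        for (int size = 4*4096; size < 1024*1024; size += 4*4096) {
            sizes.add(size);
        }
        Collections.shuffle(sizes, rand);
        for (int size : sizes) {
            byte[] buf = new byte[size];
            long pos = (long) rand.nextInt((int)((file_len - size - 1)/8)) * 8;
            long st = System.nanoTime();
            dis.readFully(pos, buf, 0, size); // loops internally until all bytes arrive
            long et = System.nanoTime();
            System.out.println(size + "\t" + pos + "\t" + (et - st));
        }
    }
}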
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.FileStatus;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.fs.FSDataInputStream;
> import org.apache.hadoop.conf.Configuration;
> import java.io.IOException;
> import java.net.URI;
> import java.util.Random;
>
> public class hdfs_tester {
>   public static void main(String[] args) throws Exception {
>     URI uri = new URI("hdfs://hadoop-name:9000/");
>     FileSystem fs = FileSystem.get(uri, new Configuration());
>     Path path = new Path("/user/uscms01/pnfs/unl.edu/data4/cms/store/phedex_monarctest/Nebraska/LoadTest07_Nebraska_33");
>     FSDataInputStream dis = fs.open(path);
>     Random rand = new Random();
>     FileStatus status = fs.getFileStatus(path);
>     long file_len = status.getLen();
>     int iters = 20;
>     for (int size = 0; size < 1024*1024; size += 4*4096) {
>       long csum = 0;
>       for (int i = 0; i < iters; i++) {
>         int pos = rand.nextInt((int)((file_len-size-1)/8))*8;
>         byte buf[] = new byte[size];
>         if (pos < 0)
>           pos = 0;
>         long st = System.nanoTime();
>         dis.read(pos, buf, 0, size);
>         long et = System.nanoTime();
>         csum += et-st;
>         //System.out.println(String.valueOf(size) + "\t" + String.valueOf(pos) + "\t" + String.valueOf(et - st));
>       }
>       float csum2 = csum; csum2 /= iters;
>       System.out.println("Mean value for reads of size " + size + ": " + (csum2/1000/1000));
>     }
>     fs.close();
>   }
> }
>
> On Apr 13, 2009, at 3:14 AM, Todd Lipcon wrote:
>
>> On Mon, Apr 13, 2009 at 1:07 AM, Todd Lipcon <[email protected]> wrote:
>>
>>> Hey Brian,
>>>
>>> This is really interesting stuff. I'm curious -- have you tried these
>>> same experiments using the Java API? I'm wondering whether this is
>>> FUSE-specific or inherent to all HDFS reads. I'll try to reproduce this
>>> over here as well.
>>>
>>> This smells sort of Nagle-related to me... if you get a chance, you may
>>> want to edit DFSClient.java and change TCP_WINDOW_SIZE to 256 * 1024,
>>> and see if the magic number jumps up to 256KB. If so, I think it should
>>> be a pretty easy bugfix.
>>
>> Oops -- spoke too fast there... it looks like TCP_WINDOW_SIZE isn't
>> actually used for any socket configuration, so I don't think that will
>> make a difference... I still think networking might be the culprit,
>> though.
>>
>> -Todd
>>
>>> On Sun, Apr 12, 2009 at 9:41 PM, Brian Bockelman <[email protected]> wrote:
>>>
>>>> Ok, here's something perhaps even more strange. I removed the "seek"
>>>> part from my timings, so I was only timing the "read" instead of the
>>>> "seek + read" as in the first case. I also turned the read-ahead down
>>>> to 1 byte (i.e., off).
>>>>
>>>> The jump *always* occurs at exactly 128KB.
>>>>
>>>> I'm a bit befuddled. I know we say that HDFS is optimized for large,
>>>> sequential reads, not random reads -- but it seems that it's one
>>>> bug-fix away from being a good general-purpose system. Heck if I can
>>>> find what's causing the issues, though...
>>>>
>>>> Brian
>>>>
>>>> On Apr 12, 2009, at 8:53 PM, Brian Bockelman wrote:
>>>>
>>>>> Hey all,
>>>>>
>>>>> I was doing some research on I/O patterns of our applications, and I
>>>>> noticed the attached pattern.
>>>>> In case the mail server strips out attachments, I also uploaded them:
>>>>>
>>>>> http://t2.unl.edu/store/Hadoop_64KB_ra.png
>>>>> http://t2.unl.edu/store/Hadoop_1024KB_ra.png
>>>>>
>>>>> This was taken using the FUSE mounts of Hadoop; the first graph is
>>>>> with a 64KB read-ahead and the second with a 1MB read-ahead. The test
>>>>> read from a 2GB file, seeking to random positions in the file. Each
>>>>> read size was tried 20 times, advancing in 4KB increments. Each blue
>>>>> dot is the read time of one experiment; the red dot is the median
>>>>> read time for that read size. The graphs show the absolute read time.
>>>>>
>>>>> There's very interesting behavior -- there seems to be a change in
>>>>> behavior around reads of size 800KB. The read times go down
>>>>> significantly when you issue *larger* reads. I thought this was just
>>>>> an artifact of the 64KB read-ahead I set in FUSE, so I upped the
>>>>> read-ahead significantly, to 1MB. In that case, the difference
>>>>> between the small read sizes and the large read sizes is *very*
>>>>> pronounced. If it were an artifact of FUSE, I'd expect the point
>>>>> where the change occurs to be a function of the read-ahead size.
>>>>>
>>>>> Anyone out there who knows the code have any ideas? What could I be
>>>>> doing wrong?
>>>>>
>>>>> Brian
>>>>>
>>>>> <Hadoop_64KB_ra.png>
>>>>> <Hadoop_1024KB_ra.png>
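For a baseline with no HDFS or FUSE in the path, the same random-read pattern can be timed against a local file with plain java.io -- an editorial sketch, with a hypothetical local path, useful for separating client page-cache behavior from HDFS overhead:

import java.io.RandomAccessFile;
import java.util.Random;

public class local_tester {
    public static void main(String[] args) throws Exception {
        // Hypothetical path; substitute any large local file (> 2MB here).
        RandomAccessFile raf = new RandomAccessFile("/tmp/testfile", "r");
        long file_len = raf.length();
        Random rand = new Random();
        int iters = 20;
        for (int size = 4*4096; size < 1024*1024; size += 4*4096) {
            long csum = 0;
            for (int i = 0; i < iters; i++) {
                long pos = (long) rand.nextInt((int)((file_len - size - 1)/8)) * 8;
                byte[] buf = new byte[size];
                long st = System.nanoTime();
                raf.seek(pos);
                raf.readFully(buf, 0, size); // DataInput.readFully retries short reads
                long et = System.nanoTime();
                csum += et - st;
            }
            System.out.println("Local mean for reads of size " + size + ": "
                + (csum / (double) iters / 1e6) + " ms");
        }
        raf.close();
    }
}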
