Not sure if it will affect your findings, but when you read from an FSDataInputStream you should check how many bytes were actually read by inspecting the return value, and re-read if it was fewer than you wanted. See Hadoop's IOUtils.readFully() method.
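For illustration, a minimal sketch of that retry loop (an editorial sketch, not code from the thread); note that FSDataInputStream also provides a positioned readFully(pos, buf, off, len) that does the same thing internally:

import java.io.EOFException;
import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;

public class ReadFullyExample {
    // Issue positioned reads until len bytes have arrived or EOF is hit;
    // a single read() may legally return fewer bytes than requested.
    static void readFully(FSDataInputStream in, long pos, byte[] buf,
                          int off, int len) throws IOException {
        int done = 0;
        while (done < len) {
            int n = in.read(pos + done, buf, off + done, len - done);
            if (n < 0) {
                throw new EOFException("Premature EOF after " + done + " bytes");
            }
            done += n;
        }
    }
}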
Tom

On Mon, Apr 13, 2009 at 4:22 PM, Brian Bockelman <[email protected]> wrote:
>
> Hey Todd,
>
> Been playing more this morning after thinking about it for the night -- I
> think the culprit is not the network, but actually the cache. Here's the
> output of your script, adjusted to do the same calls as I was doing (you
> had left out the random I/O part).
>
> [br...@red tmp]$ java hdfs_tester
> Mean value for reads of size 0: 0.0447
> Mean value for reads of size 16384: 10.4872
> Mean value for reads of size 32768: 10.82925
> Mean value for reads of size 49152: 6.2417
> Mean value for reads of size 65536: 7.0511003
> Mean value for reads of size 81920: 9.411599
> Mean value for reads of size 98304: 9.378799
> Mean value for reads of size 114688: 8.99065
> Mean value for reads of size 131072: 5.1378503
> Mean value for reads of size 147456: 6.1324
> Mean value for reads of size 163840: 17.1187
> Mean value for reads of size 180224: 6.5492
> Mean value for reads of size 196608: 8.45695
> Mean value for reads of size 212992: 7.4292
> Mean value for reads of size 229376: 10.7843
> Mean value for reads of size 245760: 9.29095
> Mean value for reads of size 262144: 6.57865
>
> Copy of the script below.
>
> So, without the FUSE layer, we don't see much (if any) pattern here. The
> overhead of randomly skipping through the file is higher than the overhead
> of reading out the data.
>
> Upon further inspection, the biggest factor affecting the FUSE layer is
> actually the Linux VFS caching -- if you notice, the bandwidth in the
> given graph for larger read sizes is *higher* than 1Gbps, which is the
> limit of the network on that particular node. If I go in the opposite
> direction -- starting with the largest reads first, then going down to
> the smallest reads -- the graph smooths out entirely for the small
> values: everything is read from the filesystem cache in the client's RAM.
> Graph attached.
>
> So, on the upside, mounting through FUSE gives us the opportunity to
> speed up reads for very complex, non-sequential patterns -- for free,
> thanks to the hardworking Linux kernel. On the downside, it's incredibly
> difficult to come up with simple cases to demonstrate performance for an
> application -- the cache performance and size depend on how much activity
> there is on the client, the previous filesystem activity of the
> application, and the amount of concurrent activity on the server. I can
> give you results for performance, but they won't be the performance you
> see in real life. (Gee, if only filesystems were easy...)
>
> Ok, sorry for the list noise -- it seems I'm going to have to think more
> about this problem before I can come up with something coherent.
>
> Brian
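A side note on the script that follows: the dis.read() call ignores the number of bytes actually read, and running the read sizes in fixed ascending order lets cache warming pile up in one direction. Below is a sketch of a variant timing loop -- an editorial illustration assuming the same setup boilerplate as Brian's script, not something run in this thread -- that uses the positioned readFully() and shuffles the size order:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import org.apache.hadoop.fs.FSDataInputStream;

public class shuffled_tester {
    // Time positioned reads in a randomized size order so page-cache
    // warming from earlier iterations cannot favor one end of the range.
    static void timeReads(FSDataInputStream dis, long file_len) throws Exception {
        Random rand = new Random();
        List<Integer> sizes = new ArrayList<Integer>();
        for (int size = 4*4096; size < 1024*1024; size += 4*4096) {
            sizes.add(size);
        }
        Collections.shuffle(sizes, rand);
        for (int size : sizes) {
            byte[] buf = new byte[size];
            long pos = (long) rand.nextInt((int)((file_len - size - 1)/8)) * 8;
            long st = System.nanoTime();
            dis.readFully(pos, buf, 0, size); // loops internally until all bytes arrive
            long et = System.nanoTime();
            System.out.println(size + "\t" + pos + "\t" + (et - st));
        }
    }
}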
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.FileStatus;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.fs.FSDataInputStream;
> import org.apache.hadoop.conf.Configuration;
> import java.io.IOException;
> import java.net.URI;
> import java.util.Random;
>
> public class hdfs_tester {
>   public static void main(String[] args) throws Exception {
>     URI uri = new URI("hdfs://hadoop-name:9000/");
>     FileSystem fs = FileSystem.get(uri, new Configuration());
>     Path path = new Path("/user/uscms01/pnfs/unl.edu/data4/cms/store/phedex_monarctest/Nebraska/LoadTest07_Nebraska_33");
>     FSDataInputStream dis = fs.open(path);
>     Random rand = new Random();
>     FileStatus status = fs.getFileStatus(path);
>     long file_len = status.getLen();
>     int iters = 20;
>     for (int size = 0; size < 1024*1024; size += 4*4096) {
>       long csum = 0;
>       for (int i = 0; i < iters; i++) {
>         int pos = rand.nextInt((int)((file_len-size-1)/8))*8;
>         byte buf[] = new byte[size];
>         if (pos < 0)
>           pos = 0;
>         long st = System.nanoTime();
>         dis.read(pos, buf, 0, size);
>         long et = System.nanoTime();
>         csum += et-st;
>         //System.out.println(String.valueOf(size) + "\t" + String.valueOf(pos) + "\t" + String.valueOf(et - st));
>       }
>       float csum2 = csum; csum2 /= iters;
>       System.out.println("Mean value for reads of size " + size + ": " + (csum2/1000/1000));
>     }
>     fs.close();
>   }
> }
>
> On Apr 13, 2009, at 3:14 AM, Todd Lipcon wrote:
>
>> On Mon, Apr 13, 2009 at 1:07 AM, Todd Lipcon <[email protected]> wrote:
>>
>>> Hey Brian,
>>>
>>> This is really interesting stuff. I'm curious -- have you tried these
>>> same experiments using the Java API? I'm wondering whether this is
>>> FUSE-specific or inherent to all HDFS reads. I'll try to reproduce this
>>> over here as well.
>>>
>>> This smells sort of Nagle-related to me... if you get a chance, you may
>>> want to edit DFSClient.java and change TCP_WINDOW_SIZE to 256 * 1024,
>>> and see if the magic number jumps up to 256KB. If so, I think it should
>>> be a pretty easy bugfix.
>>
>> Oops -- spoke too fast there... it looks like TCP_WINDOW_SIZE isn't
>> actually used for any socket configuration, so I don't think that will
>> make a difference... I still think networking might be the culprit,
>> though.
>>
>> -Todd
>>
>>> On Sun, Apr 12, 2009 at 9:41 PM, Brian Bockelman <[email protected]> wrote:
>>>
>>>> Ok, here's something perhaps even more strange. I removed the "seek"
>>>> part from my timings, so I was only timing the "read" instead of the
>>>> "seek + read" as in the first case. I also turned the read-ahead down
>>>> to 1 byte (i.e., off).
>>>>
>>>> The jump *always* occurs at exactly 128KB.
>>>>
>>>> I'm a bit befuddled. I know we say that HDFS is optimized for large,
>>>> sequential reads, not random reads -- but it seems that it's one
>>>> bug-fix away from being a good general-purpose system. Heck if I can
>>>> find what's causing the issues, though...
>>>>
>>>> Brian
>>>>
>>>> On Apr 12, 2009, at 8:53 PM, Brian Bockelman wrote:
>>>>
>>>>> Hey all,
>>>>>
>>>>> I was doing some research on I/O patterns of our applications, and I
>>>>> noticed the attached pattern.
>>>>> In case the mail server strips out attachments, I also uploaded them:
>>>>>
>>>>> http://t2.unl.edu/store/Hadoop_64KB_ra.png
>>>>> http://t2.unl.edu/store/Hadoop_1024KB_ra.png
>>>>>
>>>>> This was taken using the FUSE mounts of Hadoop; the first graph is
>>>>> with a 64KB read-ahead and the second with a 1MB read-ahead. The test
>>>>> read from a 2GB file, seeking to random positions in the file. Each
>>>>> read size was tried 20 times, advancing in 4KB increments. Each blue
>>>>> dot is the read time of one experiment; the red dot is the median
>>>>> read time for that read size. The graphs show the absolute read time.
>>>>>
>>>>> There's very interesting behavior -- there seems to be a change in
>>>>> behavior around reads of size 800KB. The read times go down
>>>>> significantly when you issue *larger* reads. I thought this was just
>>>>> an artifact of the 64KB read-ahead I set in FUSE, so I upped the
>>>>> read-ahead significantly, to 1MB. In that case, the difference
>>>>> between the small read sizes and the large read sizes is *very*
>>>>> pronounced. If it were an artifact of FUSE, I'd expect the point
>>>>> where the change occurs to be a function of the read-ahead size.
>>>>>
>>>>> Anyone out there who knows the code have any ideas? What could I be
>>>>> doing wrong?
>>>>>
>>>>> Brian
>>>>>
>>>>> <Hadoop_64KB_ra.png>
>>>>> <Hadoop_1024KB_ra.png>
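For a baseline with no HDFS or FUSE in the path, the same random-read pattern can be timed against a local file with plain java.io -- an editorial sketch, with a hypothetical local path, useful for separating client page-cache behavior from HDFS overhead:

import java.io.RandomAccessFile;
import java.util.Random;

public class local_tester {
    public static void main(String[] args) throws Exception {
        // Hypothetical path; substitute any large local file (> 2MB here).
        RandomAccessFile raf = new RandomAccessFile("/tmp/testfile", "r");
        long file_len = raf.length();
        Random rand = new Random();
        int iters = 20;
        for (int size = 4*4096; size < 1024*1024; size += 4*4096) {
            long csum = 0;
            for (int i = 0; i < iters; i++) {
                long pos = (long) rand.nextInt((int)((file_len - size - 1)/8)) * 8;
                byte[] buf = new byte[size];
                long st = System.nanoTime();
                raf.seek(pos);
                raf.readFully(buf, 0, size); // DataInput.readFully retries short reads
                long et = System.nanoTime();
                csum += et - st;
            }
            System.out.println("Local mean for reads of size " + size + ": "
                + (csum / (double) iters / 1e6) + " ms");
        }
        raf.close();
    }
}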
