I have another conjecture about this: the test case is making files with a ton of blocks. Appending a block to the end of a file might be O(n). Usually this isn't a problem, since even large files are almost always under 100 blocks, and the majority under 10. In this test there are files with 50,000+ blocks, so an O(n) operation anywhere on a file's block list gets expensive, and repeating it for every appended block would make writing the whole file quadratic in its block count.
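Purely as an illustrative sketch (not the actual NameNode code), here's the kind of array-backed block list that would behave that way: if the list is reallocated and copied on every append, each addBlock() is O(n) and a B-block file costs O(B^2) to write. The class and method names are made up for the example.

    // Hypothetical sketch: an array-backed block list that is copied on every
    // append. Each call is O(n) in the blocks already present, so writing a
    // file with B blocks costs O(B^2) overall.
    class FileBlocks {
        private Block[] blocks = new Block[0];

        void addBlock(Block newBlock) {
            // Reallocate and copy the whole existing list on every append.
            Block[] larger = new Block[blocks.length + 1];
            System.arraycopy(blocks, 0, larger, 0, blocks.length);
            larger[blocks.length] = newBlock;
            blocks = larger;
        }
    }

    class Block {
        final long blockId;
        Block(long blockId) { this.blockId = blockId; }
    }

At under 100 blocks that copy is noise; at 50,000+ blocks per file it adds up quickly.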
-Todd

On Thu, Jan 14, 2010 at 6:03 AM, Dhruba Borthakur <dhr...@gmail.com> wrote:
> Here is another thing that came to my mind.
>
> The Namenode has a hash map in memory where it inserts all blocks. When a
> new block needs to be allocated, the namenode first generates a random
> number and checks to see if it exists in the hash map. If it does not exist
> in the hash map, then that number is the block id of the to-be-allocated
> block. The namenode then inserts this number into the hash map and sends it
> to the client. The client receives it as the blockid and uses it to write
> data to the datanode(s).
>
> One possibility is that the time to do a hash lookup varies depending on
> the number of blocks in the hash.
>
> dhruba
>
> On Wed, Jan 13, 2010 at 8:57 PM, alex kamil <alex.ka...@gmail.com> wrote:
>
>> > launched 8 instances of the bin/hadoop fs -put utility
>> Zlatin, maybe a silly question: are you running dfs -put locally on each
>> datanode, or from a single box?
>> Also, where are you copying the data from? Do you have local copies on
>> each node before the insert, or do all your files reside on a single
>> server, or maybe on NFS?
>> I would also check the network stats on the datanodes and namenode and
>> see whether the NICs are saturated. I guess you have enough bandwidth,
>> but maybe there is some issue with the NIC on the namenode or something;
>> I have seen strange things happen. You can monitor the number of
>> connections/sockets, bandwidth, IO waits, and # of threads.
>> If you are writing to dfs from a single location, maybe that one node has
>> trouble handling all the outbound traffic; if you are distributing files
>> in parallel from multiple nodes, then maybe there is inbound congestion
>> on the namenode or something like that.
>>
>> If that's not the case, I'd explore using the distcp utility for copying
>> data in parallel (it comes with the distro).
>> Also, if you really hit a wall and have some time, I'd take a look at
>> alternatives to the FileSystem API, maybe something like Fuse-DFS and
>> other packages supported by libhdfs (http://wiki.apache.org/hadoop/LibHDFS).
>>
>>
>> On Wed, Jan 13, 2010 at 11:00 PM, Todd Lipcon <t...@cloudera.com> wrote:
>>
>>> Err, ignore that attachment - attached the wrong graph with the right
>>> labels!
>>>
>>> Here's the right graph.
>>>
>>> -Todd
>>>
>>>
>>> On Wed, Jan 13, 2010 at 7:53 PM, Todd Lipcon <t...@cloudera.com> wrote:
>>>
>>>> On Wed, Jan 13, 2010 at 6:59 PM, Eric Sammer <e...@lifeless.net> wrote:
>>>>
>>>>> On 1/13/10 8:12 PM, zlatin.balev...@barclayscapital.com wrote:
>>>>> > Alex, Dhruba,
>>>>> >
>>>>> > I repeated the experiment, increasing the block size to 32k. Still
>>>>> > doing 8 inserts in parallel; file size is now 512 MB; 11 datanodes.
>>>>> > I was also running iostat on one of the datanodes and did not notice
>>>>> > anything that would explain an exponential slowdown. There was more
>>>>> > activity while the inserts were active, but far from the limits of
>>>>> > the disk system.
>>>>>
>>>>> While creating many blocks, could it be that the replication
>>>>> pipelining is eating up the available handler threads on the data
>>>>> nodes? By increasing the block size you would see better performance
>>>>> because the system spends more time writing data to local disk and
>>>>> less time dealing with things like replication "overhead."
>>>>> At a small block size, I could imagine you're artificially creating a
>>>>> situation where you saturate the default-sized thread pools or
>>>>> something weird like that.
>>>>>
>>>>> If you're doing 8 inserts in parallel from one machine with 11 nodes,
>>>>> this seems unlikely, but it might be worth looking into. The question
>>>>> is whether testing with an artificially small block size like this is
>>>>> even a viable test. At some point the overhead of talking to the name
>>>>> node, selecting data nodes for a block, and setting up replication
>>>>> pipelines could become an abnormally high percentage of the run time.
>>>>>
>>>>>
>>>> The concern isn't why the insertion is slow, but rather why the scaling
>>>> curve looks the way it does. Looking at the data, it looks like the
>>>> insertion rate (blocks per second) actually goes as 1/N, where N is the
>>>> number of blocks. Attaching another graph of the same data which I
>>>> think is a little clearer to read.
>>>>
>>>>
>>>>> Also, I wonder if the cluster is trying to rebalance blocks toward the
>>>>> end of your runtime (if the balancer daemon is running) and this is
>>>>> causing additional shuffling of data.
>>>>>
>>>>
>>>> That's certainly one possibility.
>>>>
>>>> Zlatin: here's a test to try. After the FS is full with 400,000 blocks,
>>>> let the cluster sit for a few hours, then come back and start another
>>>> insertion. Is the rate still slow, or does it return to the fast
>>>> starting speed?
>>>>
>>>> -Todd
>>>>
>>>
>>>
>>
>
>
> --
> Connect to me at http://www.facebook.com/dhruba
>
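To make the block allocation Dhruba describes above concrete, here is a minimal sketch of that loop, assuming a plain in-memory java.util.HashSet keyed by block id (a stand-in for the hash map, since only membership matters for this step). The class and method names are illustrative, not the actual NameNode API.

    import java.util.HashSet;
    import java.util.Random;
    import java.util.Set;

    // Hypothetical sketch of the scheme described above: draw random ids
    // until one is not already present in the in-memory set, record it, and
    // hand it to the client as the new block id.
    class BlockIdAllocator {
        private final Set<Long> allocatedIds = new HashSet<Long>();
        private final Random random = new Random();

        synchronized long allocateBlockId() {
            long candidate;
            do {
                candidate = random.nextLong();
            } while (!allocatedIds.add(candidate)); // add() returns false if the id is already taken
            return candidate;
        }
    }

Since a HashSet lookup and insert are expected O(1), this step on its own shouldn't slow down in proportion to the number of blocks unless the table is resizing or colliding badly, which is presumably the kind of variation Dhruba is suggesting we look for.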