On Thu, Jan 14, 2010 at 9:28 AM, alex kamil <alex.ka...@gmail.com> wrote:
> > I'm doing the insert from a node on the same rack as the cluster but it is not part of it.
>
> So it looks like you are copying from a single node. I'd try to run the inserts from multiple nodes in parallel, to avoid an IO and/or CPU and/or network bottleneck on the "source" node. Try to upload from multiple locations.
>

You're still missing the point here, Alex: it's not about the performance, it's about the scaling curve.

-Todd

> On Thu, Jan 14, 2010 at 12:13 PM, <zlatin.balev...@barclayscapital.com> wrote:
>
>> More general info:
>>
>> I'm doing the insert from a node on the same rack as the cluster but it is not part of it. Data is being read from a local disk and the datanodes store to the local partitions as well. The filesystem is ext3, but if it were an inode issue the 4-node cluster would perform much worse than the 11-node one. No MR jobs or any other activity is present - these are test clusters that I create and remove with HOD. I'm using revision 897952 (the hdfs ui reports 897347 for some reason), checked out from branch-0.20 a few days ago.
>>
>> Todd:
>>
>> I will repeat the test, waiting several hours after the first round of inserts. Unless the balancer daemon starts by default, I have not started it. The datablocks seemed uniformly spread amongst the datanodes. I've added two additional metrics to be recorded by the datanode - DataNode.xmitsInProgress and DataNode.getXCeiverCount(). These are polled every 10 seconds. If anyone wants me to add additional metrics at any component, let me know.
>>
>> > The test case is making files with a ton of blocks. Appending a block to the end of a file might be O(n) -
>> > usually this isn't a problem since even large files are almost always <100 blocks, and the majority <10.
>> > In the test, there are files with 50,000+ blocks, so O(n) runtime anywhere in the blocklist for a file is pretty bad.
>>
>> The files in the first test are 16k blocks each. I am inserting the same file under different filenames in consecutive runs. If that were the reason, the first insert should take the same amount of time as the last. Nevertheless, I will run the next test with 4k blocks per file and increase the number of consecutive insertions.
>>
>> Dhruba:
>> Unless there is a very high number of collisions, the hashmap should perform in constant time. Even if there were collisions, I would be seeing much higher CPU usage on the NameNode. According to the metrics I've already sent, towards the end of the test the capacity of the BlockMap was 512k and the load was approaching 0.66.
>>
>> Best Regards,
>> Zlatin Balevsky
>>
>> P.S. I could not find contact info for the HOD developers. I'd like to ask them to document the "walltime" and "idleness-limit" parameters!
>>
>> ------------------------------
>> From: Dhruba Borthakur [mailto:dhr...@gmail.com]
>> Sent: Thursday, January 14, 2010 9:04 AM
>> To: hdfs-user@hadoop.apache.org
>> Subject: Re: Exponential performance decay when inserting large number of blocks
>>
>> Here is another thing that came to my mind.
>>
>> The Namenode has a hash map in memory where it inserts all blocks. When a new block needs to be allocated, the namenode first generates a random number and checks to see if it exists in the hashmap. If it does not exist in the hash map, then that number is the block id of the to-be-allocated block. The namenode then inserts this number into the hash map and sends it to the client. The client receives it as the blockid and uses it to write data to the datanode(s).
>>
>> One possibility is that the time to do a hash-lookup varies depending on the number of blocks in the hash.
>>
>> dhruba
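For reference, the allocation scheme Dhruba describes amounts to roughly the following. This is a hypothetical sketch, not the actual NameNode code - the class and method names are made up - but it shows why a hash-map membership check should not, by itself, get slower as blocks accumulate:

    import java.util.HashSet;
    import java.util.Random;
    import java.util.Set;

    // Sketch of the scheme described above: draw random candidate ids until one
    // is not already in the block map, record it, and return it to be handed to
    // the client.
    public class BlockIdAllocatorSketch {
        private final Set<Long> allocatedBlockIds = new HashSet<Long>();
        private final Random random = new Random();

        public synchronized long allocateBlockId() {
            long candidate;
            do {
                candidate = random.nextLong();                    // random candidate id
            } while (allocatedBlockIds.contains(candidate));      // expected O(1) lookup
            allocatedBlockIds.add(candidate);
            return candidate;
        }
    }

With a few hundred thousand blocks drawn from a 64-bit id space, collisions are vanishingly rare, so the loop almost always runs once and the expected lookup cost stays constant as the map grows - which is the point Zlatin makes in his reply above.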
>> On Wed, Jan 13, 2010 at 8:57 PM, alex kamil <alex.ka...@gmail.com> wrote:
>>
>>> > launched 8 instances of the bin/hadoop fs -put utility
>>>
>>> Zlatin, maybe a silly question: are you running dfs -put locally on each datanode, or from a single box?
>>> Also, where are you copying the data from? Do you have local copies on each node before the insert, do all your files reside on a single server, or maybe on NFS?
>>> I would also check the network stats on the datanodes and namenode and see whether the NICs are saturated. I guess you have enough bandwidth, but maybe there is some issue with the NIC on the namenode or something - I have seen strange things happen. You can probably monitor the number of connections/sockets, bandwidth, IO waits, # of threads.
>>> If you are writing to dfs from a single location, maybe that one node has a problem handling all the outbound traffic; if you are distributing files in parallel from multiple nodes, then maybe there is inbound congestion on the namenode or something like that.
>>>
>>> If that's not the case, I'd explore using the distcp utility for copying data in parallel (it comes with the distro).
>>> Also, if you really hit a wall and have some time, I'd take a look at alternatives to the FileSystem API, maybe something like Fuse-DFS and the other packages supported by libhdfs (http://wiki.apache.org/hadoop/LibHDFS).
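As a concrete illustration of "upload from multiple locations": the fs -put utility goes through the FileSystem API, so the same copy can be scripted and launched from several source hosts at once. The sketch below is only an assumption about what such a driver might look like - the paths, thread count, and class name are placeholders, not anything taken from the thread:

    import java.io.IOException;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Placeholder driver: 8 parallel copies into HDFS through the FileSystem API.
    // Running one of these on each of several source machines spreads the
    // outbound I/O instead of funnelling everything through a single box.
    public class ParallelPutSketch {
        public static void main(String[] args) throws Exception {
            final Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
            final FileSystem fs = FileSystem.get(conf);

            ExecutorService pool = Executors.newFixedThreadPool(8);
            for (int i = 0; i < 8; i++) {
                final int id = i;
                pool.submit(new Runnable() {
                    public void run() {
                        try {
                            fs.copyFromLocalFile(new Path("/local/data/part-" + id),
                                                 new Path("/user/test/part-" + id));
                        } catch (IOException e) {
                            e.printStackTrace();
                        }
                    }
                });
            }
            pool.shutdown();
        }
    }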
>>> On Wed, Jan 13, 2010 at 11:00 PM, Todd Lipcon <t...@cloudera.com> wrote:
>>>
>>>> Err, ignore that attachment - attached the wrong graph with the right labels!
>>>>
>>>> Here's the right graph.
>>>>
>>>> -Todd
>>>>
>>>> On Wed, Jan 13, 2010 at 7:53 PM, Todd Lipcon <t...@cloudera.com> wrote:
>>>>
>>>>> On Wed, Jan 13, 2010 at 6:59 PM, Eric Sammer <e...@lifeless.net> wrote:
>>>>>
>>>>>> On 1/13/10 8:12 PM, zlatin.balev...@barclayscapital.com wrote:
>>>>>> > Alex, Dhruba
>>>>>> >
>>>>>> > I repeated the experiment, increasing the block size to 32k. Still doing 8 inserts in parallel; file size is now 512 MB; 11 datanodes. I was also running iostat on one of the datanodes. Did not notice anything that would explain an exponential slowdown. There was more activity while the inserts were active, but far from the limits of the disk system.
>>>>>>
>>>>>> While creating many blocks, could it be that the replication pipelining is eating up the available handler threads on the data nodes? By increasing the block size you would see better performance because the system spends more time writing data to local disk and less time dealing with things like replication "overhead." At a small block size, I could imagine you're artificially creating a situation where you saturate the default-sized thread pools or something weird like that.
>>>>>>
>>>>>> If you're doing 8 inserts in parallel from one machine with 11 nodes this seems unlikely, but it might be worth looking into. The question is whether testing with an artificially small block size like this is even a viable test. At some point the overhead of talking to the name node, selecting data nodes for a block, and setting up replication pipelines could become some abnormally high percentage of the run time.
>>>>>
>>>>> The concern isn't why the insertion is slow, but rather why the scaling curve looks the way it does. Looking at the data, it looks like the insertion rate (blocks per second) is actually related as 1/N, where N is the number of blocks. Attaching another graph of the same data which I think is a little clearer to read.
>>>>>
>>>>>> Also, I wonder if the cluster is trying to rebalance blocks toward the end of your runtime (if the balancer daemon is running) and this is causing additional shuffling of data.
>>>>>
>>>>> That's certainly one possibility.
>>>>>
>>>>> Zlatin: here's a test to try: after the FS is full with 400,000 blocks, let the cluster sit for a few hours, then come back and start another insertion. Is the rate slow, or does it return to the fast starting speed?
>>>>>
>>>>> -Todd
>>
>> --
>> Connect to me at http://www.facebook.com/dhruba
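A quick back-of-the-envelope reading of Todd's 1/N observation (the linear per-block cost below is an assumption for illustration, not something measured in the thread):

    % Suppose the cost of inserting the n-th block grows linearly with the
    % number of blocks already in the namespace: c(n) = a + b n.
    \[
      \text{rate}(n) = \frac{1}{c(n)} = \frac{1}{a + bn} \sim \frac{1}{n},
      \qquad
      T(N) = \sum_{n=1}^{N} (a + bn) = aN + b\,\frac{N(N+1)}{2} = O(N^2).
    \]

Under that assumption, the curves correspond to a roughly quadratic total runtime - an O(n) cost hiding somewhere in the per-block path - rather than a true exponential decay.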