RE: Exponential performance decay when inserting large number of blocks

Zlatin.Balevsky Thu, 14 Jan 2010 09:14:59 -0800

More general info:
 
I'm doing the insert from a node on the same rack as the cluster but it
is not part of it.  Data is being read from a local disk and the
datanodes store to the local partitions as well.  The filesystem is
ext3, but if it were an inode issue the 4-node cluster would perform
much worse than the 11-node.  No MR jobs or any other activity is
present - these are test clusters that I create and remove with HOD.
I'm using revision 897952 (the hdfs ui reports 897347 for some reason) ,
checked out the branch-0.20 a few days ago.
 
Todd:
 
I will repeat the test, waiting several hours after the first round of
inserts.  Unless the balancer daemon starts by default, I have not
started it.  The datablocks seemed uniformly spread amongst the
datanodes.  I've added two additional metrics to be recorded by the
datanode - DataNode.xmitsInProgress and DataNode.getXCeiverCount().
These are polled every 10 seconds.  If anyone wants me to add additional
metrics at any component let me know.
 
> The test case is making files with a ton of blocks. Appending a block
to an end of a file might be O(n) - 
> usually this isn't a problem since even large files are almost always
<100 blocks, and the majority <10. 
> In the test, there are files with 50,000+ blocks, so O(n) runtime
anywhere in the blocklist for a file is pretty bad.

The files in the first test are 16k blocks each.  I am inserting the
same file under different filenames in consecutive runs.  If that were
the reason, the first insert should take the same amount of time as the
last.  Nevertheless, I will run the next test with 4k blocks per file
and increase the number of consecutive insertions.

Dhruba:
Unless there is a very high number of collisions, the hashmap should
perform in constant time.  Even if there were collisions, I would be
seeing much higher CPU usage on the NameNode.  According to the metrics
I've already sent, towards the end of the test the capacity of the
BlockMap was 512k and the load approaching 0.66.  

Best Regards,
Zlatin Balevsky

P.S. I could not find contact info for the HOD developers.  I'd like to
ask them to document the "walltime" and "idleness-limit" parameters!

________________________________

From: Dhruba Borthakur [mailto:dhr...@gmail.com] 
Sent: Thursday, January 14, 2010 9:04 AM
To: hdfs-user@hadoop.apache.org
Subject: Re: Exponential performance decay when inserting large number
of blocks

Here is another thing that came to my mind.

The Namenode has a hash map in memory where it inserts all blocks. when
a new block needs to be allocated, the namenode first generates a random
number and checks to see if ti exists in the hashmap. If it does not
exist in the hash map, then that number is the block id of the
to-be-allocated block. The namenode then inserts this number into the
hash map and sends it to te client. The client receives it as the
blockid and uses it to write data to the datanode(s).

One possibility is that that the time to do a hash-lookup varies
depending on the number of blocks in the hash.

dhruba

On Wed, Jan 13, 2010 at 8:57 PM, alex kamil <alex.ka...@gmail.com>
wrote:

        >launched 8 instances of the bin/hadoop fs -put utility
        Zlatin, may be a silly question, are you running dfs -put
locally on each datanode,  or from a single box 
        Also where are you copying the data from, do you have local
copies on each node before the insert or all your files reside on a
single server, or may be on NFS?
        i would also chk the network stats on datanodes and namenode and
see if the nics are not saturated, i guess you have enough bandwidth but
may be there is some issue with NIC on the namenode or something, i saw
strange things happening. you can probably monitor the number of
conections/sockets, bandwidth, IO waits, # of threads 
        if you are writing to dfs from a single location may be there is
a problem on a single node to handle all this outbound traffic, if you
are distributing files in parallel from multiple nodes, than mat be
there is an inbound congestion on namenode or something like that

        if its not the case, i'd explore using distcp utility for
copying data in parallel  (it comes with the distro)
        also if you really hit a wall, and have some time, i'd take look
at alternatives to Filesystem API, may be simething like Fuse-DFS and
other packages supported by libhdfs
(http://wiki.apache.org/hadoop/LibHDFS)

        On Wed, Jan 13, 2010 at 11:00 PM, Todd Lipcon
<t...@cloudera.com> wrote:

                Err, ignore that attachment - attached the wrong graph
with the right labels! 

                Here's the right graph.

                -Todd 

                On Wed, Jan 13, 2010 at 7:53 PM, Todd Lipcon
<t...@cloudera.com> wrote:

                        On Wed, Jan 13, 2010 at 6:59 PM, Eric Sammer
<e...@lifeless.net> wrote:

                                On 1/13/10 8:12 PM,
zlatin.balev...@barclayscapital.com wrote:
                                > Alex, Dhruba
                                >
                                > I repeated the experiment increasing
the block size to 32k.  Still doing
                                > 8 inserts in parallel, file size now
is 512 MB; 11 datanodes.  I was
                                > also running iostat on one of the
datanodes.  Did not notice anything
                                > that would explain an exponential
slowdown.  There was more activity
                                > while the inserts were active but far
from the limits of the disk system.

                                While creating many blocks, could it be
that the replication pipe lining
                                is eating up the available handler
threads on the data nodes? By
                                increasing the block size you would see
better performance because the
                                system spends more time writing data to
local disk and less time dealing
                                with things like replication "overhead."
At a small block size, I could
                                imagine you're artificially creating a
situation where you saturate the
                                default size configured thread pools or
something weird like that.

                                If you're doing 8 inserts in parallel
from one machine with 11 nodes
                                this seems unlikely, but it might be
worth looking into. The question is
                                if testing with an artificially small
block size like this is even a
                                viable test. At some point the overhead
of talking to the name node,
                                selecting data nodes for a block, and
setting up replication pipe lines
                                could become some abnormally high
percentage of the run time.

                        The concern isn't why the insertion is slow, but
rather why the scaling curve looks the way it does. Looking at the data,
it looks like the insertion rate (blocks per second) is actually related
as 1/n where N is the number of blocks. Attaching another graph of the
same data which I think is a little clearer to read.

                                Also, I wonder if the cluster is trying
to rebalance blocks toward the
                                end of your runtime (if the balancer
daemon is running) and this is
                                causing additional shuffling of data.

                        That's certainly one possibility.

                        Zlatin: here's a test to try: after the FS is
full with 400,000 blocks, let the cluster sit for a few hours, then come
back and start another insertion. Is the rate slow, or does it return to
the fast starting speed?

                        -Todd

-- 
Connect to me at http://www.facebook.com/dhruba

_______________________________________________

This e-mail may contain information that is confidential, privileged or 
otherwise protected from disclosure. If you are not an intended recipient of 
this e-mail, do not duplicate or redistribute it by any means. Please delete it 
and any attachments and notify the sender that you have received it in error. 
Unless specifically indicated, this e-mail is not an offer to buy or sell or a 
solicitation to buy or sell any securities, investment products or other 
financial product or service, an official confirmation of any transaction, or 
an official statement of Barclays. Any views or opinions presented are solely 
those of the author and do not necessarily represent those of Barclays. This 
e-mail is subject to terms available at the following link: 
www.barcap.com/emaildisclaimer. By messaging with Barclays you consent to the 
foregoing.  Barclays Capital is the investment banking division of Barclays 
Bank PLC, a company registered in England (number 1026167) with its registered 
office at 1 Churchill Place, London, E14 5HP.  This email may relate to or be 
sent from other members of the Barclays Group.
_______________________________________________

RE: Exponential performance decay when inserting large number of blocks

Reply via email to