On Thu, Jan 14, 2010 at 9:28 AM, alex kamil <alex.ka...@gmail.com> wrote:
> > I'm doing the insert from a node on the same rack as the cluster but it is not part of it.
>
> So it looks like you are copying from a single node. I'd try to run the inserts from multiple nodes in parallel, to avoid an IO and/or CPU and/or network bottleneck on the "source" node. Try to upload from multiple locations.
>

You're still missing the point here, Alex: it's not about the performance, it's about the scaling curve.

-Todd

> On Thu, Jan 14, 2010 at 12:13 PM, <zlatin.balev...@barclayscapital.com> wrote:
>
>> More general info:
>>
>> I'm doing the insert from a node on the same rack as the cluster but it is not part of it. Data is being read from a local disk and the datanodes store to the local partitions as well. The filesystem is ext3, but if it were an inode issue the 4-node cluster would perform much worse than the 11-node one. No MR jobs or any other activity is present - these are test clusters that I create and remove with HOD. I'm using revision 897952 (the hdfs ui reports 897347 for some reason), checked out from branch-0.20 a few days ago.
>>
>> Todd:
>>
>> I will repeat the test, waiting several hours after the first round of inserts. Unless the balancer daemon starts by default, I have not started it. The datablocks seemed uniformly spread amongst the datanodes. I've added two additional metrics to be recorded by the datanode - DataNode.xmitsInProgress and DataNode.getXCeiverCount(). These are polled every 10 seconds. If anyone wants me to add additional metrics at any component, let me know.
>>
>> > The test case is making files with a ton of blocks. Appending a block to the end of a file might be O(n) -
>> > usually this isn't a problem since even large files are almost always <100 blocks, and the majority <10.
>> > In the test, there are files with 50,000+ blocks, so O(n) runtime anywhere in the blocklist for a file is pretty bad.
>>
>> The files in the first test are 16k blocks each. I am inserting the same file under different filenames in consecutive runs. If that were the reason, the first insert should take the same amount of time as the last. Nevertheless, I will run the next test with 4k blocks per file and increase the number of consecutive insertions.
>>
>> Dhruba:
>> Unless there is a very high number of collisions, the hashmap should perform in constant time. Even if there were collisions, I would be seeing much higher CPU usage on the NameNode. According to the metrics I've already sent, towards the end of the test the capacity of the BlockMap was 512k and the load was approaching 0.66.
>>
>> Best Regards,
>> Zlatin Balevsky
>>
>> P.S. I could not find contact info for the HOD developers. I'd like to ask them to document the "walltime" and "idleness-limit" parameters!
>>
>> ------------------------------
>> From: Dhruba Borthakur [mailto:dhr...@gmail.com]
>> Sent: Thursday, January 14, 2010 9:04 AM
>> To: hdfs-user@hadoop.apache.org
>> Subject: Re: Exponential performance decay when inserting large number of blocks
>>
>> Here is another thing that came to my mind.
>>
>> The Namenode has a hash map in memory where it inserts all blocks. When a new block needs to be allocated, the namenode first generates a random number and checks to see if it exists in the hashmap. If it does not exist in the hash map, then that number is the block id of the to-be-allocated block. The namenode then inserts this number into the hash map and sends it to the client. The client receives it as the blockid and uses it to write data to the datanode(s).
>>
>> One possibility is that the time to do a hash-lookup varies depending on the number of blocks in the hash.
>>
>> dhruba
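For reference, the allocation scheme Dhruba describes amounts to roughly the following. This is a hypothetical sketch, not the actual NameNode code - the class and method names are made up - but it shows why a hash-map membership check should not, by itself, get slower as blocks accumulate:

    import java.util.HashSet;
    import java.util.Random;
    import java.util.Set;

    // Sketch of the scheme described above: draw random candidate ids until one
    // is not already in the block map, record it, and return it to be handed to
    // the client.
    public class BlockIdAllocatorSketch {
        private final Set<Long> allocatedBlockIds = new HashSet<Long>();
        private final Random random = new Random();

        public synchronized long allocateBlockId() {
            long candidate;
            do {
                candidate = random.nextLong();                    // random candidate id
            } while (allocatedBlockIds.contains(candidate));      // expected O(1) lookup
            allocatedBlockIds.add(candidate);
            return candidate;
        }
    }

With a few hundred thousand blocks drawn from a 64-bit id space, collisions are vanishingly rare, so the loop almost always runs once and the expected lookup cost stays constant as the map grows - which is the point Zlatin makes in his reply above.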
>> On Wed, Jan 13, 2010 at 8:57 PM, alex kamil <alex.ka...@gmail.com> wrote:
>>
>>> > launched 8 instances of the bin/hadoop fs -put utility
>>>
>>> Zlatin, maybe a silly question: are you running dfs -put locally on each datanode, or from a single box?
>>> Also, where are you copying the data from? Do you have local copies on each node before the insert, do all your files reside on a single server, or maybe on NFS?
>>> I would also check the network stats on the datanodes and namenode and see whether the NICs are saturated. I guess you have enough bandwidth, but maybe there is some issue with the NIC on the namenode or something - I have seen strange things happen. You can probably monitor the number of connections/sockets, bandwidth, IO waits, # of threads.
>>> If you are writing to dfs from a single location, maybe that one node has a problem handling all the outbound traffic; if you are distributing files in parallel from multiple nodes, then maybe there is inbound congestion on the namenode or something like that.
>>>
>>> If that's not the case, I'd explore using the distcp utility for copying data in parallel (it comes with the distro).
>>> Also, if you really hit a wall and have some time, I'd take a look at alternatives to the FileSystem API, maybe something like Fuse-DFS and the other packages supported by libhdfs (http://wiki.apache.org/hadoop/LibHDFS).
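As a concrete illustration of "upload from multiple locations": the fs -put utility goes through the FileSystem API, so the same copy can be scripted and launched from several source hosts at once. The sketch below is only an assumption about what such a driver might look like - the paths, thread count, and class name are placeholders, not anything taken from the thread:

    import java.io.IOException;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Placeholder driver: 8 parallel copies into HDFS through the FileSystem API.
    // Running one of these on each of several source machines spreads the
    // outbound I/O instead of funnelling everything through a single box.
    public class ParallelPutSketch {
        public static void main(String[] args) throws Exception {
            final Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
            final FileSystem fs = FileSystem.get(conf);

            ExecutorService pool = Executors.newFixedThreadPool(8);
            for (int i = 0; i < 8; i++) {
                final int id = i;
                pool.submit(new Runnable() {
                    public void run() {
                        try {
                            fs.copyFromLocalFile(new Path("/local/data/part-" + id),
                                                 new Path("/user/test/part-" + id));
                        } catch (IOException e) {
                            e.printStackTrace();
                        }
                    }
                });
            }
            pool.shutdown();
        }
    }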
>>> On Wed, Jan 13, 2010 at 11:00 PM, Todd Lipcon <t...@cloudera.com> wrote:
>>>
>>>> Err, ignore that attachment - attached the wrong graph with the right labels!
>>>>
>>>> Here's the right graph.
>>>>
>>>> -Todd
>>>>
>>>> On Wed, Jan 13, 2010 at 7:53 PM, Todd Lipcon <t...@cloudera.com> wrote:
>>>>
>>>>> On Wed, Jan 13, 2010 at 6:59 PM, Eric Sammer <e...@lifeless.net> wrote:
>>>>>
>>>>>> On 1/13/10 8:12 PM, zlatin.balev...@barclayscapital.com wrote:
>>>>>> > Alex, Dhruba
>>>>>> >
>>>>>> > I repeated the experiment, increasing the block size to 32k. Still doing 8 inserts in parallel; file size is now 512 MB; 11 datanodes. I was also running iostat on one of the datanodes. Did not notice anything that would explain an exponential slowdown. There was more activity while the inserts were active, but far from the limits of the disk system.
>>>>>>
>>>>>> While creating many blocks, could it be that the replication pipelining is eating up the available handler threads on the data nodes? By increasing the block size you would see better performance because the system spends more time writing data to local disk and less time dealing with things like replication "overhead." At a small block size, I could imagine you're artificially creating a situation where you saturate the default-sized thread pools or something weird like that.
>>>>>>
>>>>>> If you're doing 8 inserts in parallel from one machine with 11 nodes this seems unlikely, but it might be worth looking into. The question is whether testing with an artificially small block size like this is even a viable test. At some point the overhead of talking to the name node, selecting data nodes for a block, and setting up replication pipelines could become some abnormally high percentage of the run time.
>>>>>
>>>>> The concern isn't why the insertion is slow, but rather why the scaling curve looks the way it does. Looking at the data, it looks like the insertion rate (blocks per second) is actually related as 1/N, where N is the number of blocks. Attaching another graph of the same data which I think is a little clearer to read.
>>>>>
>>>>>> Also, I wonder if the cluster is trying to rebalance blocks toward the end of your runtime (if the balancer daemon is running) and this is causing additional shuffling of data.
>>>>>
>>>>> That's certainly one possibility.
>>>>>
>>>>> Zlatin: here's a test to try: after the FS is full with 400,000 blocks, let the cluster sit for a few hours, then come back and start another insertion. Is the rate slow, or does it return to the fast starting speed?
>>>>>
>>>>> -Todd
>>
>> --
>> Connect to me at http://www.facebook.com/dhruba
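A quick back-of-the-envelope reading of Todd's 1/N observation (the linear per-block cost below is an assumption for illustration, not something measured in the thread):

    % Suppose the cost of inserting the n-th block grows linearly with the
    % number of blocks already in the namespace: c(n) = a + b n.
    \[
      \text{rate}(n) = \frac{1}{c(n)} = \frac{1}{a + bn} \sim \frac{1}{n},
      \qquad
      T(N) = \sum_{n=1}^{N} (a + bn) = aN + b\,\frac{N(N+1)}{2} = O(N^2).
    \]

Under that assumption, the curves correspond to a roughly quadratic total runtime - an O(n) cost hiding somewhere in the per-block path - rather than a true exponential decay.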