I have another conjecture about this: the test case is making files with a ton of blocks. Appending a block to the end of a file might be O(n). Usually this isn't a problem, since even large files are almost always under 100 blocks, and the majority under 10. In this test there are files with 50,000+ blocks, so an O(n) operation anywhere on a file's block list gets expensive, and repeating it for every appended block would make writing the whole file quadratic in its block count.
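Purely as an illustrative sketch (not the actual NameNode code), here's the kind of array-backed block list that would behave that way: if the list is reallocated and copied on every append, each addBlock() is O(n) and a B-block file costs O(B^2) to write. The class and method names are made up for the example.

    // Hypothetical sketch: an array-backed block list that is copied on every
    // append. Each call is O(n) in the blocks already present, so writing a
    // file with B blocks costs O(B^2) overall.
    class FileBlocks {
        private Block[] blocks = new Block[0];

        void addBlock(Block newBlock) {
            // Reallocate and copy the whole existing list on every append.
            Block[] larger = new Block[blocks.length + 1];
            System.arraycopy(blocks, 0, larger, 0, blocks.length);
            larger[blocks.length] = newBlock;
            blocks = larger;
        }
    }

    class Block {
        final long blockId;
        Block(long blockId) { this.blockId = blockId; }
    }

At under 100 blocks that copy is noise; at 50,000+ blocks per file it adds up quickly.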
-Todd

On Thu, Jan 14, 2010 at 6:03 AM, Dhruba Borthakur <dhr...@gmail.com> wrote:
> Here is another thing that came to my mind.
>
> The Namenode has a hash map in memory where it inserts all blocks. When a
> new block needs to be allocated, the namenode first generates a random
> number and checks to see if it exists in the hash map. If it does not exist
> in the hash map, then that number is the block id of the to-be-allocated
> block. The namenode then inserts this number into the hash map and sends it
> to the client. The client receives it as the blockid and uses it to write
> data to the datanode(s).
>
> One possibility is that the time to do a hash lookup varies depending on
> the number of blocks in the hash.
>
> dhruba
>
> On Wed, Jan 13, 2010 at 8:57 PM, alex kamil <alex.ka...@gmail.com> wrote:
>
>> > launched 8 instances of the bin/hadoop fs -put utility
>> Zlatin, maybe a silly question: are you running dfs -put locally on each
>> datanode, or from a single box?
>> Also, where are you copying the data from? Do you have local copies on
>> each node before the insert, or do all your files reside on a single
>> server, or maybe on NFS?
>> I would also check the network stats on the datanodes and namenode and
>> see whether the NICs are saturated. I guess you have enough bandwidth,
>> but maybe there is some issue with the NIC on the namenode or something;
>> I have seen strange things happen. You can monitor the number of
>> connections/sockets, bandwidth, IO waits, and # of threads.
>> If you are writing to dfs from a single location, maybe that one node has
>> trouble handling all the outbound traffic; if you are distributing files
>> in parallel from multiple nodes, then maybe there is inbound congestion
>> on the namenode or something like that.
>>
>> If that's not the case, I'd explore using the distcp utility for copying
>> data in parallel (it comes with the distro).
>> Also, if you really hit a wall and have some time, I'd take a look at
>> alternatives to the FileSystem API, maybe something like Fuse-DFS and
>> other packages supported by libhdfs (http://wiki.apache.org/hadoop/LibHDFS).
>>
>>
>> On Wed, Jan 13, 2010 at 11:00 PM, Todd Lipcon <t...@cloudera.com> wrote:
>>
>>> Err, ignore that attachment - attached the wrong graph with the right
>>> labels!
>>>
>>> Here's the right graph.
>>>
>>> -Todd
>>>
>>>
>>> On Wed, Jan 13, 2010 at 7:53 PM, Todd Lipcon <t...@cloudera.com> wrote:
>>>
>>>> On Wed, Jan 13, 2010 at 6:59 PM, Eric Sammer <e...@lifeless.net> wrote:
>>>>
>>>>> On 1/13/10 8:12 PM, zlatin.balev...@barclayscapital.com wrote:
>>>>> > Alex, Dhruba,
>>>>> >
>>>>> > I repeated the experiment, increasing the block size to 32k. Still
>>>>> > doing 8 inserts in parallel; file size is now 512 MB; 11 datanodes.
>>>>> > I was also running iostat on one of the datanodes and did not notice
>>>>> > anything that would explain an exponential slowdown. There was more
>>>>> > activity while the inserts were active, but far from the limits of
>>>>> > the disk system.
>>>>>
>>>>> While creating many blocks, could it be that the replication
>>>>> pipelining is eating up the available handler threads on the data
>>>>> nodes? By increasing the block size you would see better performance
>>>>> because the system spends more time writing data to local disk and
>>>>> less time dealing with things like replication "overhead."
>>>>> At a small block size, I could imagine you're artificially creating a
>>>>> situation where you saturate the default-sized thread pools or
>>>>> something weird like that.
>>>>>
>>>>> If you're doing 8 inserts in parallel from one machine with 11 nodes,
>>>>> this seems unlikely, but it might be worth looking into. The question
>>>>> is whether testing with an artificially small block size like this is
>>>>> even a viable test. At some point the overhead of talking to the name
>>>>> node, selecting data nodes for a block, and setting up replication
>>>>> pipelines could become an abnormally high percentage of the run time.
>>>>>
>>>>>
>>>> The concern isn't why the insertion is slow, but rather why the scaling
>>>> curve looks the way it does. Looking at the data, it looks like the
>>>> insertion rate (blocks per second) actually goes as 1/N, where N is the
>>>> number of blocks. Attaching another graph of the same data which I
>>>> think is a little clearer to read.
>>>>
>>>>
>>>>> Also, I wonder if the cluster is trying to rebalance blocks toward the
>>>>> end of your runtime (if the balancer daemon is running) and this is
>>>>> causing additional shuffling of data.
>>>>>
>>>>
>>>> That's certainly one possibility.
>>>>
>>>> Zlatin: here's a test to try. After the FS is full with 400,000 blocks,
>>>> let the cluster sit for a few hours, then come back and start another
>>>> insertion. Is the rate still slow, or does it return to the fast
>>>> starting speed?
>>>>
>>>> -Todd
>>>>
>>>
>>>
>>
>
>
> --
> Connect to me at http://www.facebook.com/dhruba
>
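To make the block allocation Dhruba describes above concrete, here is a minimal sketch of that loop, assuming a plain in-memory java.util.HashSet keyed by block id (a stand-in for the hash map, since only membership matters for this step). The class and method names are illustrative, not the actual NameNode API.

    import java.util.HashSet;
    import java.util.Random;
    import java.util.Set;

    // Hypothetical sketch of the scheme described above: draw random ids
    // until one is not already present in the in-memory set, record it, and
    // hand it to the client as the new block id.
    class BlockIdAllocator {
        private final Set<Long> allocatedIds = new HashSet<Long>();
        private final Random random = new Random();

        synchronized long allocateBlockId() {
            long candidate;
            do {
                candidate = random.nextLong();
            } while (!allocatedIds.add(candidate)); // add() returns false if the id is already taken
            return candidate;
        }
    }

Since a HashSet lookup and insert are expected O(1), this step on its own shouldn't slow down in proportion to the number of blocks unless the table is resizing or colliding badly, which is presumably the kind of variation Dhruba is suggesting we look for.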