The delay may be in reporting the deleted blocks as free on the web interface as much as in actually marking them as deleted.
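Following Jeff's arithmetic in the thread below, 423763 files at a replication factor of 3 would account for 3 * 423763 = 1271289 blocks, so the 1480735 blocks reported would leave roughly 209446 blocks that have apparently been invalidated but not yet reclaimed or re-reported.
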
On 3/21/08 2:48 PM, "André Martin" <[EMAIL PROTECTED]> wrote:

> Right, I totally forgot about the replication factor... However, sometimes I
> even noticed ratios of 5:1 for blocks to files...
> Is the delay in block deletion/reclaiming intended behavior?
>
> Jeff Eastman wrote:
>> That makes the math come out a lot closer (3*423763=1271289). I've also
>> noticed there is some delay in reclaiming unused blocks, so what you are
>> seeing in terms of block allocations does not surprise me.
>>
>>> -----Original Message-----
>>> From: André Martin [mailto:[EMAIL PROTECTED]
>>> Sent: Friday, March 21, 2008 2:36 PM
>>> To: [email protected]
>>> Subject: Re: Performance / cluster scaling question
>>>
>>> 3 - the default one...
>>>
>>> Jeff Eastman wrote:
>>>> What's your replication factor?
>>>> Jeff
>>>>
>>>>> -----Original Message-----
>>>>> From: André Martin [mailto:[EMAIL PROTECTED]
>>>>> Sent: Friday, March 21, 2008 2:25 PM
>>>>> To: [email protected]
>>>>> Subject: Performance / cluster scaling question
>>>>>
>>>>> Hi everyone,
>>>>> I am running a distributed system that consists of 50 spiders/crawlers
>>>>> and 8 server nodes, backed by a Hadoop DFS cluster with 8 datanodes and
>>>>> a namenode...
>>>>> Each spider has 5 job-processing / data-crawling threads and puts the
>>>>> crawled data onto the DFS as one complete file; additionally, splits are
>>>>> created for each server node and put onto the DFS as files as well. So
>>>>> basically there are 50*5*9 = ~2250 concurrent writes across 8 datanodes.
>>>>> The splits are read by the server nodes and deleted afterwards, so those
>>>>> (split) files exist for only a few seconds to minutes...
>>>>> Since 99% of the files are smaller than 64 MB (the default block size),
>>>>> I expected the number of files to be roughly equal to the number of
>>>>> blocks. After running the system for 24 hours, the namenode web UI shows
>>>>> 423763 files and directories and 1480735 blocks. It looks like the
>>>>> system does not catch up with deleting all the invalidated blocks - is
>>>>> my assumption correct?
>>>>> Also, I noticed that the overall performance of the cluster goes down
>>>>> (see attached image).
>>>>> There are a bunch of "Could not get block locations. Aborting..."
>>>>> exceptions, and they seem to appear more frequently towards the end of
>>>>> the experiment:
>>>>>
>>>>>> java.io.IOException: Could not get block locations. Aborting...
>>>>>>         at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:1824)
>>>>>>         at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1100(DFSClient.java:1479)
>>>>>>         at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1571)
>>>>>
>>>>> So, is the cluster simply saturated by such frequent creation and
>>>>> deletion of files, or is the network the actual bottleneck? The workload
>>>>> does not change at all during the whole experiment.
>>>>> On the cluster side I see lots of the following exceptions:
>>>>>
>>>>>> 2008-03-21 20:28:05,411 INFO org.apache.hadoop.dfs.DataNode:
>>>>>> PacketResponder 1 for block blk_6757062148746339382 terminating
>>>>>> 2008-03-21 20:28:05,411 INFO org.apache.hadoop.dfs.DataNode:
>>>>>> writeBlock blk_6757062148746339382 received exception java.io.EOFException
>>>>>> 2008-03-21 20:28:05,411 ERROR org.apache.hadoop.dfs.DataNode:
>>>>>> 141.xxx.xxx.xxx:50010:DataXceiver: java.io.EOFException
>>>>>>         at java.io.DataInputStream.readInt(Unknown Source)
>>>>>>         at org.apache.hadoop.dfs.DataNode$BlockReceiver.receiveBlock(DataNode.java:2263)
>>>>>>         at org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:1150)
>>>>>>         at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:938)
>>>>>>         at java.lang.Thread.run(Unknown Source)
>>>>>> 2008-03-21 19:26:46,535 INFO org.apache.hadoop.dfs.DataNode:
>>>>>> writeBlock blk_-7369396710977076579 received exception
>>>>>> java.net.SocketException: Connection reset
>>>>>> 2008-03-21 19:26:46,535 ERROR org.apache.hadoop.dfs.DataNode:
>>>>>> 141.xxx.xxx.xxx:50010:DataXceiver: java.net.SocketException: Connection reset
>>>>>>         at java.net.SocketInputStream.read(Unknown Source)
>>>>>>         at java.io.BufferedInputStream.fill(Unknown Source)
>>>>>>         at java.io.BufferedInputStream.read(Unknown Source)
>>>>>>         at java.io.DataInputStream.readInt(Unknown Source)
>>>>>>         at org.apache.hadoop.dfs.DataNode$BlockReceiver.receiveBlock(DataNode.java:2263)
>>>>>>         at org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:1150)
>>>>>>         at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:938)
>>>>>>         at java.lang.Thread.run(Unknown Source)
>>>>>
>>>>> I'm running Hadoop 0.16.1 - has anyone seen the same or similar behavior?
>>>>> How can the performance degradation be avoided? More datanodes? Why does
>>>>> block deletion not seem to keep up with the deletion of the files?
>>>>> Thanks in advance for your insights, ideas & suggestions :-)
>>>>>
>>>>> Cu on the 'net,
>>>>>                 Bye - bye,
>>>>>
>>>>>                 <<<<< André <<<< >>>> èrbnA >>>>>
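
To make the delete-versus-reclaim lag discussed above concrete, here is a minimal sketch of the split-file lifecycle André describes (a small write followed shortly by a delete). It is written against the org.apache.hadoop.fs API of later Hadoop releases, so treat the exact method signatures as assumptions (0.16 differs in places); the path and write size are made up purely for illustration.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SplitLifecycle {
        public static void main(String[] args) throws Exception {
            // Connect to the DFS named in the Hadoop configuration on the classpath.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Hypothetical split file, far below the 64 MB block size like 99% of the files.
            Path split = new Path("/tmp/split-example");
            FSDataOutputStream out = fs.create(split);   // allocates a block, times the replication factor
            out.write(new byte[8 * 1024]);
            out.close();

            // delete() returns as soon as the namenode has removed the file from the
            // namespace; the block replicas on the datanodes are invalidated lazily via
            // heartbeat replies, so the block count shown on the web UI lags behind.
            fs.delete(split, false);
            fs.close();
        }
    }

Multiplied by thousands of short-lived split files per hour, that lag is consistent with the block count on the web UI staying well above the file count.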
