The delay may be in reporting the deleted blocks as free on the web interface as much as in actually marking them as deleted.
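Following Jeff's arithmetic in the thread below, 423763 files at a replication factor of 3 would account for 3 * 423763 = 1271289 blocks, so the 1480735 blocks reported would leave roughly 209446 blocks that have apparently been invalidated but not yet reclaimed or re-reported.
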
On 3/21/08 2:48 PM, "André Martin" <[EMAIL PROTECTED]> wrote:

> Right, I totally forgot about the replication factor... However, sometimes I
> even noticed ratios of 5:1 for blocks to files...
> Is the delay in block deletion/reclaiming intended behavior?
>
> Jeff Eastman wrote:
>> That makes the math come out a lot closer (3*423763=1271289). I've also
>> noticed there is some delay in reclaiming unused blocks, so what you are
>> seeing in terms of block allocations does not surprise me.
>>
>>> -----Original Message-----
>>> From: André Martin [mailto:[EMAIL PROTECTED]
>>> Sent: Friday, March 21, 2008 2:36 PM
>>> To: [email protected]
>>> Subject: Re: Performance / cluster scaling question
>>>
>>> 3 - the default one...
>>>
>>> Jeff Eastman wrote:
>>>> What's your replication factor?
>>>> Jeff
>>>>
>>>>> -----Original Message-----
>>>>> From: André Martin [mailto:[EMAIL PROTECTED]
>>>>> Sent: Friday, March 21, 2008 2:25 PM
>>>>> To: [email protected]
>>>>> Subject: Performance / cluster scaling question
>>>>>
>>>>> Hi everyone,
>>>>> I am running a distributed system that consists of 50 spiders/crawlers
>>>>> and 8 server nodes, backed by a Hadoop DFS cluster with 8 datanodes and
>>>>> a namenode...
>>>>> Each spider has 5 job-processing / data-crawling threads and puts the
>>>>> crawled data onto the DFS as one complete file; additionally, splits are
>>>>> created for each server node and put onto the DFS as files as well. So
>>>>> basically there are 50*5*9 = ~2250 concurrent writes across 8 datanodes.
>>>>> The splits are read by the server nodes and deleted afterwards, so those
>>>>> (split) files exist for only a few seconds to minutes...
>>>>> Since 99% of the files are smaller than 64 MB (the default block size),
>>>>> I expected the number of files to be roughly equal to the number of
>>>>> blocks. After running the system for 24 hours, the namenode web UI shows
>>>>> 423763 files and directories and 1480735 blocks. It looks like the
>>>>> system does not catch up with deleting all the invalidated blocks - is
>>>>> my assumption correct?
>>>>> Also, I noticed that the overall performance of the cluster goes down
>>>>> (see attached image).
>>>>> There are a bunch of "Could not get block locations. Aborting..."
>>>>> exceptions, and they seem to appear more frequently towards the end of
>>>>> the experiment:
>>>>>
>>>>>> java.io.IOException: Could not get block locations. Aborting...
>>>>>>         at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:1824)
>>>>>>         at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1100(DFSClient.java:1479)
>>>>>>         at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1571)
>>>>>
>>>>> So, is the cluster simply saturated by such frequent creation and
>>>>> deletion of files, or is the network the actual bottleneck? The workload
>>>>> does not change at all during the whole experiment.
>>>>> On the cluster side I see lots of the following exceptions:
>>>>>
>>>>>> 2008-03-21 20:28:05,411 INFO org.apache.hadoop.dfs.DataNode:
>>>>>> PacketResponder 1 for block blk_6757062148746339382 terminating
>>>>>> 2008-03-21 20:28:05,411 INFO org.apache.hadoop.dfs.DataNode:
>>>>>> writeBlock blk_6757062148746339382 received exception java.io.EOFException
>>>>>> 2008-03-21 20:28:05,411 ERROR org.apache.hadoop.dfs.DataNode:
>>>>>> 141.xxx.xxx.xxx:50010:DataXceiver: java.io.EOFException
>>>>>>         at java.io.DataInputStream.readInt(Unknown Source)
>>>>>>         at org.apache.hadoop.dfs.DataNode$BlockReceiver.receiveBlock(DataNode.java:2263)
>>>>>>         at org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:1150)
>>>>>>         at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:938)
>>>>>>         at java.lang.Thread.run(Unknown Source)
>>>>>> 2008-03-21 19:26:46,535 INFO org.apache.hadoop.dfs.DataNode:
>>>>>> writeBlock blk_-7369396710977076579 received exception
>>>>>> java.net.SocketException: Connection reset
>>>>>> 2008-03-21 19:26:46,535 ERROR org.apache.hadoop.dfs.DataNode:
>>>>>> 141.xxx.xxx.xxx:50010:DataXceiver: java.net.SocketException: Connection reset
>>>>>>         at java.net.SocketInputStream.read(Unknown Source)
>>>>>>         at java.io.BufferedInputStream.fill(Unknown Source)
>>>>>>         at java.io.BufferedInputStream.read(Unknown Source)
>>>>>>         at java.io.DataInputStream.readInt(Unknown Source)
>>>>>>         at org.apache.hadoop.dfs.DataNode$BlockReceiver.receiveBlock(DataNode.java:2263)
>>>>>>         at org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:1150)
>>>>>>         at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:938)
>>>>>>         at java.lang.Thread.run(Unknown Source)
>>>>>
>>>>> I'm running Hadoop 0.16.1 - has anyone seen the same or similar behavior?
>>>>> How can the performance degradation be avoided? More datanodes? Why does
>>>>> block deletion not seem to keep up with the deletion of the files?
>>>>> Thanks in advance for your insights, ideas & suggestions :-)
>>>>>
>>>>> Cu on the 'net,
>>>>>                 Bye - bye,
>>>>>
>>>>>                 <<<<< André <<<< >>>> èrbnA >>>>>
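
To make the delete-versus-reclaim lag discussed above concrete, here is a minimal sketch of the split-file lifecycle André describes (a small write followed shortly by a delete). It is written against the org.apache.hadoop.fs API of later Hadoop releases, so treat the exact method signatures as assumptions (0.16 differs in places); the path and write size are made up purely for illustration.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SplitLifecycle {
        public static void main(String[] args) throws Exception {
            // Connect to the DFS named in the Hadoop configuration on the classpath.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Hypothetical split file, far below the 64 MB block size like 99% of the files.
            Path split = new Path("/tmp/split-example");
            FSDataOutputStream out = fs.create(split);   // allocates a block, times the replication factor
            out.write(new byte[8 * 1024]);
            out.close();

            // delete() returns as soon as the namenode has removed the file from the
            // namespace; the block replicas on the datanodes are invalidated lazily via
            // heartbeat replies, so the block count shown on the web UI lags behind.
            fs.delete(split, false);
            fs.close();
        }
    }

Multiplied by thousands of short-lived split files per hour, that lag is consistent with the block count on the web UI staying well above the file count.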
