FYI, just ran a 50-node cluster using one of the new Fedora kernels, with all nodes forced into the same 'availability zone', and there were no timeouts or failed writes.

On Mar 27, 2008, at 4:16 PM, Chris K Wensel wrote:
If it's any consolation, I'm seeing similar behaviors on 0.16.0 running on EC2 when I push the number of nodes in the cluster past 40.

On Mar 24, 2008, at 6:31 AM, André Martin wrote:
Thanks for the clarification, dhruba :-)
Anyway, what can cause those other exceptions, such as "Could not get block locations" and "DataXceiver: java.io.EOFException"? Can anyone give me a little more insight into them? And does anyone run a similar workload (frequent writes and deletions of small files), and what could cause the performance degradation (see first post)? I think HDFS should be able to handle two million and more files/blocks... Also, I observed that some of my datanodes do not "heartbeat" to the namenode for several seconds (up to 400 :-() from time to time - when I check those specific datanodes and run "top", I see the "du" command running, which seems to have gotten stuck?!?
Thanks and Happy Easter :-)

Cu on the 'net,
                     Bye - bye,

                                <<<<< André <<<< >>>> èrbnA >>>>>

dhruba Borthakur wrote:

The namenode lazily instructs a Datanode to delete blocks. In response to every heartbeat from a Datanode, the Namenode instructs it to delete a maximum of 100 blocks. Typically, the heartbeat periodicity is 3 seconds. The heartbeat thread in the Datanode deletes the block files synchronously before it can send the next heartbeat. That's the reason a small number (like 100) was chosen.

If you have 8 datanodes, your system will probably delete about 800 blocks every 3 seconds.
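To make the arithmetic concrete, here is a minimal back-of-the-envelope sketch (not Hadoop code; the class and method names are made up for illustration) that estimates how long a deletion backlog takes to drain under the defaults described above: a 3-second heartbeat and at most 100 block invalidations per heartbeat per datanode.

public class BlockDeletionEstimate {

    static final int HEARTBEAT_SECONDS = 3;          // typical heartbeat period
    static final int MAX_BLOCKS_PER_HEARTBEAT = 100; // per-datanode cap per heartbeat

    /** Rough seconds to delete totalBlocks spread evenly over the given datanodes. */
    static long secondsToDrain(long totalBlocks, int datanodes) {
        long blocksPerInterval = (long) datanodes * MAX_BLOCKS_PER_HEARTBEAT;
        long intervals = (totalBlocks + blocksPerInterval - 1) / blocksPerInterval; // ceiling
        return intervals * HEARTBEAT_SECONDS;
    }

    public static void main(String[] args) {
        // 8 datanodes can be told to delete at most 800 blocks every 3 seconds,
        // so a backlog of 400,000 blocks needs about 500 heartbeats, i.e. ~25 minutes.
        System.out.println(secondsToDrain(400_000L, 8) + " seconds");
    }
}

Under these assumptions a large backlog can take tens of minutes to clear even when the cluster is otherwise idle, which would be consistent with the slow drop in block count observed below.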

Thanks,
dhruba

-----Original Message-----
From: André Martin [mailto:[EMAIL PROTECTED]]
Sent: Friday, March 21, 2008 3:06 PM
To: core-user@hadoop.apache.org
Subject: Re: Performance / cluster scaling question

After waiting a few hours (without any load), the block count and "DFS Used" space seem to go down... My question is: is the hardware simply too weak/slow to send the block deletion requests to the datanodes in a timely manner, or do those "crappy" HDDs cause the delay? I noticed that it can take up to 40 minutes to delete ~400,000 files at once manually using "rm -r"... Actually, my main concern is why the performance, i.e. the throughput, goes down - any ideas?


Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/




Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/
http://www.cascading.org/



