The "Verification succeeded" messages are from a Datanode background 
housekeeping task, DataBlockScanner, which attempts to discover any replicas 
that have become corrupt.  If it finds one (which should be rare), it tells the 
Namenode the replica has become corrupted, and the NN will re-replicate it from 
a good copy on another DN.

DataBlockScanner may consume up to 100% of one CPU core on the DN, but no more.
It is very unlikely to have made the DN unable to do its high-priority work,
such as sending heartbeats and responding to clients.  Unless you're running
DNs on single-core boxes, look to network problems or Namenode overload as more
likely explanations for the problem.
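
To tell those explanations apart, it may help to check whether the lost
heartbeats are spread across many DNs or concentrated on one.  Here is a
minimal Python sketch, assuming the NN log contains the "lost heartbeat from
<host:port>" message in the form quoted below, on a single line.  The function
names are my own and are not part of Hadoop.

```python
import re
from collections import Counter

# Matches the NameNode message quoted in this thread, e.g.
#   ... NameSystem.heartbeatCheck: lost heartbeat from 192.168.1.101:50010
# Assumption: the real log keeps this message on one line.
LOST_HEARTBEAT = re.compile(r"heartbeatCheck: lost\s+heartbeat from (\S+)")


def count_lost_heartbeats(lines):
    """Return a Counter mapping DataNode address -> lost-heartbeat events."""
    counts = Counter()
    for line in lines:
        match = LOST_HEARTBEAT.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def report(path):
    """Print lost-heartbeat counts per DataNode for the log file at `path`."""
    with open(path, errors="replace") as log:
        for node, n in count_lost_heartbeats(log).most_common():
            print(f"{n:6d}  {node}")
```

Call report() with the path to your NN log.  If most DNs show up, NN overload
looks more plausible; if only one or two do, I would look at those hosts'
network or memory first.  Treat that reading as a rule of thumb, not a
diagnosis.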

One other possibility: were the "lost heartbeat" log messages from the startup
of a large cluster?  In v20, prior to a set of startup performance improvements
that a few of us made over the first few months of this year, it was not
uncommon for the NN to get swamped during startup of a large cluster, and to
start losing heartbeats and removing healthy nodes.  This was directly
addressed in trunk and 20-security by HDFS-1541 (patch by Hairong Kuang).

--Matt


On May 31, 2011, at 4:10 AM, Joey Echeverria wrote:

How much memory do you have on your DataNode? Is it possible that
you're swapping?

-Joey

On Mon, May 30, 2011 at 11:09 PM, ccxixicc <ccxix...@foxmail.com> wrote:
> 
> Hi, all
> I found that the NameNode often loses heartbeats from DataNodes:
> org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.heartbeatCheck: lost
> heartbeat from 192.168.1.101:50010
> org.apache.hadoop.net.NetworkTopology: Removing a node:
> /default-rack/102.168.1.101:50010
> 
> meanwhile NN logs:
> org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock:
> blockMap updated: 192.168.1.102:50010 is added to blk_16634224072...
> 
> And DN logs:
> org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Verification
> succeeded for blk_1820616086..
> 
> There are no DFSClients running and I am not doing anything. What are the NN
> and DN doing? CPU usage is almost 100%. Is this why the NN lost heartbeats
> from the DN?
> 
> Thanks.
> 
> 



-- 
Joseph Echeverria
Cloudera, Inc.
443.305.9434
