Hi all,

Two data nodes in our 60+ node cluster have had disk errors.
We did not notice this until we hit an unusual job failure after
several chained M/R phases (i.e. 1st M/R - 2nd M/R - 3rd M/R ...) and looked into it.
The error message from the JobTracker was

    2009-..-.. INFO org.apache.hadoop.mapred.TaskInProgress: Error from 
attempt_ ...: java.io.IOException: Could not obtain block: blk_..... 
file=/user/.../part-00173

The file "part-00173" was to be generated as output file of some earlier 
phase of M/R.
We've tried to look into the file part-00173 of HDFS to have found a message
at the bottom of screen (<another_data_node>:50075) saying

    java.io.IOException: No nodes contain this block

Why is this? When does replication actually happen?
The file "part-00173" should have been moved out of its "attempt_..." directory
after the job finished successfully and should have reached its configured
replication factor by then, right?
It looks like part-00173 was never replicated enough.
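As a sanity check, we are thinking of verifying the reported replication factor
and block locations with a small client against the Hadoop FileSystem API.
This is only a rough sketch; the path below is a placeholder, not our real
output path:

    import java.util.Arrays;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CheckReplication {
        public static void main(String[] args) throws Exception {
            // Placeholder path -- substitute the real part-00173 output path.
            Path p = new Path("/user/someuser/output/part-00173");

            FileSystem fs = FileSystem.get(new Configuration());
            FileStatus status = fs.getFileStatus(p);

            // Replication factor the NameNode has recorded for this file.
            System.out.println("replication = " + status.getReplication());

            // For every block, the datanodes the NameNode currently knows about.
            // An empty host list here would match the "No nodes contain this
            // block" message we saw.
            for (BlockLocation b :
                    fs.getFileBlockLocations(status, 0, status.getLen())) {
                System.out.println("offset " + b.getOffset()
                    + " hosts " + Arrays.toString(b.getHosts()));
            }
        }
    }

We also plan to run "hadoop fsck <path> -files -blocks -locations" on that file
to see what the NameNode reports for the block.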

What is worse, there are no ERROR-level messages in any log other than the
DataNode's. Our monitoring currently reports any ERROR-level message from the
NameNode and JobTracker logs, but if this (probably) critical error is only
logged at INFO level, we need to redesign our monitoring policy. To catch disk
failures, do we also need to scan the DataNode logs?
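If so, something like the sketch below is what we have in mind: a small scanner
that flags suspicious lines regardless of log level. The log path and the match
patterns are only our guesses, not anything documented:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    public class ScanDataNodeLog {
        // Strings we would treat as suspicious even when the line is only
        // INFO or WARN; these patterns are our own guesses, not a documented
        // list of Hadoop disk-failure messages.
        private static final String[] PATTERNS = {
            "DiskErrorException", "IOException"
        };

        public static void main(String[] args) throws IOException {
            // Placeholder path; the real location depends on HADOOP_LOG_DIR.
            String logFile = args.length > 0
                ? args[0] : "/var/log/hadoop/hadoop-datanode.log";

            BufferedReader in = new BufferedReader(new FileReader(logFile));
            try {
                String line;
                while ((line = in.readLine()) != null) {
                    for (String p : PATTERNS) {
                        if (line.contains(p)) {
                            // Report the line no matter what level it was
                            // logged at (ERROR, WARN or INFO).
                            System.out.println(line);
                            break;
                        }
                    }
                }
            } finally {
                in.close();
            }
        }
    }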

Any help will be appreciated.


Thanks,
Manhee 
