Re: Debugging and fixing Safemode

Matthew Foley Mon, 14 Feb 2011 11:30:28 -0800

Hi Sandhya,
the threshold for leaving safemode automatically is configurable; it defaults 
to 0.999, but you can change parameter "dfs.namenode.safemode.threshold-pct" to 
a different floating-point number in your config.  It is set to almost 100% by 
default, on the theory that (a) if you didn't hit 100% it means some of your 
datanodes didn't come up or suffered data loss, and (b) you might want to know 
about that before letting the cluster start writing and changing files.
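
As a rough illustration of that threshold check, here is a minimal Python 
sketch. It is a simplified model, not the NameNode's actual code: the function 
name and the block counts are made up for the example, and only the 0.999 
default comes from the description above.

```python
def may_leave_safemode(reported_blocks, total_blocks, threshold=0.999):
    """Return True when the fraction of reported blocks meets the threshold.

    Simplified model for illustration; the real NameNode applies
    additional conditions before leaving safemode.
    """
    if total_blocks == 0:
        # Assumption for this sketch: an empty filesystem has nothing to wait for.
        return True
    return reported_blocks / total_blocks >= threshold


# Hypothetical cluster with 100,000 blocks.
print(may_leave_safemode(98_500, 100_000))  # ratio 0.985, below 0.999
print(may_leave_safemode(99_950, 100_000))  # ratio 0.9995, above 0.999
```

In this model, a reported ratio of .98xxx keeps the cluster in safemode until 
either more blocks are reported or the threshold is lowered.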


When the cluster comes out of safe mode, it should automatically fix any 
under-replicated blocks; you don't need to take action to fix them yourself.  
But any files that are damaged by loss of ALL replicas of a block will appear 
corrupted to applications.
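
The difference between the two cases can be sketched in a few lines of Python. 
This is only an illustration of the distinction drawn above; the function name 
and labels are hypothetical, and the target of 3 replicas is simply the usual 
HDFS default, which your cluster may override.

```python
def classify_block(live_replicas, target_replicas=3):
    """Classify one block by how many of its replicas are still alive.

    target_replicas=3 is an assumption (the common HDFS default).
    """
    if live_replicas <= 0:
        # No replica left: the file containing this block appears corrupted.
        return "missing"
    if live_replicas < target_replicas:
        # At least one good copy exists, so the cluster can re-replicate it.
        return "under-replicated"
    return "ok"


for count in (0, 1, 3):
    print(count, classify_block(count))
```

Only the "missing" case needs manual attention; "under-replicated" blocks are 
the ones the cluster repairs by itself after leaving safemode.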

You can run fsck (hadoop fsck <path>) to identify problem files, and use its 
-move or -delete option to move them to /lost+found or delete them.
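
If you want to pull the problem files out of a long fsck report, something 
like the following Python sketch may help. It rests on assumptions: that 
per-file problem lines look like "<path>: <detail>" and contain the phrases 
"Under replicated", "CORRUPT" or "MISSING" (the exact wording can differ 
between Hadoop versions), and that the hadoop command is on your PATH. The 
helper names and the sample text are made up for illustration.

```python
import subprocess

# Phrases assumed to mark problem lines in fsck output; verify on your version.
MARKERS = ("Under replicated", "CORRUPT", "MISSING")


def problem_files(fsck_output):
    """Map each file path to the set of problem markers seen on its lines."""
    problems = {}
    for line in fsck_output.splitlines():
        line = line.strip()
        # Skip summary and progress lines; keep only "<path>: <detail>" lines.
        if not line.startswith("/") or ": " not in line:
            continue
        path, detail = line.split(": ", 1)
        for marker in MARKERS:
            if marker in detail:
                problems.setdefault(path, set()).add(marker)
    return problems


def run_fsck(path="/"):
    """Run 'hadoop fsck <path>' and return its standard output as text."""
    result = subprocess.run(
        ["hadoop", "fsck", path], capture_output=True, text=True, check=False
    )
    return result.stdout


# Hypothetical sample in the assumed format, not real cluster output.
sample = (
    "/data/a.log: Under replicated blk_1. Target Replicas is 3 but found 2 replica(s).\n"
    "/data/b.log: CORRUPT block blk_2\n"
    "/data/b.log: MISSING 1 blocks of total size 1024 B.\n"
    "The filesystem under path '/' is CORRUPT\n"
)
for path, markers in sorted(problem_files(sample).items()):
    print(path, sorted(markers))
```

On a live cluster you would call problem_files(run_fsck("/")) instead of using 
the sample. Files flagged only as "Under replicated" should heal on their own; 
the ones flagged "CORRUPT" or "MISSING" are the candidates to re-copy, move or 
delete.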

Hope this helps,
--Matt


On Feb 11, 2011, at 11:53 AM, Edupuganti, Sandhya wrote:

Our Namenode is going into Safemode after every restart. It reports the ratio 
to be .98xxx whereas it is looking for 0.999 to leave the safe mode. So I'm 
guessing there must be one or two files that are under replicated.

Is there any way I can find out which files are under replicated, so that I can 
re-copy them if I have them, or delete them?

I don’t want to end up with a Namenode silently stuck in safemode next time, 
causing all our jobs to fail.

Any pointers will be greatly appreciated

Many Thanks
Sandhya
