Michele Catasta wrote:
Hi Michael,

thanks for the detailed answer, it has been helpful (especially the
log4j DEBUG level for all those classes).

Check the logs to see if you can get a clue as to what is going on.  Did
the cluster HMaster get the shutdown signal?  (Is it running the
shutdown sequence?)  Logs are in $HADOOP_HOME/logs.  Look at the
hbase-USERID-master-*log content.  Might help if you up the log level to
DEBUG (add the line 'log4j.logger.org.apache.hadoop.hbase.HMaster=DEBUG'
to $HADOOP_HOME/conf/log4j.properties).  Stack traces are also useful for
figuring out where the programs are hung (Send a 'kill -QUIT PROCESS_ID'.
The output will appear in the '*.out' logs).
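
For reference, the addition to $HADOOP_HOME/conf/log4j.properties would look
something like the following (the HRegionServer logger is my own addition by
analogy with the HMaster one, so treat it as an assumption):

  log4j.logger.org.apache.hadoop.hbase.HMaster=DEBUG
  log4j.logger.org.apache.hadoop.hbase.HRegionServer=DEBUG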

I've been able to reproduce the shutdown problem. Basically, we were
deploying our hadoop+hbase installation using a little script to
automate the boring tasks.
The problem is that we called stop-hbase.sh and soon after stop-all.sh
for the hadoop platform.
Since stop-hbase.sh returns immediately after it has been launched,
while hbase actually takes a while to shut down properly...
we basically killed hadoop (and all the RPC facilities that hbase
relies on) while hbase was still shutting down.

Maybe I didn't get the whole picture correctly, but I've been able to
solve the problem with a 'wait until hbase shuts down' in the script.
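
For what it's worth, the wait can be as simple as polling for the master
process after stop-hbase.sh returns.  A rough sketch, assuming 'jps' from the
JDK is on the PATH and that the master shows up in its output as 'HMaster'
(both are assumptions; substitute whatever check fits your install):

  stop-hbase.sh                      # returns immediately; shutdown keeps going in the background
  while jps | grep -q HMaster; do    # master still up, so hbase is still shutting down
    sleep 5
  done
  stop-all.sh                        # only now is it safe to take hdfs away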

hbase needs to do an orderly shutdown (When hdfs gets the pending write-append 
feature, this will be less of an issue).  If hdfs is unceremoniously pulled out 
from under a running hbase, hbase gets hung up trying to relocate the absconded 
filesystem.

I'll create an issue for making hbase behave better in this scenario (And it 
seems like we should also promote DEBUG-level logging to INFO-level so users do 
not have to go messing in log4j properties just to figure out why their cluster 
is sick).


The log left outstanding after an improper shutdown should have been addressed
by HADOOP-1527.

2007-08-28 04:33:01,316 ERROR org.apache.hadoop.hbase.HRegionServer:
Can not start region server because
org.apache.hadoop.ipc.RemoteException:
org.apache.hadoop.dfs.SafeModeException: Cannot create directory
/*******/****************/hbase/log_xxx.xxx.xxx.xxx_60010. Name node
is in safe mode.
Safe mode will be turned off automatically.

hdfs on startup is unwritable for a short period of time (this is 'safe
mode').  During this time the namenode is waiting to see how many datanodes
report in so it can make a call on the state of the filesystem before it
starts to offer service (e.g. if some datanodes fail to report in, it knows
it must start replicating the blocks those dead datanodes were carrying).
Try waiting till hdfs has left 'safe mode' before starting hbase (One way to
tell that hdfs has left 'safe mode' is by looking at the hdfs UI.  By default
it's on port 50070 on the namenode host.  Another is by running
'$HADOOP_HOME/bin/hadoop dfsadmin -safemode get').

I'll make an issue so the hbase scripts wait until hdfs is out of 'safe mode'
before proceeding with the hbase launch.
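
A sketch of what such a wait could look like in a launch script, polling the
dfsadmin command mentioned above (this assumes the command prints a line
containing 'OFF' once safe mode is off, and that start-hbase.sh is the
counterpart of the stop script on your install):

  # block hbase startup until the namenode reports safe mode is off
  until $HADOOP_HOME/bin/hadoop dfsadmin -safemode get | grep -q OFF; do
    sleep 5
  done
  start-hbase.sh
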
Even if I shut down properly, I can reproduce this problem every time I
try to restart hbase. The first time I restart it, the HRegionServer
finds the filesystem in 'safe mode' when it tries to create its log
directory. Once the name node has turned safe mode off, I can restart
hbase without problems.

Is it in some way related to HADOOP-1527? Probably, before the
shutdown, the setSafeMode() method is called on the log directory. I also
tried waiting a good amount of time, but there is no TTL if I'm not
wrong (and I didn't find one in the sources).

So, HADOOP-1527 makes it so that if a region server crashed, then on restart
the log file automatically gets parsed and its edits properly distributed.
So, you should not be seeing problems starting hbase because of log file
problems (excepting the above case where hbase cannot write the filesystem
because it is in 'safe mode').

Yours,
St.Ack
