Hi Chris,
I agree with Tatsuya, but I would expect you will have to "kill" them:
they are stuck in a state where a daemon stop may not work, because they
are not yet listening for that event.
I saw the same thing as Tatsuya and assume you have an issue with the
quorum setting on the slaves. I have not followed the whole discussion
in detail, so let me ask: how did you configure it?
I personally symlinked my standalone zoo.cfg from zookeeper/conf into
the hbase/conf directory. You could also set the quorum servers in
hbase-site.xml, as per the "Getting started" guide.
If you have already done one of these, then make sure all ZK servers
are up and can be reached from the region servers.
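A rough way to check that last point from a region server box is
sketched below. It assumes the default ZooKeeper client port 2181 and
that "nc" is installed; the host names you pass in are whatever is in
your own quorum setting, and check_zk is just a helper name I made up.
Newer ZooKeeper releases may need the "ruok" command whitelisted, so
treat a failure as a hint rather than proof.

```shell
# Sketch only: probe each ZooKeeper server with the four-letter "ruok" command.
ZK_PORT="${ZK_PORT:-2181}"   # assumed default client port

check_zk() {
  # A healthy server answers "imok"; anything else is treated as a failure.
  reply=$(echo ruok | nc -w 2 "$1" "$ZK_PORT" 2>/dev/null)
  if [ "$reply" = "imok" ]; then
    echo "$1: ok"
  else
    echo "$1: NOT reachable or not answering"
  fi
}

# Check every host given on the command line (none given, nothing is checked).
for h in "$@"; do
  check_zk "$h"
done
```

Save it as a script and run it on each slave with your quorum hosts as
arguments.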
Lars
Tatsuya Kawano wrote:
Hi Chris, and thanks Lars for help.
OK. So "jstack 22200" shows that your region server is still trying to
finish starting up, but is stuck partway through while trying to get the
IP address of the master from ZooKeeper.
===========================================================
"main" prio=10 tid=0x0805a800 nid=0x56d2 waiting on condition [0xb72f2000]
java.lang.Thread.State: TIMED_WAITING (sleeping)
at java.lang.Thread.sleep(Native Method)
at org.apache.hadoop.hbase.util.Sleeper.sleep(Sleeper.java:74)
at org.apache.hadoop.hbase.util.Sleeper.sleep(Sleeper.java:51)
at org.apache.hadoop.hbase.regionserver.HRegionServer.watchMasterAddress(HRegionServer.java:387)
at org.apache.hadoop.hbase.regionserver.HRegionServer.reinitializeZooKeeper(HRegionServer.java:315)
at org.apache.hadoop.hbase.regionserver.HRegionServer.reinitialize(HRegionServer.java:306)
at org.apache.hadoop.hbase.regionserver.HRegionServer.<init>(HRegionServer.java:276)
===========================================================
I still need to see the regionserver logs to figure out why this is happening.
Also,
This is what I see when I run start-hbase.sh -- I can ssh into any of the
boxes with no password just fine, it just gives me a weird first time host
message...we get the same thing when we start up hadoop.
...
crunch2: regionserver running as process 6950. Stop it first.
chanel: regionserver running as process 22200. Stop it first.
crunch3: regionserver running as process 28962. Stop it first.
chris: regionserver running as process 28719. Stop it first.
This "Stop it first" message means your region servers didn't stop
when you ran stop-hbase.sh. The master couldn't locate those region
servers, so it couldn't tell them to shut down. That is why you have
those orphaned region servers. Until we finish setting up your HBase
cluster, you'll have to stop those region servers by hand.
To do this, ssh to M2 -- M5, and type the following command:
${HBASE_HOME}/bin/hbase-daemon.sh stop regionserver
Then run jps again to make sure no HRegionServer process remains. If the
above command doesn't work, you can use the Unix "kill" command. Then
ssh to M1 and run stop-hbase.sh to stop the master and ZooKeepers.
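If it helps, the per-slave steps can be wrapped up roughly like this.
It is only a sketch: the function names and the WAIT_SECS variable are
made up for illustration, and it assumes jps is on the PATH and
HBASE_HOME is set on each box.

```shell
# Print the PIDs of HRegionServer processes found in jps-style output on stdin.
regionserver_pids() {
  awk '$2 == "HRegionServer" { print $1 }'
}

# Try the normal daemon stop first, then fall back to kill if it is still there.
stop_regionserver() {
  "${HBASE_HOME}/bin/hbase-daemon.sh" stop regionserver
  sleep "${WAIT_SECS:-10}"          # give the process some time to exit
  pids=$(jps | regionserver_pids)
  if [ -n "$pids" ]; then
    echo "HRegionServer still running, sending TERM to: $pids"
    kill $pids
  fi
}
```

Run stop_regionserver on each of the slaves, then check with jps once
more before stopping the master.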
It's still a mystery why you have zookeeper logs but no regionserver
logs. Maybe those orphaned region servers were the reason? I don't
know, but you can give it another try after stopping them. So, stop all
the HBase / ZooKeeper processes as described above, then run
start-hbase.sh once again. If you can get the regionserver log to us,
that would be great.
Thanks,