@Lars -- We're configured like in the "Getting Started" guide --
tried to
follow the instructions verbatim.
Ok. Thanks for the tip. I'm not sure what causes the orphan
behavior, but
thats good to know. I'll also know what to look for in the jstack.
After
stopping the region servers individually, then doing a stop /
restart, I get
regionserver logs
Here is M2 (chanel)
Wed Nov 11 08:49:30 EST 2009 Starting regionserver on chanel
ulimit -n 1024
2009-11-11 08:49:32,081 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer:
vmInputArguments=[-Xmx1000m, -XX:+HeapDumpOnOutOfMemoryError,
-XX:+UseConcMarkSweepGC, -XX:+CMSIncrementalMode, -Dhbase.lo
2009-11-11 08:49:32,229 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: My address is
chanel.local:60020
2009-11-11 08:49:32,418 INFO
org.apache.hadoop.hbase.ipc.HBaseRpcMetrics:
Initializing RPC Metrics with hostName=HRegionServer, port=60020
2009-11-11 08:49:32,629 INFO
org.apache.hadoop.hbase.regionserver.MemStoreFlusher:
globalMemStoreLimit=398.7m, globalMemStoreLimitLowMark=249.2m,
maxHeap=996.8m
2009-11-11 08:49:32,641 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Runs every
10000000ms
2009-11-11 08:49:32,749 INFO org.apache.zookeeper.ZooKeeper: Client
environment:zookeeper.version=3.2.1-808558, built on 08/27/2009
18:48 GMT
2009-11-11 08:49:32,749 INFO org.apache.zookeeper.ZooKeeper: Client
environment:host.name=chanel.local
2009-11-11 08:49:32,749 INFO org.apache.zookeeper.ZooKeeper: Client
environment:java.version=1.6.0_16
2009-11-11 08:49:32,749 INFO org.apache.zookeeper.ZooKeeper: Client
environment:java.vendor=Sun Microsystems Inc.
2009-11-11 08:49:32,749 INFO org.apache.zookeeper.ZooKeeper: Client
environment:java.home=/usr/lib/jvm/java-6-sun-1.6.0.16/jre
2009-11-11 08:49:32,749 INFO org.apache.zookeeper.ZooKeeper: Client
environment:java.class.path=/opt/hadoop/hbase-0.20.1/bin/../conf:/
usr/lib/jvm/java-6-sun/lib/tools.jar:/opt/hadoop/hbase-0.20.1/bin/..:
2009-11-11 08:49:32,749 INFO org.apache.zookeeper.ZooKeeper: Client
environment:java.library.path=/opt/hadoop/hbase-0.20.1/bin/../lib/
native/Linux-i386-32
2009-11-11 08:49:32,749 INFO org.apache.zookeeper.ZooKeeper: Client
environment:java.io.tmpdir=/tmp
2009-11-11 08:49:32,749 INFO org.apache.zookeeper.ZooKeeper: Client
environment:java.compiler=<NA>
2009-11-11 08:49:32,749 INFO org.apache.zookeeper.ZooKeeper: Client
environment:os.name=Linux
2009-11-11 08:49:32,750 INFO org.apache.zookeeper.ZooKeeper: Client
environment:os.arch=i386
2009-11-11 08:49:32,750 INFO org.apache.zookeeper.ZooKeeper: Client
environment:os.version=2.6.24-24-generic
2009-11-11 08:49:32,750 INFO org.apache.zookeeper.ZooKeeper: Client
environment:user.name=hadoop
2009-11-11 08:49:32,750 INFO org.apache.zookeeper.ZooKeeper: Client
environment:user.home=/home/hadoop
2009-11-11 08:49:32,750 INFO org.apache.zookeeper.ZooKeeper: Client
environment:user.dir=/opt/hadoop/hbase-0.20.1
2009-11-11 08:49:32,760 INFO org.apache.zookeeper.ZooKeeper:
Initiating
client connection,
connectString=crunch2:2181,chanel2:2181,chanel:2181,chris:
2181,crunch3:2181
sessionTimeout=60000 watcher=org.apa
2009-11-11 08:49:32,781 INFO org.apache.zookeeper.ClientCnxn:
zookeeper.disableAutoWatchReset is false
2009-11-11 08:49:32,800 INFO org.apache.zookeeper.ClientCnxn:
Attempting
connection to server crunch2/172.16.1.95:2181
2009-11-11 08:49:32,802 INFO org.apache.zookeeper.ClientCnxn: Priming
connection to java.nio.channels.SocketChannel[connected local=/
172.16.1.45:33873 remote=crunch2/172.16.1.95:2181]
2009-11-11 08:49:32,814 INFO org.apache.zookeeper.ClientCnxn: Server
connection successful
2009-11-11 08:49:32,829 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper
event,
state: SyncConnected, type: None, path: null
2009-11-11 08:49:32,973 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Telling master at
172.16.1.46:60000 that we are up
2009-11-11 08:49:33,509 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Master passed us
address
to use. Was=172.16.1.45:60020, Now=172.16.1.45
2009-11-11 08:49:34,090 INFO
org.apache.hadoop.hbase.regionserver.HLog: HLog
configuration: blocksize=67108864, rollsize=63753420, enabled=true,
flushlogentries=100, optionallogflushinternal=10000ms
2009-11-11 08:49:34,281 INFO
org.apache.hadoop.hbase.regionserver.HLog: New
hlog /hbase/.logs/chanel.local,60020,1257947373328/hlog.dat.
1257947374090
2009-11-11 08:49:34,319 INFO org.apache.hadoop.metrics.jvm.JvmMetrics:
Initializing JVM Metrics with processName=RegionServer,
sessionId=regionserver/172.16.1.45:60020
2009-11-11 08:49:34,350 INFO
org.apache.hadoop.hbase.regionserver.metrics.RegionServerMetrics:
Initialized
2009-11-11 08:49:34,723 INFO org.apache.hadoop.http.HttpServer: Port
returned by webServer.getConnectors()[0].getLocalPort() before open
() is -1.
Opening the listener on 60030
2009-11-11 08:49:34,724 INFO org.apache.hadoop.http.HttpServer:
listener.getLocalPort() returned 60030
webServer.getConnectors()[0].getLocalPort() returned 60030
2009-11-11 08:49:34,724 INFO org.apache.hadoop.http.HttpServer:
Jetty bound
to port 60030
2009-11-11 08:49:35,673 INFO org.apache.hadoop.ipc.HBaseServer: IPC
Server
Responder: starting
2009-11-11 08:49:35,674 INFO org.apache.hadoop.ipc.HBaseServer: IPC
Server
listener on 60020: starting
2009-11-11 08:49:35,675 INFO org.apache.hadoop.ipc.HBaseServer: IPC
Server
handler 0 on 60020: starting
2009-11-11 08:49:35,675 INFO org.apache.hadoop.ipc.HBaseServer: IPC
Server
handler 1 on 60020: starting
2009-11-11 08:49:35,675 INFO org.apache.hadoop.ipc.HBaseServer: IPC
Server
handler 2 on 60020: starting
2009-11-11 08:49:35,675 INFO org.apache.hadoop.ipc.HBaseServer: IPC
Server
handler 3 on 60020: starting
2009-11-11 08:49:35,676 INFO org.apache.hadoop.ipc.HBaseServer: IPC
Server
handler 4 on 60020: starting
2009-11-11 08:49:35,676 INFO org.apache.hadoop.ipc.HBaseServer: IPC
Server
handler 5 on 60020: starting
2009-11-11 08:49:35,676 INFO org.apache.hadoop.ipc.HBaseServer: IPC
Server
handler 6 on 60020: starting
2009-11-11 08:49:35,676 INFO org.apache.hadoop.ipc.HBaseServer: IPC
Server
handler 7 on 60020: starting
2009-11-11 08:49:35,676 INFO org.apache.hadoop.ipc.HBaseServer: IPC
Server
handler 8 on 60020: starting
2009-11-11 08:49:35,676 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: HRegionServer
started
at: 172.16.1.45:60020
2009-11-11 08:49:35,677 INFO org.apache.hadoop.ipc.HBaseServer: IPC
Server
handler 9 on 60020: starting
2009-11-11 08:49:35,694 INFO
org.apache.hadoop.hbase.regionserver.StoreFile:
Allocating LruBlockCache with maximum size 199.4m
2009-11-11 08:49:38,798 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
.META.,,1
2009-11-11 08:49:38,811 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Worker:
MSG_REGION_OPEN:
.META.,,1
2009-11-11 08:49:38,932 INFO
org.apache.hadoop.hbase.regionserver.HRegion:
region .META.,,1/1028785192 available; sequence id is 0
The only difference between that log file and the rest is that the
others
don't have
2009-11-11 08:49:38,798 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
.META.,,1
2009-11-11 08:49:38,811 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Worker:
MSG_REGION_OPEN:
.META.,,1
2009-11-11 08:49:38,932 INFO
org.apache.hadoop.hbase.regionserver.HRegion:
region .META.,,1/1028785192 available; sequence id is 0
except M5 which has it and says id is 3....if you want me to paste
all the
logs, I can...didn't think it was necessary.
hbase(main):001:0> zk_dump
HBase tree in ZooKeeper is rooted at /hbase
Cluster up? true
In safe mode? false
Master address: 172.16.1.46:60000
Region server holding ROOT: 172.16.1.96:60020
Region servers:
- 172.16.1.95:60020
- 172.16.1.96:60020
- 172.16.1.83:60020
- 172.16.1.45:60020
hbase(main):002:0> status 'simple'
4 live servers
crunch3.local:60020 1257928281006
requests=0, regions=1, usedHeap=31, maxHeap=998
chris.local:60020 1257947367197
requests=0, regions=0, usedHeap=29, maxHeap=996
crunch2.local:60020 1257947370011
requests=0, regions=0, usedHeap=29, maxHeap=998
chanel.local:60020 1257947373328
requests=0, regions=1, usedHeap=31, maxHeap=996
0 dead servers
I started and stopped Hbase a couple times without deleting the HDFS
copy,
and everything seems to work fine....was that it??? What caused the
regionservers to orphan?
On Wed, Nov 11, 2009 at 5:17 AM, Lars George <[email protected]>
wrote:
Hi Chris,
As Tatsuya says, but I would expect you have to "kill" them as they
are in
a state where a daemon stop may not work as they will not be
perceptible to
that event yet.
I saw the same as Tatsuya and assumed you have an issue with your
quorum
setting on the slaves. I have not followed the whole discussion in
all the
details so let me ask you how you configured it?
I have personally done a symlink of my standalone zoo.cfg from the
zookeeper/conf to the hbase/conf directory. You could also have set
the
quorum servers in the hbase-site.xml as per the "Getting started"
guide.
If you have done one of those ways already then make sure all ZK
servers
are up and can be reached from the region servers.
Lars
Tatsuya Kawano schrieb:
Hi Chris, and thanks Lars for help.
OK. So "jstack 22200" shows your region server is trying to finish
starting up, but stuck in a middle when try to get IP address of the
master from ZooKeeper.
===========================================================
"main" prio=10 tid=0x0805a800 nid=0x56d2 waiting on condition
[0xb72f2000]
java.lang.Thread.State: TIMED_WAITING (sleeping)
at java.lang.Thread.sleep(Native Method)
at org.apache.hadoop.hbase.util.Sleeper.sleep(Sleeper.java:74)
at org.apache.hadoop.hbase.util.Sleeper.sleep(Sleeper.java:51)
at
org.apache.hadoop.hbase.regionserver.HRegionServer.watchMasterAddress(
HRegionServer.java:387)
at
org.apache.hadoop.hbase.regionserver.HRegionServer.reinitializeZooKeeper(
HRegionServer.java:315)
at
org.apache.hadoop.hbase.regionserver.HRegionServer.reinitialize
(HRegionServer.java:306)
at
org.apache.hadoop.hbase.regionserver.HRegionServer.<init>
(HRegionServer.java:276)
===========================================================
I still need to see the regionserver logs to figure out why this is
happening.
Also,
This is what I see when I run start-hbase.sh -- I can ssh into
any of the
boxes with no password just fine, it just gives me a weird first
time
host
message...we get the same thing when we start up hadoop.
...
crunch2: regionserver running as process 6950. Stop it first.
chanel: regionserver running as process 22200. Stop it first.
crunch3: regionserver running as process 28962. Stop it first.
chris: regionserver running as process 28719. Stop it first.
This "Stop it first" message means your region servers didn't stop
when you ran stop-hbase.sh. The master couldn't locate those region
servers so it couldn't tell them to shutdown. This is why you've got
those orphan region servers. So until we finish setting your HBase
cluster up, you'll have to stop those region servers by hand.
To do this, ssh to M2 -- M5, and type the following command:
${HBASE_HOME}/bin/hbase-daemon.sh stop regionserver
Then jps again to make sure HRegionServer doesn't exist. If the
above
command doesn't work, you can use Unix "kill" command. Then ssh to
M1,
run stop-hbase.sh to stop the master and ZooKeepers.
It's still a mystery you don't have regionserver logs while you have
zookeeper logs. Maybe those orphan region servers was the reason? I
don't know, but you can give it another try after stopping them. So,
try to stop whole HBase / ZooKeeper process by above way, then run
start-hbase.sh once again. If you can get the regionserver log to
us,
that would be great.
Thanks,