Re: HBase 0.20.1 Distributed Install Problems

Chris Bates Wed, 11 Nov 2009 10:28:14 -0800

>From the complete newbie perspective, there is a tendency to rush through
the "Getting Started" guide to get HBase up and running quickly to start
playing with it.


Assuming the issues I ran into are relatively common for distributed
installs, I would say the main lessons I took away that weren't explicit in
the "Getting Started" guide are the following:

1)  Networking can be problematic.  Host settings are important.
2)  The distinction between ZooKeepers and RegionServers is important.
3)  How to debug, or what to look for in the logs  (zk_dump, status
'simple', prompts on start up, common hang-up messages in the logs)
4)  More examples of correct configurations and what outputs of those debug
routines should correctly say

Obviously you can't tackle every edge case.  Typically when I encounter an
error I just Google it and find a random blog post indicating what was done
to fix it.  However the community seems to be young and growing still.


On Wed, Nov 11, 2009 at 1:02 PM, stack <[email protected]> wrote:

> If there is anything you thing we should improve in our 'Getting Started',
> please say so.
> St.Ack
>
> On Wed, Nov 11, 2009 at 9:57 AM, Chris Bates <
> [email protected]> wrote:
>
> > Ok, thats what I was wondering. I started reading through the Hbase
> chapter
> > in OReilly's Hadoop book.
> >
> > All -- thank you for your help!  I hope to do a short write up in a blog
> I
> > just started.  Hopefully this discussion will be helpful to other new
> > users.
> >
> > On Wed, Nov 11, 2009 at 12:41 PM, Lars George <[email protected]>
> wrote:
> >
> > > Hi Chris,
> > >
> > > Please note that only one server gets to host the .META. table and
> > another
> > > the -ROOT- table. That is it. All user regions are then equally
> > distributed
> > > across all region servers.
> > >
> > > So it sounds like you are all set. Not sure what happened initially
> with
> > > the Zookeeper hang up. Keep an eye on it.
> > >
> > > Lars
> > >
> > >
> > > On Nov 11, 2009, at 15:12, Chris Bates <
> > [email protected]>
> > > wrote:
> > >
> > >  @Lars -- We're configured like in the "Getting Started" guide -- tried
> > to
> > >> follow the instructions verbatim.
> > >>
> > >> Ok. Thanks for the tip. I'm not sure what causes the orphan behavior,
> > but
> > >> thats good to know.  I'll also know what to look for in the jstack.
> >  After
> > >> stopping the region servers individually, then doing a stop / restart,
> I
> > >> get
> > >> regionserver logs
> > >>
> > >> Here is M2 (chanel)
> > >> Wed Nov 11 08:49:30 EST 2009 Starting regionserver on chanel
> > >> ulimit -n 1024
> > >> 2009-11-11 08:49:32,081 INFO
> > >> org.apache.hadoop.hbase.regionserver.HRegionServer:
> > >> vmInputArguments=[-Xmx1000m, -XX:+HeapDumpOnOutOfMemoryError,
> > >> -XX:+UseConcMarkSweepGC, -XX:+CMSIncrementalMode, -Dhbase.lo
> > >> 2009-11-11 08:49:32,229 INFO
> > >> org.apache.hadoop.hbase.regionserver.HRegionServer: My address is
> > >> chanel.local:60020
> > >> 2009-11-11 08:49:32,418 INFO
> > org.apache.hadoop.hbase.ipc.HBaseRpcMetrics:
> > >> Initializing RPC Metrics with hostName=HRegionServer, port=60020
> > >> 2009-11-11 08:49:32,629 INFO
> > >> org.apache.hadoop.hbase.regionserver.MemStoreFlusher:
> > >> globalMemStoreLimit=398.7m, globalMemStoreLimitLowMark=249.2m,
> > >> maxHeap=996.8m
> > >> 2009-11-11 08:49:32,641 INFO
> > >> org.apache.hadoop.hbase.regionserver.HRegionServer: Runs every
> > 10000000ms
> > >> 2009-11-11 08:49:32,749 INFO org.apache.zookeeper.ZooKeeper: Client
> > >> environment:zookeeper.version=3.2.1-808558, built on 08/27/2009 18:48
> > GMT
> > >> 2009-11-11 08:49:32,749 INFO org.apache.zookeeper.ZooKeeper: Client
> > >> environment:host.name=chanel.local
> > >> 2009-11-11 08:49:32,749 INFO org.apache.zookeeper.ZooKeeper: Client
> > >> environment:java.version=1.6.0_16
> > >> 2009-11-11 08:49:32,749 INFO org.apache.zookeeper.ZooKeeper: Client
> > >> environment:java.vendor=Sun Microsystems Inc.
> > >> 2009-11-11 08:49:32,749 INFO org.apache.zookeeper.ZooKeeper: Client
> > >> environment:java.home=/usr/lib/jvm/java-6-sun-1.6.0.16/jre
> > >> 2009-11-11 08:49:32,749 INFO org.apache.zookeeper.ZooKeeper: Client
> > >>
> > >>
> >
> environment:java.class.path=/opt/hadoop/hbase-0.20.1/bin/../conf:/usr/lib/jvm/java-6-sun/lib/tools.jar:/opt/hadoop/hbase-0.20.1/bin/..:
> > >> 2009-11-11 08:49:32,749 INFO org.apache.zookeeper.ZooKeeper: Client
> > >>
> > >>
> >
> environment:java.library.path=/opt/hadoop/hbase-0.20.1/bin/../lib/native/Linux-i386-32
> > >> 2009-11-11 08:49:32,749 INFO org.apache.zookeeper.ZooKeeper: Client
> > >> environment:java.io.tmpdir=/tmp
> > >> 2009-11-11 08:49:32,749 INFO org.apache.zookeeper.ZooKeeper: Client
> > >> environment:java.compiler=<NA>
> > >> 2009-11-11 08:49:32,749 INFO org.apache.zookeeper.ZooKeeper: Client
> > >> environment:os.name=Linux
> > >> 2009-11-11 08:49:32,750 INFO org.apache.zookeeper.ZooKeeper: Client
> > >> environment:os.arch=i386
> > >> 2009-11-11 08:49:32,750 INFO org.apache.zookeeper.ZooKeeper: Client
> > >> environment:os.version=2.6.24-24-generic
> > >> 2009-11-11 08:49:32,750 INFO org.apache.zookeeper.ZooKeeper: Client
> > >> environment:user.name=hadoop
> > >> 2009-11-11 08:49:32,750 INFO org.apache.zookeeper.ZooKeeper: Client
> > >> environment:user.home=/home/hadoop
> > >> 2009-11-11 08:49:32,750 INFO org.apache.zookeeper.ZooKeeper: Client
> > >> environment:user.dir=/opt/hadoop/hbase-0.20.1
> > >> 2009-11-11 08:49:32,760 INFO org.apache.zookeeper.ZooKeeper:
> Initiating
> > >> client connection,
> > >>
> > >>
> >
> connectString=crunch2:2181,chanel2:2181,chanel:2181,chris:2181,crunch3:2181
> > >> sessionTimeout=60000 watcher=org.apa
> > >> 2009-11-11 08:49:32,781 INFO org.apache.zookeeper.ClientCnxn:
> > >> zookeeper.disableAutoWatchReset is false
> > >> 2009-11-11 08:49:32,800 INFO org.apache.zookeeper.ClientCnxn:
> Attempting
> > >> connection to server crunch2/172.16.1.95:2181
> > >> 2009-11-11 08:49:32,802 INFO org.apache.zookeeper.ClientCnxn: Priming
> > >> connection to java.nio.channels.SocketChannel[connected local=/
> > >> 172.16.1.45:33873 remote=crunch2/172.16.1.95:2181]
> > >> 2009-11-11 08:49:32,814 INFO org.apache.zookeeper.ClientCnxn: Server
> > >> connection successful
> > >> 2009-11-11 08:49:32,829 INFO
> > >> org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper
> event,
> > >> state: SyncConnected, type: None, path: null
> > >> 2009-11-11 08:49:32,973 INFO
> > >> org.apache.hadoop.hbase.regionserver.HRegionServer: Telling master at
> > >> 172.16.1.46:60000 that we are up
> > >> 2009-11-11 08:49:33,509 INFO
> > >> org.apache.hadoop.hbase.regionserver.HRegionServer: Master passed us
> > >> address
> > >> to use. Was=172.16.1.45:60020, Now=172.16.1.45
> > >> 2009-11-11 08:49:34,090 INFO
> org.apache.hadoop.hbase.regionserver.HLog:
> > >> HLog
> > >> configuration: blocksize=67108864, rollsize=63753420, enabled=true,
> > >> flushlogentries=100, optionallogflushinternal=10000ms
> > >> 2009-11-11 08:49:34,281 INFO
> org.apache.hadoop.hbase.regionserver.HLog:
> > >> New
> > >> hlog
> > /hbase/.logs/chanel.local,60020,1257947373328/hlog.dat.1257947374090
> > >> 2009-11-11 08:49:34,319 INFO org.apache.hadoop.metrics.jvm.JvmMetrics:
> > >> Initializing JVM Metrics with processName=RegionServer,
> > >> sessionId=regionserver/172.16.1.45:60020
> > >> 2009-11-11 08:49:34,350 INFO
> > >> org.apache.hadoop.hbase.regionserver.metrics.RegionServerMetrics:
> > >> Initialized
> > >> 2009-11-11 08:49:34,723 INFO org.apache.hadoop.http.HttpServer: Port
> > >> returned by webServer.getConnectors()[0].getLocalPort() before open()
> is
> > >> -1.
> > >> Opening the listener on 60030
> > >> 2009-11-11 08:49:34,724 INFO org.apache.hadoop.http.HttpServer:
> > >> listener.getLocalPort() returned 60030
> > >> webServer.getConnectors()[0].getLocalPort() returned 60030
> > >> 2009-11-11 08:49:34,724 INFO org.apache.hadoop.http.HttpServer: Jetty
> > >> bound
> > >> to port 60030
> > >> 2009-11-11 08:49:35,673 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> > Server
> > >> Responder: starting
> > >> 2009-11-11 08:49:35,674 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> > Server
> > >> listener on 60020: starting
> > >> 2009-11-11 08:49:35,675 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> > Server
> > >> handler 0 on 60020: starting
> > >> 2009-11-11 08:49:35,675 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> > Server
> > >> handler 1 on 60020: starting
> > >> 2009-11-11 08:49:35,675 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> > Server
> > >> handler 2 on 60020: starting
> > >> 2009-11-11 08:49:35,675 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> > Server
> > >> handler 3 on 60020: starting
> > >> 2009-11-11 08:49:35,676 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> > Server
> > >> handler 4 on 60020: starting
> > >> 2009-11-11 08:49:35,676 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> > Server
> > >> handler 5 on 60020: starting
> > >> 2009-11-11 08:49:35,676 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> > Server
> > >> handler 6 on 60020: starting
> > >> 2009-11-11 08:49:35,676 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> > Server
> > >> handler 7 on 60020: starting
> > >> 2009-11-11 08:49:35,676 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> > Server
> > >> handler 8 on 60020: starting
> > >> 2009-11-11 08:49:35,676 INFO
> > >> org.apache.hadoop.hbase.regionserver.HRegionServer: HRegionServer
> > started
> > >> at: 172.16.1.45:60020
> > >> 2009-11-11 08:49:35,677 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> > Server
> > >> handler 9 on 60020: starting
> > >> 2009-11-11 08:49:35,694 INFO
> > >> org.apache.hadoop.hbase.regionserver.StoreFile:
> > >> Allocating LruBlockCache with maximum size 199.4m
> > >> 2009-11-11 08:49:38,798 INFO
> > >> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
> > >> .META.,,1
> > >> 2009-11-11 08:49:38,811 INFO
> > >> org.apache.hadoop.hbase.regionserver.HRegionServer: Worker:
> > >> MSG_REGION_OPEN:
> > >> .META.,,1
> > >> 2009-11-11 08:49:38,932 INFO
> > org.apache.hadoop.hbase.regionserver.HRegion:
> > >> region .META.,,1/1028785192 available; sequence id is 0
> > >>
> > >> The only difference between that log file and the rest is that the
> > others
> > >> don't have
> > >> 2009-11-11 08:49:38,798 INFO
> > >> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
> > >> .META.,,1
> > >> 2009-11-11 08:49:38,811 INFO
> > >> org.apache.hadoop.hbase.regionserver.HRegionServer: Worker:
> > >> MSG_REGION_OPEN:
> > >> .META.,,1
> > >> 2009-11-11 08:49:38,932 INFO
> > org.apache.hadoop.hbase.regionserver.HRegion:
> > >> region .META.,,1/1028785192 available; sequence id is 0
> > >>
> > >> except M5 which has it and says id is 3....if you want me to paste all
> > the
> > >> logs, I can...didn't think it was necessary.
> > >>
> > >> hbase(main):001:0> zk_dump
> > >>
> > >> HBase tree in ZooKeeper is rooted at /hbase
> > >>  Cluster up? true
> > >>  In safe mode? false
> > >>  Master address: 172.16.1.46:60000
> > >>  Region server holding ROOT: 172.16.1.96:60020
> > >>  Region servers:
> > >>   - 172.16.1.95:60020
> > >>   - 172.16.1.96:60020
> > >>   - 172.16.1.83:60020
> > >>   - 172.16.1.45:60020
> > >> hbase(main):002:0> status 'simple'
> > >> 4 live servers
> > >>   crunch3.local:60020 1257928281006
> > >>       requests=0, regions=1, usedHeap=31, maxHeap=998
> > >>   chris.local:60020 1257947367197
> > >>       requests=0, regions=0, usedHeap=29, maxHeap=996
> > >>   crunch2.local:60020 1257947370011
> > >>       requests=0, regions=0, usedHeap=29, maxHeap=998
> > >>   chanel.local:60020 1257947373328
> > >>       requests=0, regions=1, usedHeap=31, maxHeap=996
> > >> 0 dead servers
> > >>
> > >>
> > >> I started and stopped Hbase a couple times without deleting the HDFS
> > copy,
> > >> and everything seems to work fine....was that it??? What caused the
> > >> regionservers to orphan?
> > >>
> > >> On Wed, Nov 11, 2009 at 5:17 AM, Lars George <[email protected]>
> > wrote:
> > >>
> > >>  Hi Chris,
> > >>>
> > >>> As Tatsuya says, but I would expect you have to "kill" them as they
> are
> > >>> in
> > >>> a state where a daemon stop may not work as they will not be
> > perceptible
> > >>> to
> > >>> that event yet.
> > >>>
> > >>> I saw the same as Tatsuya and assumed you have an issue with your
> > quorum
> > >>> setting on the slaves. I have not followed the whole discussion in
> all
> > >>> the
> > >>> details so let me ask you how you configured it?
> > >>>
> > >>> I have personally done a symlink of my standalone zoo.cfg from the
> > >>> zookeeper/conf to the hbase/conf directory. You could also have set
> the
> > >>> quorum servers in the hbase-site.xml as per the "Getting started"
> > guide.
> > >>>
> > >>> If you have done one of those ways already then make sure all ZK
> > servers
> > >>> are up and can be reached from the region servers.
> > >>>
> > >>> Lars
> > >>>
> > >>> Tatsuya Kawano schrieb:
> > >>>
> > >>> Hi Chris, and thanks Lars for help.
> > >>>
> > >>>>
> > >>>> OK. So "jstack 22200" shows your region server is trying to finish
> > >>>> starting up, but stuck in a middle when try to get IP address of the
> > >>>> master from ZooKeeper.
> > >>>>
> > >>>> ===========================================================
> > >>>> "main" prio=10 tid=0x0805a800 nid=0x56d2 waiting on condition
> > >>>> [0xb72f2000]
> > >>>> java.lang.Thread.State: TIMED_WAITING (sleeping)
> > >>>> at java.lang.Thread.sleep(Native Method)
> > >>>> at org.apache.hadoop.hbase.util.Sleeper.sleep(Sleeper.java:74)
> > >>>> at org.apache.hadoop.hbase.util.Sleeper.sleep(Sleeper.java:51)
> > >>>> at
> > >>>>
> > >>>>
> > >>>>
> >
> org.apache.hadoop.hbase.regionserver.HRegionServer.watchMasterAddress(HRegionServer.java:387)
> > >>>> at
> > >>>>
> > >>>>
> > >>>>
> >
> org.apache.hadoop.hbase.regionserver.HRegionServer.reinitializeZooKeeper(HRegionServer.java:315)
> > >>>> at
> > >>>>
> > >>>>
> > >>>>
> >
> org.apache.hadoop.hbase.regionserver.HRegionServer.reinitialize(HRegionServer.java:306)
> > >>>> at
> > >>>>
> > >>>>
> > >>>>
> >
> org.apache.hadoop.hbase.regionserver.HRegionServer.<init>(HRegionServer.java:276)
> > >>>> ===========================================================
> > >>>>
> > >>>> I still need to see the regionserver logs to figure out why this is
> > >>>> happening.
> > >>>>
> > >>>>
> > >>>> Also,
> > >>>>
> > >>>>
> > >>>>
> > >>>>  This is what I see when I run start-hbase.sh -- I can ssh into any
> of
> > >>>>> the
> > >>>>> boxes with no password just fine, it just gives me a weird first
> time
> > >>>>> host
> > >>>>> message...we get the same thing when we start up hadoop.
> > >>>>>
> > >>>>>
> > >>>>>  ...
> > >>>>
> > >>>>
> > >>>>  crunch2: regionserver running as process 6950. Stop it first.
> > >>>>> chanel: regionserver running as process 22200. Stop it first.
> > >>>>> crunch3: regionserver running as process 28962. Stop it first.
> > >>>>> chris: regionserver running as process 28719. Stop it first.
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>> This "Stop it first" message means your region servers didn't stop
> > >>>> when you ran stop-hbase.sh. The master couldn't locate those region
> > >>>> servers so it couldn't tell them to shutdown. This is why you've got
> > >>>> those orphan region servers. So until we finish setting your HBase
> > >>>> cluster up, you'll have to stop those region servers by hand.
> > >>>>
> > >>>> To do this, ssh to M2 -- M5, and type the following command:
> > >>>> ${HBASE_HOME}/bin/hbase-daemon.sh stop regionserver
> > >>>>
> > >>>> Then jps again to make sure HRegionServer doesn't exist. If the
> above
> > >>>> command doesn't work, you can use Unix "kill" command. Then ssh to
> M1,
> > >>>> run stop-hbase.sh to stop the master and ZooKeepers.
> > >>>>
> > >>>>
> > >>>> It's still a mystery you don't have regionserver logs while you have
> > >>>> zookeeper logs. Maybe those orphan region servers was the reason?  I
> > >>>> don't know, but you can give it another try after stopping them. So,
> > >>>> try to stop whole HBase / ZooKeeper process by above way, then run
> > >>>> start-hbase.sh once again. If you can get the regionserver log to
> us,
> > >>>> that would be great.
> > >>>>
> > >>>> Thanks,
> > >>>>
> > >>>>
> > >>>>
> > >>>>
> > >>>
> >
>

Re: HBase 0.20.1 Distributed Install Problems

Reply via email to