As I was saying on IRC to the other guy working with you, when the region server crashed because of a FileNotFoundException, there were 65 write-ahead logs. The master in the version you are using, 0.19.3, only replays logs when it notices a region server is dead; but since you only had one region server and it held both -ROOT- and .META., it seems the master was stuck on this:

  2009-10-07 01:29:16,049 WARN org.apache.hadoop.hbase.master.BaseScanner: Scan ROOT region
  java.net.ConnectException: Call to /10.244.9.171:60020 failed on connection exception:
  java.net.ConnectException: Connection refused

for the whole day until it was restarted. In 0.20, logs are replayed on a fresh boot. As to why the FileNotFound happened, it's hard to tell...

J-D

On Wed, Oct 7, 2009 at 3:59 PM, Ananth T. Sarathy <[email protected]> wrote:

There are all sorts of things in the bucket when I explore it.

We are going to set up 0.20.0 and point it to a new bucket. Any tips I should know about to avoid something like this, or data loss?

Ananth T Sarathy

On Wed, Oct 7, 2009 at 3:55 PM, Andrew Purtell <[email protected]> wrote:

One possibility is that you loaded data, but not enough to cause a flush; then there appeared to be some network-related problem, and you killed the regionservers hard (-9?) while the filesystem was unavailable. This unfortunate string of circumstances would cause data loss. However, you said the cluster had been running for 6 days, so a major compaction (which runs once every 24 hours) would have flushed and persisted data. Is there anything in the bucket? (hadoop fs -lsr ...)

0.20 is definitely the way to go, for a number of reasons.

  - Andy

From: Ananth T. Sarathy <[email protected]>
To: [email protected]
Sent: Wed, October 7, 2009 12:46:24 PM
Subject: Re: hbase on s3 and safemode

Thanks for all the help.

  <property>
    <name>hbase.rootdir</name>
    <value>s3://hbase2.s3.amazonaws.com:80/hbasedata</value>
    <description>The directory shared by region servers.
    Should be fully-qualified to include the filesystem to use.
    E.g: hdfs://NAMENODE_SERVER:PORT/HBASE_ROOTDIR
    </description>
  </property>

That's in our hbase-site.xml.

We had been running for about 6 days with no issues.
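Andy's "is there anything in the bucket?" check can be run as a short script; a sketch, assuming `hadoop` is on the PATH and using the hbase.rootdir value from the hbase-site.xml quoted in this thread:

```shell
# rootdir taken from the hbase-site.xml in this thread; adjust for your cluster.
HBASE_ROOTDIR="s3://hbase2.s3.amazonaws.com:80/hbasedata"

if command -v hadoop >/dev/null 2>&1; then
  # Recursive listing: flushed data shows up as store files under
  # per-table/per-region directories; write-ahead logs live elsewhere
  # under the root. An empty listing means nothing was ever persisted.
  hadoop fs -lsr "$HBASE_ROOTDIR"
else
  # Outside a cluster, just show the command that would be run.
  echo "hadoop not on PATH; would run: hadoop fs -lsr $HBASE_ROOTDIR"
fi
```

The same listing works against an hdfs:// rootdir, which is the more common case.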
At 1:30 this morning it just crapped out.

We are thinking about just moving to 0.20.0 and starting over.

Ananth T Sarathy

On Wed, Oct 7, 2009 at 3:41 PM, Andrew Purtell <[email protected]> wrote:

Did you edit hbase-site.xml so that the HBase data directories are not in /tmp? Maybe a silly question... but it happens sometimes.

If your hbase.rootdir points to an HDFS filesystem, what does 'hadoop fs -lsr hdfs://namenode:port/path/to/hbase/root' show?

You said this was working before? Did you shut down and bring HBase back up before without trouble? Is this a new install?

  - Andy

From: Ananth T. Sarathy <[email protected]>
To: [email protected]
Sent: Wed, October 7, 2009 12:34:28 PM
Subject: Re: hbase on s3 and safemode

OK, so we finally got the regionserver to come up (we killed all the processes on the box and finally the regionserver came back up), but when it did, there was no data in our tables, though the tables themselves are still there. Any ideas where the data went or how I can get it back?

Ananth T Sarathy

On Wed, Oct 7, 2009 at 2:46 PM, Andrew Purtell <[email protected]> wrote:

One option is to add SYSV init scripts that on boot take the following equivalent actions:

  hbase-daemon.sh start zookeeper
  hbase-daemon.sh start master
  hbase-daemon.sh start regionserver

Set the respective init scripts to run according to host role.

This presumes you have also added init scripts that start up the DFS daemons wherever they should be, equivalents to the following:

  hadoop-daemon.sh start namenode
  hadoop-daemon.sh start datanode
  hadoop-daemon.sh start secondarynamenode

You can start everything up all at once. The respective daemons will wait for each other's services to become available. Ignore ZK noise in the logs about connection difficulties unless it persists for minutes.

If you want to try out the Cloudera Hadoop distribution for 0.20, they have RPMs that will take care of all of this for you, and we have an RPM for that platform that I can provide you.

Do also check your network configuration.

  - Andy

From: Ananth T. Sarathy <[email protected]>
To: [email protected]
Sent: Wed, October 7, 2009 11:36:22 AM
Subject: Re: hbase on s3 and safemode

Is there a way to start my regionservers individually, besides start-hbase.sh?

Ananth T Sarathy

On Wed, Oct 7, 2009 at 2:31 PM, Andrew Purtell <[email protected]> wrote:

HBase won't leave safe mode if the regionservers cannot contact the master. So the question is why your regionservers cannot contact the master. If the regionserver processes are confirmed running, then it's most likely a firewall or AWS Security Groups configuration problem.

status was a shell command added in 0.20, IIRC.

  - Andy

From: Ananth T. Sarathy <[email protected]>
To: [email protected]
Sent: Wed, October 7, 2009 11:04:03 AM
Subject: Re: hbase on s3 and safemode

I suppose we need to, but for now it's kind of a pain because we need to coordinate our clients.

But the problem is why it was working and then all of a sudden got stuck in safemode, and how we can get it back up.
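Andy's per-role startup advice can be sketched as a small helper; hypothetical, in that a real SYSV init script would execute the daemon commands rather than echo them, and the role would come from per-host configuration rather than an environment variable:

```shell
#!/bin/sh
# Sketch of a role-based startup helper, following the hbase-daemon.sh /
# hadoop-daemon.sh commands listed in this thread. The role names and the
# HOST_ROLE variable are assumptions for illustration.
start_role() {
  case "$1" in
    master)
      echo "hadoop-daemon.sh start namenode"
      echo "hadoop-daemon.sh start secondarynamenode"
      echo "hbase-daemon.sh start zookeeper"
      echo "hbase-daemon.sh start master"
      ;;
    regionserver)
      echo "hadoop-daemon.sh start datanode"
      echo "hbase-daemon.sh start regionserver"
      ;;
    *)
      echo "unknown role: $1" >&2
      return 1
      ;;
  esac
}

start_role "${HOST_ROLE:-master}"
```

Since the daemons wait for each other, the order within a role matters less than making sure every role's script runs on boot.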
Ananth T Sarathy

On Wed, Oct 7, 2009 at 1:58 PM, stack <[email protected]> wrote:

Can you update to 0.20.0? (Oodles of improvements.)

St.Ack

On Wed, Oct 7, 2009 at 10:56 AM, Ananth T. Sarathy <[email protected]> wrote:

I get an error:

  hbase(main):001:0> status "detailed"
  NoMethodError: undefined method `status' for #<Object:0x5585c0de>
    from (hbase):2
  hbase(main):002:0> status "detailed"
  NoMethodError: undefined method `status' for #<Object:0x5585c0de>
    from (hbase):3

We are running 0.19.3.

Ananth T Sarathy

On Wed, Oct 7, 2009 at 1:51 PM, stack <[email protected]> wrote:

Does this state persist even if you shut down HBase and ZK and restart?

In the shell, do:

  status "detailed"

At the top there is a section which lists regions in transition. Anything there?

St.Ack

On Wed, Oct 7, 2009 at 10:35 AM, Ananth T. Sarathy <[email protected]> wrote:

Here is the log since I started it...
  Wed Oct 7 13:27:26 EDT 2009 Starting master on ip-10-244-9-171
  ulimit -n 1024
  2009-10-07 13:27:26,404 INFO org.apache.hadoop.hbase.master.HMaster: vmName=Java HotSpot(TM) 64-Bit Server VM, vmVendor=Sun Microsystems Inc., vmVersion=14.2-b01
  2009-10-07 13:27:26,405 INFO org.apache.hadoop.hbase.master.HMaster: vmInputArguments=[-Xmx2000m, -XX:+HeapDumpOnOutOfMemoryError, -Djava.io.tmpdir=/mnt/tmp, -Dhbase.log.dir=/mnt/apps/hadoop/hbase/bin/../logs, -Dhbase.log.file=hbase-root-master-ip-10-244-9-171.log, -Dhbase.home.dir=/mnt/apps/hadoop/hbase/bin/.., -Dhbase.id.str=root, -Dhbase.root.logger=INFO,DRFA, -Djava.library.path=/mnt/apps/hadoop/hbase/bin/../lib/native/Linux-amd64-64]
  2009-10-07 13:27:27,525 INFO org.apache.hadoop.hbase.master.HMaster: Root region dir: s3://hbase2.s3.amazonaws.com:80/hbasedata/-ROOT-/70236052
  2009-10-07 13:27:27,751 INFO org.apache.hadoop.hbase.ipc.HBaseRpcMetrics: Initializing RPC Metrics with hostName=HMaster, port=60000
  2009-10-07 13:27:27,827 INFO org.apache.hadoop.hbase.master.HMaster: HMaster initialized on 10.244.9.171:60000
  2009-10-07 13:27:27,829 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=Master, sessionId=HMaster
  2009-10-07 13:27:27,830 INFO org.apache.hadoop.hbase.master.metrics.MasterMetrics: Initialized
  2009-10-07 13:27:27,932 INFO org.mortbay.util.Credential: Checking Resource aliases
  2009-10-07 13:27:27,936 INFO org.mortbay.http.HttpServer: Version Jetty/5.1.4
  2009-10-07 13:27:27,936 INFO org.mortbay.util.Container: Started HttpContext[/logs,/logs]
  2009-10-07 13:27:28,202 INFO org.mortbay.util.Container: Started org.mortbay.jetty.servlet.webapplicationhand...@3209fa8f
  2009-10-07 13:27:28,244 INFO org.mortbay.util.Container: Started WebApplicationContext[/static,/static]
  2009-10-07 13:27:28,361 INFO org.mortbay.util.Container: Started org.mortbay.jetty.servlet.webapplicationhand...@b0c0f66
  2009-10-07 13:27:28,364 INFO org.mortbay.util.Container: Started WebApplicationContext[/,/]
  2009-10-07 13:27:28,636 INFO org.mortbay.util.Container: Started org.mortbay.jetty.servlet.webapplicationhand...@3c2d7440
  2009-10-07 13:27:28,638 INFO org.mortbay.util.Container: Started WebApplicationContext[/api,rest]
  2009-10-07 13:27:28,639 INFO org.mortbay.http.SocketListener: Started SocketListener on 0.0.0.0:60010
  2009-10-07 13:27:28,639 INFO org.mortbay.util.Container: Started org.mortbay.jetty.ser...@28b301f2
  2009-10-07 13:27:28,640 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server Responder: starting
  2009-10-07 13:27:28,641 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server listener on 60000: starting
  2009-10-07 13:27:28,641 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 0 on 60000: starting
  2009-10-07 13:27:28,641 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 1 on 60000: starting
  2009-10-07 13:27:28,641 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 2 on 60000: starting
  2009-10-07 13:27:28,642 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 3 on 60000: starting
  2009-10-07 13:27:28,642 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 4 on 60000: starting
  2009-10-07 13:27:28,642 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 5 on 60000: starting
  2009-10-07 13:27:28,642 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 6 on 60000: starting
  2009-10-07 13:27:28,642 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 7 on 60000: starting
  2009-10-07 13:27:28,642 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 8 on 60000: starting
  2009-10-07 13:27:28,642 DEBUG org.apache.hadoop.hbase.master.HMaster: Started service threads
  2009-10-07 13:27:28,643 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 9 on 60000: starting
  2009-10-07 13:28:09,519 INFO org.apache.hadoop.hbase.master.RegionManager: in safe mode
  2009-10-07 13:28:11,542 INFO org.apache.hadoop.hbase.master.RegionManager: in safe mode
  2009-10-07 13:28:13,543 INFO org.apache.hadoop.hbase.master.RegionManager: in safe mode
  2009-10-07 13:28:15,545 INFO org.apache.hadoop.hbase.master.RegionManager: in safe mode
  2009-10-07 13:28:17,548 INFO org.apache.hadoop.hbase.master.RegionManager: in safe mode
  2009-10-07 13:28:19,555 INFO org.apache.hadoop.hbase.master.RegionManager: in safe mode
  2009-10-07 13:28:27,834 INFO org.apache.hadoop.hbase.master.BaseScanner: All 0 .META. region(s) scanned
  2009-10-07 13:29:27,832 INFO org.apache.hadoop.hbase.master.BaseScanner: All 0 .META. region(s) scanned
  2009-10-07 13:29:37,593 INFO org.apache.hadoop.hbase.master.RegionManager: in safe mode
  2009-10-07 13:30:27,834 INFO org.apache.hadoop.hbase.master.BaseScanner: All 0 .META. region(s) scanned
  2009-10-07 13:31:27,836 INFO org.apache.hadoop.hbase.master.BaseScanner: All 0 .META. region(s) scanned
  2009-10-07 13:32:27,838 INFO org.apache.hadoop.hbase.master.BaseScanner: All 0 .META. region(s) scanned
  2009-10-07 13:33:27,840 INFO org.apache.hadoop.hbase.master.BaseScanner: All 0 .META. region(s) scanned

Ananth T Sarathy

On Wed, Oct 7, 2009 at 1:20 PM, stack <[email protected]> wrote:

That's interesting to hear. Keep us posted.
HBase asks the filesystem if it is in safe mode and, if it is, parks itself. Here is code from the master:

  if (this.fs instanceof DistributedFileSystem) {
    // Make sure dfs is not in safe mode
    String message = "Waiting for dfs to exit safe mode...";
    while (((DistributedFileSystem) fs).setSafeMode(
        FSConstants.SafeModeAction.SAFEMODE_GET)) {
      LOG.info(message);
      try {
        Thread.sleep(this.threadWakeFrequency);
      } catch (InterruptedException e) {
        // continue
      }
    }
  }

Then there is HBase's own notion of safe mode. It will be in safe mode until it completes the initial scan of the catalog tables. The master keeps a flag in ZooKeeper while it is in safe mode so that regionservers are aware of the state:

  public boolean inSafeMode() {
    if (safeMode) {
      if (isInitialMetaScanComplete() && regionsInTransition.size() == 0 &&
          tellZooKeeperOutOfSafeMode()) {
        master.connection.unsetRootRegionLocation();
        safeMode = false;
        LOG.info("exiting safe mode");
      } else {
        LOG.info("in safe mode");
      }
    }
    return safeMode;
  }

Have you seen .META. and -ROOT- deploy to regionservers?
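The DFS-side flag that the master polls in the code above can also be queried from the command line with `hadoop dfsadmin`; a sketch, assuming `hadoop` is on the PATH and HDFS is the backing filesystem (it does not apply to an S3 rootdir, where there is no namenode safe mode):

```shell
# Query HDFS safe mode state; prints "Safe mode is ON" or "Safe mode is OFF".
SAFEMODE_QUERY="hadoop dfsadmin -safemode get"

# If HDFS is stuck in safe mode (e.g. after losing datanodes), it can be
# forced out -- use with care, since under-replicated blocks may be the cause:
SAFEMODE_LEAVE="hadoop dfsadmin -safemode leave"

if command -v hadoop >/dev/null 2>&1; then
  $SAFEMODE_QUERY
else
  # Outside a cluster, just show the command that would be run.
  echo "hadoop not on PATH; would run: $SAFEMODE_QUERY"
fi
```

HBase's own safe mode (the second code block) has no such switch; it clears on its own once the catalog scan completes and no regions are in transition.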
Have you seen those regions being scanned in the master log? (Enable DEBUG logging if it is not already enabled.)

Yours,
St.Ack

On Wed, Oct 7, 2009 at 10:06 AM, Ananth T. Sarathy <[email protected]> wrote:

We have been running HBase on an S3 filesystem. It's the HBase regionserver, not HDFS, since we are using S3. We haven't felt like it's been too slow, though the amount of data we are pushing isn't large enough to notice yet.

Ananth T Sarathy

On Wed, Oct 7, 2009 at 12:47 PM, stack <[email protected]> wrote:

HBase or HDFS is in safe mode. My guess is that it's the latter. Can you figure out from the HDFS logs why it won't leave safe mode? Usually under-replication or the loss of a large swath of the cluster will flip on the safe-mode switch.

Are you trying to run HBase on an S3 filesystem? An HBasista tried it in the past and, FYI, found it insufferably slow. Let us know how it goes for you.

Thanks,
St.Ack

On Wed, Oct 7, 2009 at 9:33 AM, Ananth T. Sarathy <[email protected]> wrote:

My regionserver has been stuck in safemode. What can I do to get it out of safemode?

Ananth T Sarathy

__________________________________________________
Do You Yahoo!?
Tired of spam? Yahoo! Mail has the best spam protection around
http://mail.yahoo.com
