Every handler thread, every reader, and the accept thread are all blocked waiting on a memstore flush. The handlers block first; the readers have a finite handoff queue, so they block next, and then the accept thread blocks as well.
But why isn't the memstore flushing? Do you have regionserver stats, i.e. how much global memstore RAM is in use? You can find that on the main page of the regionserver HTTP service, and also in the ganglia/file stats.

I haven't looked at the logs yet; I'm off to lunch now.

-ryan

On Sun, Feb 13, 2011 at 8:44 AM, Ted Yu <[email protected]> wrote:
> I had 3 consecutive successful runs processing 200GB data for each run
> before hitting timeout problem in the 4th run.
>
> The 5th run couldn't proceed because master complained:
>
> 2011-02-13 16:11:45,173 FATAL org.apache.hadoop.hbase.master.HMaster: Failed
> assignment of regions to
> serverName=sjc1-hadoop6.sjc1.carrieriq.com,60020,1297518996557,
> load=(requests=0, regions=231, usedHeap=3535, maxHeap=3983)
>
> but sjc1-hadoop6.sjc1 claimed:
> 2011-02-13 16:13:32,258 DEBUG
> org.apache.hadoop.hbase.regionserver.HRegionServer: No master found, will
> retry
>
> Here is stack trace for sjc1-hadoop6.sjc1:
> http://pastebin.com/X8zWLXqu
>
> I didn't have chance to capture master stack trace as master exited after
> that.
>
> I also attach master and region server log on sjc1-hadoop6.sjc1 - pardon me
> for including individual email addresses as attachments wouldn't go through
> hbase.apache.org
>
> On Thu, Feb 10, 2011 at 5:05 PM, Todd Lipcon <[email protected]> wrote:
>>
>> On Thu, Feb 10, 2011 at 4:54 PM, Ted Yu <[email protected]> wrote:
>>
>> > Thanks for the explanation.
>> > Assuming the mixed class loading is static, why did this situation
>> > develop
>> > after 40 minutes of heavy load :-(
>> >
>>
>> You didn't hit global memstore pressure until 40 minutes of load.
>>
>> -Todd
>>
>> On Thu, Feb 10, 2011 at 4:42 PM, Ryan Rawson <[email protected]> wrote:
>> >
>> > > It's a standard linking issue, you get one class from one version
>> > > another from another, they are mostly compatible in terms of
>> > > signatures (hence no exceptions) but are subtly incompatible in
>> > > different ways.
In the stack trace you posted, the handlers were >> > > blocked in: >> > > >> > > at >> > > >> > >> > org.apache.hadoop.hbase.regionserver.MemStoreFlusher.reclaimMemStoreMemory(MemStoreFlusher.java:382) >> > > >> > > and the thread: >> > > >> > > "regionserver60020.cacheFlusher" daemon prio=10 tid=0x00002aaabc21e000 >> > > nid=0x7717 waiting for monitor entry [0x0000000000000000] >> > > java.lang.Thread.State: BLOCKED (on object monitor) >> > > >> > > was idle. >> > > >> > > The cache flusher thread should be flushing, and yet it's doing >> > > nothing. This also happens to be one of the classes that were >> > > changed. >> > > >> > > >> > > >> > > On Thu, Feb 10, 2011 at 4:34 PM, Ted Yu <[email protected]> wrote: >> > > > Can someone comment on my second question ? >> > > > Thanks >> > > > >> > > > On Thu, Feb 10, 2011 at 4:25 PM, Ryan Rawson <[email protected]> >> > wrote: >> > > > >> > > >> As I suspected. >> > > >> >> > > >> It's a byproduct of our maven assembly process. The process could >> > > >> be >> > > >> fixed. I wouldn't mind. I don't support runtime checking of jars, >> > > >> there is such thing as too much tests, and this is an example of >> > > >> it. >> > > >> The check would then need a test, etc, etc. >> > > >> >> > > >> At SU we use new directories for each upgrade, copying the config >> > > >> over. With the lack of -default.xml this is easier than ever (just >> > > >> copy everything in conf/). With symlink switchover it makes roll >> > > >> forward/back as simple as doing a symlink switchover or back. I >> > > >> have >> > > >> to recommend this to everyone who doesnt have a management scheme. >> > > >> >> > > >> On Thu, Feb 10, 2011 at 4:20 PM, Ted Yu <[email protected]> >> > > >> wrote: >> > > >> > hbase/hbase-0.90.1.jar leads lib/hbase-0.90.0.jar in the >> > > >> > classpath. >> > > >> > I wonder >> > > >> > 1. why hbase jar is placed in two directories - 0.20.6 didn't use >> > such >> > > >> > structure >> > > >> > 2. 
what from lib/hbase-0.90.0.jar could be picked up and why >> > > >> > there >> > > wasn't >> > > >> > exception in server log >> > > >> > >> > > >> > I think a JIRA should be filed for item 2 above - bail out when >> > > >> > the >> > > two >> > > >> > hbase jars from $HBASE_HOME and $HBASE_HOME/lib are of different >> > > >> versions. >> > > >> > >> > > >> > Cheers >> > > >> > >> > > >> > On Thu, Feb 10, 2011 at 3:40 PM, Ryan Rawson <[email protected]> >> > > wrote: >> > > >> > >> > > >> >> What do you get when you: >> > > >> >> >> > > >> >> ls lib/hbase* >> > > >> >> >> > > >> >> I'm going to guess there is hbase-0.90.0.jar there >> > > >> >> >> > > >> >> >> > > >> >> >> > > >> >> On Thu, Feb 10, 2011 at 3:25 PM, Ted Yu <[email protected]> >> > wrote: >> > > >> >> > hbase-0.90.0-tests.jar and hbase-0.90.1.jar co-exist >> > > >> >> > Would this be a problem ? >> > > >> >> > >> > > >> >> > On Thu, Feb 10, 2011 at 3:16 PM, Ryan Rawson >> > > >> >> > <[email protected] >> > > >> > > >> wrote: >> > > >> >> > >> > > >> >> >> You don't have both the old and the new hbase jars in there >> > > >> >> >> do >> > > you? >> > > >> >> >> >> > > >> >> >> -ryan >> > > >> >> >> >> > > >> >> >> On Thu, Feb 10, 2011 at 3:12 PM, Ted Yu <[email protected]> >> > > wrote: >> > > >> >> >> > .META. went offline during second flow attempt. >> > > >> >> >> > >> > > >> >> >> > The time out I mentioned happened for 1st and 3rd attempts. >> > > HBase >> > > >> was >> > > >> >> >> > restarted before the 1st and 3rd attempts. >> > > >> >> >> > >> > > >> >> >> > Here is jstack: >> > > >> >> >> > http://pastebin.com/EHMSvsRt >> > > >> >> >> > >> > > >> >> >> > On Thu, Feb 10, 2011 at 3:04 PM, Stack <[email protected]> >> > > wrote: >> > > >> >> >> > >> > > >> >> >> >> So, .META. is not online? What happens if you use shell >> > > >> >> >> >> at >> > > this >> > > >> >> time. >> > > >> >> >> >> >> > > >> >> >> >> Your attachement did not come across Ted. 
Mind >> > > >> >> >> >> postbin'ing >> > it? >> > > >> >> >> >> >> > > >> >> >> >> St.Ack >> > > >> >> >> >> >> > > >> >> >> >> On Thu, Feb 10, 2011 at 2:41 PM, Ted Yu >> > > >> >> >> >> <[email protected] >> > > >> > > >> wrote: >> > > >> >> >> >> > I replaced hbase jar with hbase-0.90.1.jar >> > > >> >> >> >> > I also upgraded client side jar to hbase-0.90.1.jar >> > > >> >> >> >> > >> > > >> >> >> >> > Our map tasks were running faster than before for about >> > > >> >> >> >> > 50 >> > > >> minutes. >> > > >> >> >> >> However, >> > > >> >> >> >> > map tasks then timed out calling flushCommits(). This >> > > happened >> > > >> even >> > > >> >> >> after >> > > >> >> >> >> > fresh restart of hbase. >> > > >> >> >> >> > >> > > >> >> >> >> > I don't see any exception in region server logs. >> > > >> >> >> >> > >> > > >> >> >> >> > In master log, I found: >> > > >> >> >> >> > >> > > >> >> >> >> > 2011-02-10 18:24:15,286 DEBUG >> > > >> >> >> >> > >> > > >> >> >> >> > org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: >> > > >> Opened >> > > >> >> >> region >> > > >> >> >> >> > -ROOT-,,0.70236052 on sjc1-hadoop6.X.com >> > ,60020,1297362251595 >> > > >> >> >> >> > 2011-02-10 18:24:15,349 INFO >> > > >> >> >> >> org.apache.hadoop.hbase.catalog.CatalogTracker: >> > > >> >> >> >> > Failed verification of .META.,,1 at address=null; >> > > >> >> >> >> > org.apache.hadoop.hbase.NotServingRegionException: >> > > >> >> >> >> > org.apache.hadoop.hbase.NotServingRegionException: >> > > >> >> >> >> > Region >> > is >> > > not >> > > >> >> >> online: >> > > >> >> >> >> > .META.,,1 >> > > >> >> >> >> > 2011-02-10 18:24:15,350 DEBUG >> > > >> >> >> org.apache.hadoop.hbase.zookeeper.ZKAssign: >> > > >> >> >> >> > master:60000-0x12e10d0e31e0000 Creating (or updating) >> > > unassigned >> > > >> >> node >> > > >> >> >> for >> > > >> >> >> >> > 1028785192 with OFFLINE state >> > > >> >> >> >> > >> > > >> >> >> >> > I am attaching region server (which didn't respond to >> > > 
>> >> stop-hbase.sh) >> > > >> >> >> >> jstack. >> > > >> >> >> >> > >> > > >> >> >> >> > FYI >> > > >> >> >> >> > >> > > >> >> >> >> > On Thu, Feb 10, 2011 at 10:10 AM, Stack >> > > >> >> >> >> > <[email protected]> >> > > >> wrote: >> > > >> >> >> >> >> >> > > >> >> >> >> >> Thats probably enough Ted. The 0.90.1 >> > > >> >> >> >> >> hbase-default.xml >> > has >> > > an >> > > >> >> extra >> > > >> >> >> >> >> config. to enable the experimental HBASE-3455 feature >> > > >> >> >> >> >> but >> > > you >> > > >> can >> > > >> >> >> copy >> > > >> >> >> >> >> that over if you want to try playing with it (it >> > > >> >> >> >> >> defaults >> > > off >> > > >> so >> > > >> >> >> you'd >> > > >> >> >> >> >> copy over the config. if you wanted to set it to true). >> > > >> >> >> >> >> >> > > >> >> >> >> >> St.Ack >> > > >> >> >> >> > >> > > >> >> >> >> > >> > > >> >> >> >> >> > > >> >> >> > >> > > >> >> >> >> > > >> >> > >> > > >> >> >> > > >> > >> > > >> >> > > > >> > > >> > >> >> >> >> -- >> Todd Lipcon >> Software Engineer, Cloudera > >
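
The quoted thread traces the stuck flusher back to two HBase jars of different versions on the classpath (hbase/hbase-0.90.1.jar ahead of lib/hbase-0.90.0.jar), and Ted suggests bailing out when that happens. A minimal sketch of such a check, written in Python for illustration. The function name, the regex, and the assumption that jars are named "<artifact>-<version>.jar" are all hypothetical; nothing here is part of HBase itself.

```python
import re
from collections import defaultdict
from pathlib import PurePosixPath

# Assumed naming scheme: "<artifact>-<version>[-classifier].jar", where the
# version starts with a digit. A classifier such as "-tests" is folded into
# the artifact name, so hbase-0.90.0-tests.jar is not compared against
# hbase-0.90.1.jar.
JAR_RE = re.compile(
    r"^(?P<name>.+?)-(?P<version>\d+(?:\.\d+)*)(?P<classifier>-[A-Za-z]+)?\.jar$"
)


def find_version_conflicts(jar_paths):
    """Return {artifact: sorted versions} for artifacts seen in more than one version."""
    versions = defaultdict(set)
    for path in jar_paths:
        match = JAR_RE.match(PurePosixPath(path).name)
        if match is None:
            continue  # file name does not follow the assumed scheme; ignore it
        artifact = match.group("name") + (match.group("classifier") or "")
        versions[artifact].add(match.group("version"))
    return {a: sorted(v) for a, v in versions.items() if len(v) > 1}
```

Fed the two jar paths from the thread, this would report the "hbase" artifact with versions 0.90.0 and 0.90.1, while a co-existing hbase-0.90.0-tests.jar would be treated as a separate artifact. A startup script could refuse to launch when the returned dict is non-empty.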
