I disabled MSLAB. My flow still couldn't make much progress.
> In this region server stack trace, I don't see > MemStoreFlusher.reclaimMemStoreMemory() call: > http://pastebin.com/uiBRidUa > > > On Sun, Feb 13, 2011 at 1:14 PM, Ted Yu <[email protected]> wrote: > >> I am using hadoop-core-0.20.2-322.jar downloaded from Ryan's repo. >> FYI >> >> >> On Sun, Feb 13, 2011 at 1:12 PM, Ted Yu <[email protected]> wrote: >> >>> Since master server shut down, I restarted the cluster. >>> The next flow over 200GB data got timed out. >>> >>> Here are some region server stat: >>> >>> request=0.0, regions=95, stores=213, storefiles=65, >>> storefileIndexSize=99, memstoreSize=1311, compactionQueueSize=0, >>> flushQueueSize=0, usedHeap=2532, maxHeap=3983, blockCacheSize=6853968, >>> blockCacheFree=828520304, blockCacheCount=0, blockCacheHitCount=0, >>> blockCacheMissCount=0, blockCacheEvictedCount=0, blockCacheHitRatio=0, >>> blockCacheHitCachingRatio=0 >>> >>> request=0.0, regions=95, stores=232, storefiles=72, >>> storefileIndexSize=120, memstoreSize=301, compactionQueueSize=0, >>> flushQueueSize=0, usedHeap=1740, maxHeap=3983, blockCacheSize=13110928, >>> blockCacheFree=822263344, blockCacheCount=712, blockCacheHitCount=112478, >>> blockCacheMissCount=712, blockCacheEvictedCount=0, blockCacheHitRatio=99, >>> blockCacheHitCachingRatio=99 >>> >>> Thanks >>> >>> >>> On Sun, Feb 13, 2011 at 12:24 PM, Ryan Rawson <[email protected]>wrote: >>> >>>> every handler thread, and every reader and also the accept thread are >>>> all blocked on flushing memstore. The handlers get blocked, then the >>>> readers also have a finite handoff queue and they are blocked and also >>>> the accept. >>>> >>>> But why isnt memstore flushing? Do you have regionserver stats? ie: >>>> how much memstore global ram used? That is found on the main page of >>>> the regionserver http service, also found in ganglia/file stats. >>>> >>>> I havent looked at the logs yet, I'm off to lunch now. >>>> >>>> -ryan >>>> >>>> On Sun, Feb 13, 2011 at 8:44 AM, Ted Yu <[email protected]> wrote: >>>> > I had 3 consecutive successful runs processing 200GB data for each run >>>> > before hitting timeout problem in the 4th run. >>>> > >>>> > The 5th run couldn't proceed because master complained: >>>> > >>>> > 2011-02-13 16:11:45,173 FATAL org.apache.hadoop.hbase.master.HMaster: >>>> Failed >>>> > assignment of regions to >>>> > serverName=sjc1-hadoop6.sjc1.carrieriq.com,60020,1297518996557, >>>> > load=(requests=0, regions=231, usedHeap=3535, maxHeap=3983) >>>> > >>>> > but sjc1-hadoop6.sjc1 claimed: >>>> > 2011-02-13 16:13:32,258 DEBUG >>>> > org.apache.hadoop.hbase.regionserver.HRegionServer: No master found, >>>> will >>>> > retry >>>> > >>>> > Here is stack trace for sjc1-hadoop6.sjc1: >>>> > http://pastebin.com/X8zWLXqu >>>> > >>>> > I didn't have chance to capture master stack trace as master exited >>>> after >>>> > that. >>>> > >>>> > I also attach master and region server log on sjc1-hadoop6.sjc1 - >>>> pardon me >>>> > for including individual email addresses as attachments wouldn't go >>>> through >>>> > hbase.apache.org >>>> > >>>> > On Thu, Feb 10, 2011 at 5:05 PM, Todd Lipcon <[email protected]> >>>> wrote: >>>> >> >>>> >> On Thu, Feb 10, 2011 at 4:54 PM, Ted Yu <[email protected]> wrote: >>>> >> >>>> >> > Thanks for the explanation. >>>> >> > Assuming the mixed class loading is static, why did this situation >>>> >> > develop >>>> >> > after 40 minutes of heavy load :-( >>>> >> > >>>> >> >>>> >> You didn't hit global memstore pressure until 40 minutes of load. >>>> >> >>>> >> -Todd >>>> >> >>>> >> On Thu, Feb 10, 2011 at 4:42 PM, Ryan Rawson <[email protected]> >>>> wrote: >>>> >> > >>>> >> > > It's a standard linking issue, you get one class from one version >>>> >> > > another from another, they are mostly compatible in terms of >>>> >> > > signatures (hence no exceptions) but are subtly incompatible in >>>> >> > > different ways. In the stack trace you posted, the handlers were >>>> >> > > blocked in: >>>> >> > > >>>> >> > > at >>>> >> > > >>>> >> > >>>> >> > >>>> org.apache.hadoop.hbase.regionserver.MemStoreFlusher.reclaimMemStoreMemory(MemStoreFlusher.java:382) >>>> >> > > >>>> >> > > and the thread: >>>> >> > > >>>> >> > > "regionserver60020.cacheFlusher" daemon prio=10 >>>> tid=0x00002aaabc21e000 >>>> >> > > nid=0x7717 waiting for monitor entry [0x0000000000000000] >>>> >> > > java.lang.Thread.State: BLOCKED (on object monitor) >>>> >> > > >>>> >> > > was idle. >>>> >> > > >>>> >> > > The cache flusher thread should be flushing, and yet it's doing >>>> >> > > nothing. This also happens to be one of the classes that were >>>> >> > > changed. >>>> >> > > >>>> >> > > >>>> >> > > >>>> >> > > On Thu, Feb 10, 2011 at 4:34 PM, Ted Yu <[email protected]> >>>> wrote: >>>> >> > > > Can someone comment on my second question ? >>>> >> > > > Thanks >>>> >> > > > >>>> >> > > > On Thu, Feb 10, 2011 at 4:25 PM, Ryan Rawson < >>>> [email protected]> >>>> >> > wrote: >>>> >> > > > >>>> >> > > >> As I suspected. >>>> >> > > >> >>>> >> > > >> It's a byproduct of our maven assembly process. The process >>>> could >>>> >> > > >> be >>>> >> > > >> fixed. I wouldn't mind. I don't support runtime checking of >>>> jars, >>>> >> > > >> there is such thing as too much tests, and this is an example >>>> of >>>> >> > > >> it. >>>> >> > > >> The check would then need a test, etc, etc. >>>> >> > > >> >>>> >> > > >> At SU we use new directories for each upgrade, copying the >>>> config >>>> >> > > >> over. With the lack of -default.xml this is easier than ever >>>> (just >>>> >> > > >> copy everything in conf/). With symlink switchover it makes >>>> roll >>>> >> > > >> forward/back as simple as doing a symlink switchover or back. >>>> I >>>> >> > > >> have >>>> >> > > >> to recommend this to everyone who doesnt have a management >>>> scheme. >>>> >> > > >> >>>> >> > > >> On Thu, Feb 10, 2011 at 4:20 PM, Ted Yu <[email protected]> >>>> >> > > >> wrote: >>>> >> > > >> > hbase/hbase-0.90.1.jar leads lib/hbase-0.90.0.jar in the >>>> >> > > >> > classpath. >>>> >> > > >> > I wonder >>>> >> > > >> > 1. why hbase jar is placed in two directories - 0.20.6 >>>> didn't use >>>> >> > such >>>> >> > > >> > structure >>>> >> > > >> > 2. what from lib/hbase-0.90.0.jar could be picked up and why >>>> >> > > >> > there >>>> >> > > wasn't >>>> >> > > >> > exception in server log >>>> >> > > >> > >>>> >> > > >> > I think a JIRA should be filed for item 2 above - bail out >>>> when >>>> >> > > >> > the >>>> >> > > two >>>> >> > > >> > hbase jars from $HBASE_HOME and $HBASE_HOME/lib are of >>>> different >>>> >> > > >> versions. >>>> >> > > >> > >>>> >> > > >> > Cheers >>>> >> > > >> > >>>> >> > > >> > On Thu, Feb 10, 2011 at 3:40 PM, Ryan Rawson < >>>> [email protected]> >>>> >> > > wrote: >>>> >> > > >> > >>>> >> > > >> >> What do you get when you: >>>> >> > > >> >> >>>> >> > > >> >> ls lib/hbase* >>>> >> > > >> >> >>>> >> > > >> >> I'm going to guess there is hbase-0.90.0.jar there >>>> >> > > >> >> >>>> >> > > >> >> >>>> >> > > >> >> >>>> >> > > >> >> On Thu, Feb 10, 2011 at 3:25 PM, Ted Yu < >>>> [email protected]> >>>> >> > wrote: >>>> >> > > >> >> > hbase-0.90.0-tests.jar and hbase-0.90.1.jar co-exist >>>> >> > > >> >> > Would this be a problem ? >>>> >> > > >> >> > >>>> >> > > >> >> > On Thu, Feb 10, 2011 at 3:16 PM, Ryan Rawson >>>> >> > > >> >> > <[email protected] >>>> >> > > >>>> >> > > >> wrote: >>>> >> > > >> >> > >>>> >> > > >> >> >> You don't have both the old and the new hbase jars in >>>> there >>>> >> > > >> >> >> do >>>> >> > > you? >>>> >> > > >> >> >> >>>> >> > > >> >> >> -ryan >>>> >> > > >> >> >> >>>> >> > > >> >> >> On Thu, Feb 10, 2011 at 3:12 PM, Ted Yu < >>>> [email protected]> >>>> >> > > wrote: >>>> >> > > >> >> >> > .META. went offline during second flow attempt. >>>> >> > > >> >> >> > >>>> >> > > >> >> >> > The time out I mentioned happened for 1st and 3rd >>>> attempts. >>>> >> > > HBase >>>> >> > > >> was >>>> >> > > >> >> >> > restarted before the 1st and 3rd attempts. >>>> >> > > >> >> >> > >>>> >> > > >> >> >> > Here is jstack: >>>> >> > > >> >> >> > http://pastebin.com/EHMSvsRt >>>> >> > > >> >> >> > >>>> >> > > >> >> >> > On Thu, Feb 10, 2011 at 3:04 PM, Stack < >>>> [email protected]> >>>> >> > > wrote: >>>> >> > > >> >> >> > >>>> >> > > >> >> >> >> So, .META. is not online? What happens if you use >>>> shell >>>> >> > > >> >> >> >> at >>>> >> > > this >>>> >> > > >> >> time. >>>> >> > > >> >> >> >> >>>> >> > > >> >> >> >> Your attachement did not come across Ted. Mind >>>> >> > > >> >> >> >> postbin'ing >>>> >> > it? >>>> >> > > >> >> >> >> >>>> >> > > >> >> >> >> St.Ack >>>> >> > > >> >> >> >> >>>> >> > > >> >> >> >> On Thu, Feb 10, 2011 at 2:41 PM, Ted Yu >>>> >> > > >> >> >> >> <[email protected] >>>> >> > > >>>> >> > > >> wrote: >>>> >> > > >> >> >> >> > I replaced hbase jar with hbase-0.90.1.jar >>>> >> > > >> >> >> >> > I also upgraded client side jar to hbase-0.90.1.jar >>>> >> > > >> >> >> >> > >>>> >> > > >> >> >> >> > Our map tasks were running faster than before for >>>> about >>>> >> > > >> >> >> >> > 50 >>>> >> > > >> minutes. >>>> >> > > >> >> >> >> However, >>>> >> > > >> >> >> >> > map tasks then timed out calling flushCommits(). >>>> This >>>> >> > > happened >>>> >> > > >> even >>>> >> > > >> >> >> after >>>> >> > > >> >> >> >> > fresh restart of hbase. >>>> >> > > >> >> >> >> > >>>> >> > > >> >> >> >> > I don't see any exception in region server logs. >>>> >> > > >> >> >> >> > >>>> >> > > >> >> >> >> > In master log, I found: >>>> >> > > >> >> >> >> > >>>> >> > > >> >> >> >> > 2011-02-10 18:24:15,286 DEBUG >>>> >> > > >> >> >> >> > >>>> >> > > >> >> >> >> > >>>> org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: >>>> >> > > >> Opened >>>> >> > > >> >> >> region >>>> >> > > >> >> >> >> > -ROOT-,,0.70236052 on sjc1-hadoop6.X.com >>>> >> > ,60020,1297362251595 >>>> >> > > >> >> >> >> > 2011-02-10 18:24:15,349 INFO >>>> >> > > >> >> >> >> org.apache.hadoop.hbase.catalog.CatalogTracker: >>>> >> > > >> >> >> >> > Failed verification of .META.,,1 at address=null; >>>> >> > > >> >> >> >> > org.apache.hadoop.hbase.NotServingRegionException: >>>> >> > > >> >> >> >> > org.apache.hadoop.hbase.NotServingRegionException: >>>> >> > > >> >> >> >> > Region >>>> >> > is >>>> >> > > not >>>> >> > > >> >> >> online: >>>> >> > > >> >> >> >> > .META.,,1 >>>> >> > > >> >> >> >> > 2011-02-10 18:24:15,350 DEBUG >>>> >> > > >> >> >> org.apache.hadoop.hbase.zookeeper.ZKAssign: >>>> >> > > >> >> >> >> > master:60000-0x12e10d0e31e0000 Creating (or >>>> updating) >>>> >> > > unassigned >>>> >> > > >> >> node >>>> >> > > >> >> >> for >>>> >> > > >> >> >> >> > 1028785192 with OFFLINE state >>>> >> > > >> >> >> >> > >>>> >> > > >> >> >> >> > I am attaching region server (which didn't respond >>>> to >>>> >> > > >> >> stop-hbase.sh) >>>> >> > > >> >> >> >> jstack. >>>> >> > > >> >> >> >> > >>>> >> > > >> >> >> >> > FYI >>>> >> > > >> >> >> >> > >>>> >> > > >> >> >> >> > On Thu, Feb 10, 2011 at 10:10 AM, Stack >>>> >> > > >> >> >> >> > <[email protected]> >>>> >> > > >> wrote: >>>> >> > > >> >> >> >> >> >>>> >> > > >> >> >> >> >> Thats probably enough Ted. The 0.90.1 >>>> >> > > >> >> >> >> >> hbase-default.xml >>>> >> > has >>>> >> > > an >>>> >> > > >> >> extra >>>> >> > > >> >> >> >> >> config. to enable the experimental HBASE-3455 >>>> feature >>>> >> > > >> >> >> >> >> but >>>> >> > > you >>>> >> > > >> can >>>> >> > > >> >> >> copy >>>> >> > > >> >> >> >> >> that over if you want to try playing with it (it >>>> >> > > >> >> >> >> >> defaults >>>> >> > > off >>>> >> > > >> so >>>> >> > > >> >> >> you'd >>>> >> > > >> >> >> >> >> copy over the config. if you wanted to set it to >>>> true). >>>> >> > > >> >> >> >> >> >>>> >> > > >> >> >> >> >> St.Ack >>>> >> > > >> >> >> >> > >>>> >> > > >> >> >> >> > >>>> >> > > >> >> >> >> >>>> >> > > >> >> >> > >>>> >> > > >> >> >> >>>> >> > > >> >> > >>>> >> > > >> >> >>>> >> > > >> > >>>> >> > > >> >>>> >> > > > >>>> >> > > >>>> >> > >>>> >> >>>> >> >>>> >> >>>> >> -- >>>> >> Todd Lipcon >>>> >> Software Engineer, Cloudera >>>> > >>>> > >>>> >>> >>> >> >
