Here is stack trace for one region server which didn't cleanly shutdown: http://pastebin.com/PEtEdi4g
I noticed IPC Server handler 7 was holding lock in reclaimMemStoreMemory(). Here is related snippet from the region server's log: http://pastebin.com/KeXppURX I noticed region 1297662719973.cd06c5f9e4a0ffcc16cb6a5e559cd5b3 was splitting. Please advise whether there could be any relation between the above snippet and lock up in MemStoreFlusher. Thanks On Mon, Feb 14, 2011 at 9:20 AM, Ted Yu <[email protected]> wrote: > I disabled MSLAB. > My flow still couldn't make much progress. > > >> In this region server stack trace, I don't see >> MemStoreFlusher.reclaimMemStoreMemory() call: >> http://pastebin.com/uiBRidUa >> >> >> On Sun, Feb 13, 2011 at 1:14 PM, Ted Yu <[email protected]> wrote: >> >>> I am using hadoop-core-0.20.2-322.jar downloaded from Ryan's repo. >>> FYI >>> >>> >>> On Sun, Feb 13, 2011 at 1:12 PM, Ted Yu <[email protected]> wrote: >>> >>>> Since master server shut down, I restarted the cluster. >>>> The next flow over 200GB data got timed out. >>>> >>>> Here are some region server stat: >>>> >>>> request=0.0, regions=95, stores=213, storefiles=65, >>>> storefileIndexSize=99, memstoreSize=1311, compactionQueueSize=0, >>>> flushQueueSize=0, usedHeap=2532, maxHeap=3983, blockCacheSize=6853968, >>>> blockCacheFree=828520304, blockCacheCount=0, blockCacheHitCount=0, >>>> blockCacheMissCount=0, blockCacheEvictedCount=0, blockCacheHitRatio=0, >>>> blockCacheHitCachingRatio=0 >>>> >>>> request=0.0, regions=95, stores=232, storefiles=72, >>>> storefileIndexSize=120, memstoreSize=301, compactionQueueSize=0, >>>> flushQueueSize=0, usedHeap=1740, maxHeap=3983, blockCacheSize=13110928, >>>> blockCacheFree=822263344, blockCacheCount=712, blockCacheHitCount=112478, >>>> blockCacheMissCount=712, blockCacheEvictedCount=0, blockCacheHitRatio=99, >>>> blockCacheHitCachingRatio=99 >>>> >>>> Thanks >>>> >>>> >>>> On Sun, Feb 13, 2011 at 12:24 PM, Ryan Rawson <[email protected]>wrote: >>>> >>>>> every handler thread, and every reader and also the accept thread are >>>>> all blocked on flushing memstore. The handlers get blocked, then the >>>>> readers also have a finite handoff queue and they are blocked and also >>>>> the accept. >>>>> >>>>> But why isnt memstore flushing? Do you have regionserver stats? ie: >>>>> how much memstore global ram used? That is found on the main page of >>>>> the regionserver http service, also found in ganglia/file stats. >>>>> >>>>> I havent looked at the logs yet, I'm off to lunch now. >>>>> >>>>> -ryan >>>>> >>>>> On Sun, Feb 13, 2011 at 8:44 AM, Ted Yu <[email protected]> wrote: >>>>> > I had 3 consecutive successful runs processing 200GB data for each >>>>> run >>>>> > before hitting timeout problem in the 4th run. >>>>> > >>>>> > The 5th run couldn't proceed because master complained: >>>>> > >>>>> > 2011-02-13 16:11:45,173 FATAL org.apache.hadoop.hbase.master.HMaster: >>>>> Failed >>>>> > assignment of regions to >>>>> > serverName=sjc1-hadoop6.sjc1.carrieriq.com,60020,1297518996557, >>>>> > load=(requests=0, regions=231, usedHeap=3535, maxHeap=3983) >>>>> > >>>>> > but sjc1-hadoop6.sjc1 claimed: >>>>> > 2011-02-13 16:13:32,258 DEBUG >>>>> > org.apache.hadoop.hbase.regionserver.HRegionServer: No master found, >>>>> will >>>>> > retry >>>>> > >>>>> > Here is stack trace for sjc1-hadoop6.sjc1: >>>>> > http://pastebin.com/X8zWLXqu >>>>> > >>>>> > I didn't have chance to capture master stack trace as master exited >>>>> after >>>>> > that. >>>>> > >>>>> > I also attach master and region server log on sjc1-hadoop6.sjc1 - >>>>> pardon me >>>>> > for including individual email addresses as attachments wouldn't go >>>>> through >>>>> > hbase.apache.org >>>>> > >>>>> > On Thu, Feb 10, 2011 at 5:05 PM, Todd Lipcon <[email protected]> >>>>> wrote: >>>>> >> >>>>> >> On Thu, Feb 10, 2011 at 4:54 PM, Ted Yu <[email protected]> >>>>> wrote: >>>>> >> >>>>> >> > Thanks for the explanation. >>>>> >> > Assuming the mixed class loading is static, why did this situation >>>>> >> > develop >>>>> >> > after 40 minutes of heavy load :-( >>>>> >> > >>>>> >> >>>>> >> You didn't hit global memstore pressure until 40 minutes of load. >>>>> >> >>>>> >> -Todd >>>>> >> >>>>> >> On Thu, Feb 10, 2011 at 4:42 PM, Ryan Rawson <[email protected]> >>>>> wrote: >>>>> >> > >>>>> >> > > It's a standard linking issue, you get one class from one >>>>> version >>>>> >> > > another from another, they are mostly compatible in terms of >>>>> >> > > signatures (hence no exceptions) but are subtly incompatible in >>>>> >> > > different ways. In the stack trace you posted, the handlers were >>>>> >> > > blocked in: >>>>> >> > > >>>>> >> > > at >>>>> >> > > >>>>> >> > >>>>> >> > >>>>> org.apache.hadoop.hbase.regionserver.MemStoreFlusher.reclaimMemStoreMemory(MemStoreFlusher.java:382) >>>>> >> > > >>>>> >> > > and the thread: >>>>> >> > > >>>>> >> > > "regionserver60020.cacheFlusher" daemon prio=10 >>>>> tid=0x00002aaabc21e000 >>>>> >> > > nid=0x7717 waiting for monitor entry [0x0000000000000000] >>>>> >> > > java.lang.Thread.State: BLOCKED (on object monitor) >>>>> >> > > >>>>> >> > > was idle. >>>>> >> > > >>>>> >> > > The cache flusher thread should be flushing, and yet it's doing >>>>> >> > > nothing. This also happens to be one of the classes that were >>>>> >> > > changed. >>>>> >> > > >>>>> >> > > >>>>> >> > > >>>>> >> > > On Thu, Feb 10, 2011 at 4:34 PM, Ted Yu <[email protected]> >>>>> wrote: >>>>> >> > > > Can someone comment on my second question ? >>>>> >> > > > Thanks >>>>> >> > > > >>>>> >> > > > On Thu, Feb 10, 2011 at 4:25 PM, Ryan Rawson < >>>>> [email protected]> >>>>> >> > wrote: >>>>> >> > > > >>>>> >> > > >> As I suspected. >>>>> >> > > >> >>>>> >> > > >> It's a byproduct of our maven assembly process. The process >>>>> could >>>>> >> > > >> be >>>>> >> > > >> fixed. I wouldn't mind. I don't support runtime checking of >>>>> jars, >>>>> >> > > >> there is such thing as too much tests, and this is an example >>>>> of >>>>> >> > > >> it. >>>>> >> > > >> The check would then need a test, etc, etc. >>>>> >> > > >> >>>>> >> > > >> At SU we use new directories for each upgrade, copying the >>>>> config >>>>> >> > > >> over. With the lack of -default.xml this is easier than ever >>>>> (just >>>>> >> > > >> copy everything in conf/). With symlink switchover it makes >>>>> roll >>>>> >> > > >> forward/back as simple as doing a symlink switchover or back. >>>>> I >>>>> >> > > >> have >>>>> >> > > >> to recommend this to everyone who doesnt have a management >>>>> scheme. >>>>> >> > > >> >>>>> >> > > >> On Thu, Feb 10, 2011 at 4:20 PM, Ted Yu <[email protected] >>>>> > >>>>> >> > > >> wrote: >>>>> >> > > >> > hbase/hbase-0.90.1.jar leads lib/hbase-0.90.0.jar in the >>>>> >> > > >> > classpath. >>>>> >> > > >> > I wonder >>>>> >> > > >> > 1. why hbase jar is placed in two directories - 0.20.6 >>>>> didn't use >>>>> >> > such >>>>> >> > > >> > structure >>>>> >> > > >> > 2. what from lib/hbase-0.90.0.jar could be picked up and >>>>> why >>>>> >> > > >> > there >>>>> >> > > wasn't >>>>> >> > > >> > exception in server log >>>>> >> > > >> > >>>>> >> > > >> > I think a JIRA should be filed for item 2 above - bail out >>>>> when >>>>> >> > > >> > the >>>>> >> > > two >>>>> >> > > >> > hbase jars from $HBASE_HOME and $HBASE_HOME/lib are of >>>>> different >>>>> >> > > >> versions. >>>>> >> > > >> > >>>>> >> > > >> > Cheers >>>>> >> > > >> > >>>>> >> > > >> > On Thu, Feb 10, 2011 at 3:40 PM, Ryan Rawson < >>>>> [email protected]> >>>>> >> > > wrote: >>>>> >> > > >> > >>>>> >> > > >> >> What do you get when you: >>>>> >> > > >> >> >>>>> >> > > >> >> ls lib/hbase* >>>>> >> > > >> >> >>>>> >> > > >> >> I'm going to guess there is hbase-0.90.0.jar there >>>>> >> > > >> >> >>>>> >> > > >> >> >>>>> >> > > >> >> >>>>> >> > > >> >> On Thu, Feb 10, 2011 at 3:25 PM, Ted Yu < >>>>> [email protected]> >>>>> >> > wrote: >>>>> >> > > >> >> > hbase-0.90.0-tests.jar and hbase-0.90.1.jar co-exist >>>>> >> > > >> >> > Would this be a problem ? >>>>> >> > > >> >> > >>>>> >> > > >> >> > On Thu, Feb 10, 2011 at 3:16 PM, Ryan Rawson >>>>> >> > > >> >> > <[email protected] >>>>> >> > > >>>>> >> > > >> wrote: >>>>> >> > > >> >> > >>>>> >> > > >> >> >> You don't have both the old and the new hbase jars in >>>>> there >>>>> >> > > >> >> >> do >>>>> >> > > you? >>>>> >> > > >> >> >> >>>>> >> > > >> >> >> -ryan >>>>> >> > > >> >> >> >>>>> >> > > >> >> >> On Thu, Feb 10, 2011 at 3:12 PM, Ted Yu < >>>>> [email protected]> >>>>> >> > > wrote: >>>>> >> > > >> >> >> > .META. went offline during second flow attempt. >>>>> >> > > >> >> >> > >>>>> >> > > >> >> >> > The time out I mentioned happened for 1st and 3rd >>>>> attempts. >>>>> >> > > HBase >>>>> >> > > >> was >>>>> >> > > >> >> >> > restarted before the 1st and 3rd attempts. >>>>> >> > > >> >> >> > >>>>> >> > > >> >> >> > Here is jstack: >>>>> >> > > >> >> >> > http://pastebin.com/EHMSvsRt >>>>> >> > > >> >> >> > >>>>> >> > > >> >> >> > On Thu, Feb 10, 2011 at 3:04 PM, Stack < >>>>> [email protected]> >>>>> >> > > wrote: >>>>> >> > > >> >> >> > >>>>> >> > > >> >> >> >> So, .META. is not online? What happens if you use >>>>> shell >>>>> >> > > >> >> >> >> at >>>>> >> > > this >>>>> >> > > >> >> time. >>>>> >> > > >> >> >> >> >>>>> >> > > >> >> >> >> Your attachement did not come across Ted. Mind >>>>> >> > > >> >> >> >> postbin'ing >>>>> >> > it? >>>>> >> > > >> >> >> >> >>>>> >> > > >> >> >> >> St.Ack >>>>> >> > > >> >> >> >> >>>>> >> > > >> >> >> >> On Thu, Feb 10, 2011 at 2:41 PM, Ted Yu >>>>> >> > > >> >> >> >> <[email protected] >>>>> >> > > >>>>> >> > > >> wrote: >>>>> >> > > >> >> >> >> > I replaced hbase jar with hbase-0.90.1.jar >>>>> >> > > >> >> >> >> > I also upgraded client side jar to >>>>> hbase-0.90.1.jar >>>>> >> > > >> >> >> >> > >>>>> >> > > >> >> >> >> > Our map tasks were running faster than before for >>>>> about >>>>> >> > > >> >> >> >> > 50 >>>>> >> > > >> minutes. >>>>> >> > > >> >> >> >> However, >>>>> >> > > >> >> >> >> > map tasks then timed out calling flushCommits(). >>>>> This >>>>> >> > > happened >>>>> >> > > >> even >>>>> >> > > >> >> >> after >>>>> >> > > >> >> >> >> > fresh restart of hbase. >>>>> >> > > >> >> >> >> > >>>>> >> > > >> >> >> >> > I don't see any exception in region server logs. >>>>> >> > > >> >> >> >> > >>>>> >> > > >> >> >> >> > In master log, I found: >>>>> >> > > >> >> >> >> > >>>>> >> > > >> >> >> >> > 2011-02-10 18:24:15,286 DEBUG >>>>> >> > > >> >> >> >> > >>>>> >> > > >> >> >> >> > >>>>> org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: >>>>> >> > > >> Opened >>>>> >> > > >> >> >> region >>>>> >> > > >> >> >> >> > -ROOT-,,0.70236052 on sjc1-hadoop6.X.com >>>>> >> > ,60020,1297362251595 >>>>> >> > > >> >> >> >> > 2011-02-10 18:24:15,349 INFO >>>>> >> > > >> >> >> >> org.apache.hadoop.hbase.catalog.CatalogTracker: >>>>> >> > > >> >> >> >> > Failed verification of .META.,,1 at address=null; >>>>> >> > > >> >> >> >> > org.apache.hadoop.hbase.NotServingRegionException: >>>>> >> > > >> >> >> >> > org.apache.hadoop.hbase.NotServingRegionException: >>>>> >> > > >> >> >> >> > Region >>>>> >> > is >>>>> >> > > not >>>>> >> > > >> >> >> online: >>>>> >> > > >> >> >> >> > .META.,,1 >>>>> >> > > >> >> >> >> > 2011-02-10 18:24:15,350 DEBUG >>>>> >> > > >> >> >> org.apache.hadoop.hbase.zookeeper.ZKAssign: >>>>> >> > > >> >> >> >> > master:60000-0x12e10d0e31e0000 Creating (or >>>>> updating) >>>>> >> > > unassigned >>>>> >> > > >> >> node >>>>> >> > > >> >> >> for >>>>> >> > > >> >> >> >> > 1028785192 with OFFLINE state >>>>> >> > > >> >> >> >> > >>>>> >> > > >> >> >> >> > I am attaching region server (which didn't respond >>>>> to >>>>> >> > > >> >> stop-hbase.sh) >>>>> >> > > >> >> >> >> jstack. >>>>> >> > > >> >> >> >> > >>>>> >> > > >> >> >> >> > FYI >>>>> >> > > >> >> >> >> > >>>>> >> > > >> >> >> >> > On Thu, Feb 10, 2011 at 10:10 AM, Stack >>>>> >> > > >> >> >> >> > <[email protected]> >>>>> >> > > >> wrote: >>>>> >> > > >> >> >> >> >> >>>>> >> > > >> >> >> >> >> Thats probably enough Ted. The 0.90.1 >>>>> >> > > >> >> >> >> >> hbase-default.xml >>>>> >> > has >>>>> >> > > an >>>>> >> > > >> >> extra >>>>> >> > > >> >> >> >> >> config. to enable the experimental HBASE-3455 >>>>> feature >>>>> >> > > >> >> >> >> >> but >>>>> >> > > you >>>>> >> > > >> can >>>>> >> > > >> >> >> copy >>>>> >> > > >> >> >> >> >> that over if you want to try playing with it (it >>>>> >> > > >> >> >> >> >> defaults >>>>> >> > > off >>>>> >> > > >> so >>>>> >> > > >> >> >> you'd >>>>> >> > > >> >> >> >> >> copy over the config. if you wanted to set it to >>>>> true). >>>>> >> > > >> >> >> >> >> >>>>> >> > > >> >> >> >> >> St.Ack >>>>> >> > > >> >> >> >> > >>>>> >> > > >> >> >> >> > >>>>> >> > > >> >> >> >> >>>>> >> > > >> >> >> > >>>>> >> > > >> >> >> >>>>> >> > > >> >> > >>>>> >> > > >> >> >>>>> >> > > >> > >>>>> >> > > >> >>>>> >> > > > >>>>> >> > > >>>>> >> > >>>>> >> >>>>> >> >>>>> >> >>>>> >> -- >>>>> >> Todd Lipcon >>>>> >> Software Engineer, Cloudera >>>>> > >>>>> > >>>>> >>>> >>>> >>> >> >
