On Mon, Feb 14, 2011 at 6:23 PM, Ted Yu <[email protected]> wrote:
> Looks like the region server came down after I left office.
> Still, the second flow wasn't able to proceed writing to HBase. I hope
> such a stress test can pass.
>
> Does the same stress test pass on 0.90.0?
I don't think failures should block 0.90.1 unless the same test passes on
0.90.0. That is, so long as 0.90.1 is better than our previous release, we
should release it, and then get to work on 0.90.2 :)

-Todd

--
Todd Lipcon
Software Engineer, Cloudera

On Mon, Feb 14, 2011 at 5:23 PM, Todd Lipcon <[email protected]> wrote:
> Looks like it's just in the process of flushing something:
>
> "regionserver60020.cacheFlusher" daemon prio=10 tid=0x0000000050ce3000 nid=0x32a9 runnable [0x0000000043138000]
>    java.lang.Thread.State: RUNNABLE
>     at org.apache.hadoop.hbase.regionserver.StoreFile$Writer.append(StoreFile.java:797)
>     at org.apache.hadoop.hbase.regionserver.Store.internalFlushCache(Store.java:479)
>     - locked <0x000000072aecd6b8> (a java.lang.Object)
>     at org.apache.hadoop.hbase.regionserver.Store.flushCache(Store.java:448)
>     at org.apache.hadoop.hbase.regionserver.Store.access$100(Store.java:81)
>     at org.apache.hadoop.hbase.regionserver.Store$StoreFlusherImpl.flushCache(Store.java:1508)
>     at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:967)
>     at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:894)
>     at org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:846)
>     at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:386)
>     at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushOneForGlobalPressure(MemStoreFlusher.java:194)
>     at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.run(MemStoreFlusher.java:214)
>
> i.e. this is expected behavior. Does it not go down eventually?

On Mon, Feb 14, 2011 at 5:17 PM, Ted Yu <[email protected]> wrote:
> I applied Todd's patch from HBASE-3531 locally and built the hbase jar.
>
> I restarted the cluster. The first flow over 200GB of data succeeded.
> The second flow got stuck, so I tried to shut down the cluster.
>
> Here is the stack trace for one of the region servers that refused to go
> down: http://pastebin.com/yJQhhYp8
>
> This is particularly interesting (other threads were waiting to lock
> 0x000000070be8a028):
>
> "IPC Server handler 5 on 60020" daemon prio=10 tid=0x00002aaab883c000 nid=0x32bb waiting on condition [0x0000000043f46000]
>    java.lang.Thread.State: WAITING (parking)
>     at sun.misc.Unsafe.park(Native Method)
>     - parking to wait for <0x000000070ca468a8> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
>     at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
>     at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:811)
>     at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:842)
>     at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1178)
>     at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:186)
>     at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:262)
>     at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.reclaimMemStoreMemory(MemStoreFlusher.java:429)
>     - locked <0x000000070be8a028> (a org.apache.hadoop.hbase.regionserver.MemStoreFlusher)
>     at org.apache.hadoop.hbase.regionserver.HRegionServer.multi(HRegionServer.java:2562)
>
> Thanks for your attention.
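To make the shape of that hang concrete, here is a minimal, hypothetical
sketch (not the actual HBase 0.90 code; the names only echo the trace) of
handlers serializing through a single ReentrantLock while the lock holder
waits for flushes to bring global memstore usage back under a high-water
mark. If flushes never complete, every handler ends up parked exactly as in
the dump above.

    import java.util.concurrent.atomic.AtomicLong;
    import java.util.concurrent.locks.ReentrantLock;

    // Hypothetical stand-in for MemStoreFlusher.reclaimMemStoreMemory().
    class MemStorePressureGate {
        private final ReentrantLock lock = new ReentrantLock(); // the contended lock in the dump
        private final AtomicLong globalMemStoreSize = new AtomicLong();
        private final long highWaterMark;

        MemStorePressureGate(long highWaterMark) {
            this.highWaterMark = highWaterMark;
        }

        // Every IPC handler calls this before applying an edit.
        void reclaimMemStoreMemory() throws InterruptedException {
            lock.lock(); // all the other handlers park here, "parking to wait for <0x...>"
            try {
                while (globalMemStoreSize.get() > highWaterMark) {
                    requestFlush();    // ask the flusher thread to flush the largest region
                    Thread.sleep(100); // if no flush ever succeeds, this loop never exits
                }
            } finally {
                lock.unlock();
            }
        }

        private void requestFlush() {
            // enqueue a flush request for the flusher thread (omitted)
        }
    }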
On Mon, Feb 14, 2011 at 1:12 PM, Todd Lipcon <[email protected]> wrote:
> Gotcha, OK. I think I understand what's going on then.
> I'm guessing you have a TON of log messages like this?
>
> 2011-02-14 06:27:23,116 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
> NOT flushing memstore for region
> NIGHTLYDEVGRIDSGRIDSQL-THREEGPPSPEECHCALLS-1297661454931,E\xC2\xC8\xBC\xD4\xA5\x04E\xC2\xC8\xBC\xD4\xA5\x04E\xC2\xC8\xBC\xD4\xA5\x04E\xC2\xC8\xBC\xD4\xA5\x04E\xC2\xC8\xAA,1297662719973.cd06c5f9e4a0ffcc16cb6a5e559cd5b3.,
> flushing=false, writesEnabled=false
>
> -Todd
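For context on why Todd asks about those DEBUG lines, here is a condensed,
hypothetical sketch (not the real HRegion code; the field names just mirror
the log output) of the guard that produces them. A region that is mid-split
has writesEnabled=false, so flush requests against it are silently dropped;
if that region is the one holding most of the global memstore, the pressure
that reclaimMemStoreMemory() is waiting on never goes away.

    // Hypothetical sketch of the flush guard behind the DEBUG line above.
    class RegionSketch {
        volatile boolean flushing;       // a flush is already in progress
        volatile boolean writesEnabled;  // false while the region is closing for a split

        // Returns true only if a flush actually happened.
        boolean flushcache() {
            if (flushing || !writesEnabled) {
                System.out.println("NOT flushing memstore for region " + this
                        + ", flushing=" + flushing
                        + ", writesEnabled=" + writesEnabled);
                return false;            // dropped: memstore size stays where it was
            }
            return internalFlushcache();
        }

        private boolean internalFlushcache() {
            return true; // write memstore contents out to a new store file (omitted)
        }
    }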
On Mon, Feb 14, 2011 at 1:08 PM, Ted Yu <[email protected]> wrote:
> The lock-up happened before I shut down the region server.
> I had to get out of that situation so that I could continue 0.90.1
> validation.

On Mon, Feb 14, 2011 at 12:59 PM, Todd Lipcon <[email protected]> wrote:
> Hi Ted,
>
> What was the reason for the region server shutdown? i.e. was it aborting
> itself, did you send it a kill signal, or what?
>
> Still trying to understand why this happened.
>
> -Todd

On Mon, Feb 14, 2011 at 10:40 AM, Ted Yu <[email protected]> wrote:
> Here is the stack trace for one region server which didn't shut down
> cleanly: http://pastebin.com/PEtEdi4g
>
> I noticed IPC Server handler 7 was holding the lock in
> reclaimMemStoreMemory().
>
> Here is a related snippet from the region server's log:
> http://pastebin.com/KeXppURX
>
> I noticed region 1297662719973.cd06c5f9e4a0ffcc16cb6a5e559cd5b3 was
> splitting.
>
> Please advise whether there could be any relation between the above
> snippet and the lock-up in MemStoreFlusher.
>
> Thanks

On Mon, Feb 14, 2011 at 9:20 AM, Ted Yu <[email protected]> wrote:
> I disabled MSLAB.
> My flow still couldn't make much progress.
>
>> In this region server stack trace, I don't see the
>> MemStoreFlusher.reclaimMemStoreMemory() call:
>> http://pastebin.com/uiBRidUa

On Sun, Feb 13, 2011 at 1:14 PM, Ted Yu <[email protected]> wrote:
> I am using hadoop-core-0.20.2-322.jar downloaded from Ryan's repo.
> FYI

On Sun, Feb 13, 2011 at 1:12 PM, Ted Yu <[email protected]> wrote:
> Since the master server shut down, I restarted the cluster.
> The next flow over 200GB of data timed out.
>
> Here are some region server stats:
>
> request=0.0, regions=95, stores=213, storefiles=65, storefileIndexSize=99,
> memstoreSize=1311, compactionQueueSize=0, flushQueueSize=0, usedHeap=2532,
> maxHeap=3983, blockCacheSize=6853968, blockCacheFree=828520304,
> blockCacheCount=0, blockCacheHitCount=0, blockCacheMissCount=0,
> blockCacheEvictedCount=0, blockCacheHitRatio=0, blockCacheHitCachingRatio=0
>
> request=0.0, regions=95, stores=232, storefiles=72, storefileIndexSize=120,
> memstoreSize=301, compactionQueueSize=0, flushQueueSize=0, usedHeap=1740,
> maxHeap=3983, blockCacheSize=13110928, blockCacheFree=822263344,
> blockCacheCount=712, blockCacheHitCount=112478, blockCacheMissCount=712,
> blockCacheEvictedCount=0, blockCacheHitRatio=99,
> blockCacheHitCachingRatio=99
>
> Thanks

On Sun, Feb 13, 2011 at 12:24 PM, Ryan Rawson <[email protected]> wrote:
> Every handler thread, every reader, and the accept thread are all blocked
> on flushing the memstore. The handlers get blocked; then the readers,
> which have a finite handoff queue to the handlers, get blocked; and then
> the accept thread as well.
>
> But why isn't the memstore flushing? Do you have regionserver stats, i.e.
> how much global memstore RAM is used? That is found on the main page of
> the regionserver HTTP service, and also in ganglia/file stats.
>
> I haven't looked at the logs yet; I'm off to lunch now.
>
> -ryan
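Ryan's chain of blockage is easy to reproduce in miniature. A hedged sketch
(hypothetical names, not HBase's actual RPC server) of how a finite handoff
queue turns stuck handlers into a server that won't even accept connections:

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    // Hypothetical miniature of the RPC pipeline: accept -> readers -> handlers.
    class RpcPipelineSketch {
        // Finite handoff queue between the readers and the handlers.
        private final BlockingQueue<Runnable> callQueue = new ArrayBlockingQueue<>(100);

        // Reader thread: parses a request off a socket and hands it to the handlers.
        void onRequestParsed(Runnable call) throws InterruptedException {
            callQueue.put(call); // blocks once all handlers are stuck and the queue is full
        }

        // Handler thread: under memstore pressure each call parks inside the
        // equivalent of reclaimMemStoreMemory() and never returns, so take()
        // is never called again and the queue stays full.
        void handlerLoop() throws InterruptedException {
            while (true) {
                callQueue.take().run();
            }
        }

        // The accept thread hands new connections to the readers the same way;
        // once the readers block, accept blocks too and new clients time out.
    }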
On Sun, Feb 13, 2011 at 8:44 AM, Ted Yu <[email protected]> wrote:
> I had 3 consecutive successful runs, each processing 200GB of data,
> before hitting the timeout problem in the 4th run.
>
> The 5th run couldn't proceed because the master complained:
>
> 2011-02-13 16:11:45,173 FATAL org.apache.hadoop.hbase.master.HMaster: Failed
> assignment of regions to serverName=sjc1-hadoop6.sjc1.carrieriq.com,60020,1297518996557,
> load=(requests=0, regions=231, usedHeap=3535, maxHeap=3983)
>
> but sjc1-hadoop6.sjc1 claimed:
>
> 2011-02-13 16:13:32,258 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer:
> No master found, will retry
>
> Here is the stack trace for sjc1-hadoop6.sjc1:
> http://pastebin.com/X8zWLXqu
>
> I didn't have a chance to capture the master stack trace, as the master
> exited after that.
>
> I also attach the master log and the region server log from
> sjc1-hadoop6.sjc1; pardon me for including individual email addresses, as
> attachments wouldn't go through hbase.apache.org.

On Thu, Feb 10, 2011 at 5:05 PM, Todd Lipcon <[email protected]> wrote:
> On Thu, Feb 10, 2011 at 4:54 PM, Ted Yu <[email protected]> wrote:
>> Thanks for the explanation.
>> Assuming the mixed class loading is static, why did this situation
>> develop after 40 minutes of heavy load :-(
>
> You didn't hit global memstore pressure until 40 minutes of load.
>
> -Todd

On Thu, Feb 10, 2011 at 4:42 PM, Ryan Rawson <[email protected]> wrote:
> It's a standard linking issue: you get one class from one version and
> another from another. They are mostly compatible in terms of signatures
> (hence no exceptions) but are subtly incompatible in different ways.
> In the stack trace you posted, the handlers were blocked in:
>
>     at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.reclaimMemStoreMemory(MemStoreFlusher.java:382)
>
> and the thread:
>
>     "regionserver60020.cacheFlusher" daemon prio=10 tid=0x00002aaabc21e000 nid=0x7717 waiting for monitor entry [0x0000000000000000]
>        java.lang.Thread.State: BLOCKED (on object monitor)
>
> was idle. The cache flusher thread should be flushing, and yet it's doing
> nothing. This also happens to be one of the classes that were changed.
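What "subtly incompatible" can look like in practice: a purely illustrative
pair of classes (invented for this example, not HBase's real ones) where the
method signature is identical across versions but the meaning of the return
value changed. Mixing the jars raises no exception; the flusher thread simply
goes idle, which matches the dump.

    class FlusherV0 {               // imagine this class loaded from lib/hbase-0.90.0.jar
        boolean flushRegion() {
            // old contract: false means "could not flush now, please re-queue me"
            return false;
        }
    }

    class FlushDriverV1 {           // imagine this class loaded from hbase-0.90.1.jar
        void runOnce(FlusherV0 flusher) {
            // new contract: false means "nothing left to flush", so the driver
            // goes idle. No exception anywhere; the work silently stops.
            if (!flusher.flushRegion()) {
                return;
            }
        }
    }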
On Thu, Feb 10, 2011 at 4:34 PM, Ted Yu <[email protected]> wrote:
> Can someone comment on my second question?
> Thanks

On Thu, Feb 10, 2011 at 4:25 PM, Ryan Rawson <[email protected]> wrote:
> As I suspected.
>
> It's a byproduct of our maven assembly process. The process could be
> fixed; I wouldn't mind. I don't support runtime checking of jars. There
> is such a thing as too many tests, and this is an example of it: the
> check would then need a test, etc., etc.
>
> At SU we use new directories for each upgrade, copying the config over.
> With the lack of -default.xml this is easier than ever (just copy
> everything in conf/). A symlink switchover then makes rolling forward or
> back as simple as repointing the symlink. I have to recommend this to
> everyone who doesn't have a management scheme.

On Thu, Feb 10, 2011 at 4:20 PM, Ted Yu <[email protected]> wrote:
> hbase/hbase-0.90.1.jar leads lib/hbase-0.90.0.jar in the classpath.
> I wonder:
> 1. why the hbase jar is placed in two directories (0.20.6 didn't use such
>    a structure)
> 2. what from lib/hbase-0.90.0.jar could be picked up, and why there
>    wasn't an exception in the server log
>
> I think a JIRA should be filed for item 2 above: bail out when the two
> hbase jars from $HBASE_HOME and $HBASE_HOME/lib are of different versions.
>
> Cheers
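A minimal sketch of the bail-out check Ted is proposing (hypothetical; no
such check exists in 0.90): compare the hbase jar at the top of $HBASE_HOME
with any hbase jar under $HBASE_HOME/lib and refuse to start on a mismatch.

    import java.io.File;
    import java.io.FilenameFilter;

    public class HBaseJarVersionCheck {
        public static void main(String[] args) {
            // assumes HBASE_HOME is set in the environment
            File home = new File(System.getenv("HBASE_HOME"));
            String top = findHBaseJar(home);
            String lib = findHBaseJar(new File(home, "lib"));
            if (top != null && lib != null && !top.equals(lib)) {
                System.err.println("hbase jar version mismatch: "
                        + top + " vs lib/" + lib);
                System.exit(1); // bail out instead of mixing classes from both jars
            }
        }

        private static String findHBaseJar(File dir) {
            String[] jars = dir.list(new FilenameFilter() {
                public boolean accept(File d, String name) {
                    return name.startsWith("hbase-") && name.endsWith(".jar")
                            && !name.contains("tests");
                }
            });
            return (jars == null || jars.length == 0) ? null : jars[0];
        }
    }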
On Thu, Feb 10, 2011 at 3:40 PM, Ryan Rawson <[email protected]> wrote:
> What do you get when you:
>
>     ls lib/hbase*
>
> I'm going to guess there is an hbase-0.90.0.jar there.

On Thu, Feb 10, 2011 at 3:25 PM, Ted Yu <[email protected]> wrote:
> hbase-0.90.0-tests.jar and hbase-0.90.1.jar co-exist.
> Would this be a problem?

On Thu, Feb 10, 2011 at 3:16 PM, Ryan Rawson <[email protected]> wrote:
> You don't have both the old and the new hbase jars in there, do you?
>
> -ryan

On Thu, Feb 10, 2011 at 3:12 PM, Ted Yu <[email protected]> wrote:
> .META. went offline during the second flow attempt.
>
> The timeout I mentioned happened for the 1st and 3rd attempts. HBase was
> restarted before the 1st and 3rd attempts.
>
> Here is the jstack:
> http://pastebin.com/EHMSvsRt

On Thu, Feb 10, 2011 at 3:04 PM, Stack <[email protected]> wrote:
> So, .META. is not online? What happens if you use the shell at this time?
>
> Your attachment did not come across, Ted. Mind pastebin'ing it?
>
> St.Ack
On Thu, Feb 10, 2011 at 2:41 PM, Ted Yu <[email protected]> wrote:
> I replaced the hbase jar with hbase-0.90.1.jar.
> I also upgraded the client-side jar to hbase-0.90.1.jar.
>
> Our map tasks were running faster than before for about 50 minutes.
> However, map tasks then timed out calling flushCommits(). This happened
> even after a fresh restart of HBase.
>
> I don't see any exception in the region server logs.
>
> In the master log, I found:
>
> 2011-02-10 18:24:15,286 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler:
> Opened region -ROOT-,,0.70236052 on sjc1-hadoop6.X.com,60020,1297362251595
> 2011-02-10 18:24:15,349 INFO org.apache.hadoop.hbase.catalog.CatalogTracker:
> Failed verification of .META.,,1 at address=null;
> org.apache.hadoop.hbase.NotServingRegionException:
> org.apache.hadoop.hbase.NotServingRegionException: Region is not online: .META.,,1
> 2011-02-10 18:24:15,350 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:
> master:60000-0x12e10d0e31e0000 Creating (or updating) unassigned node for
> 1028785192 with OFFLINE state
>
> I am attaching the jstack of the region server (which didn't respond to
> stop-hbase.sh).
>
> FYI
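For reference, the client-side shape of what those map tasks were doing,
as a sketch against the 0.90 client API (the table and column names here
are made up): with auto-flush off, puts accumulate in a client-side write
buffer, and flushCommits() is the blocking call that ships them, so when
every handler on the region server is parked this is exactly the call that
hangs until the task tracker kills the task.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class BufferedWriteSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "mytable"); // made-up table name
            table.setAutoFlush(false);                  // buffer puts client-side

            Put put = new Put(Bytes.toBytes("row-1"));
            put.add(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes("v"));
            table.put(put);       // queued in the local write buffer, no RPC yet

            table.flushCommits(); // the blocking call the map tasks timed out in
            table.close();
        }
    }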
On Thu, Feb 10, 2011 at 10:10 AM, Stack <[email protected]> wrote:
> That's probably enough, Ted. The 0.90.1 hbase-default.xml has an extra
> config to enable the experimental HBASE-3455 feature, but you can copy
> that over if you want to try playing with it (it defaults to off, so
> you'd copy the config over if you wanted to set it to true).
>
> St.Ack
