I dug back further to find the origin of e11e8b030897d6e5b973f8fe892e0eb2 to see if it had previous problems that left ZK in a bad state. Here's the regionserver and master logs: http://pastebin.com/qcvHjNCg from about 2 hours earlier.
* Nov 4, 22:34: region is created as daugher b of a split on node 159 * Nov 4, 22:35: moved to from node 159 to 169 by HBaseAdmin.move() * Nov 5, 00:24: node 169 tries to split the region but gets Failed create of ephemeral /hbase/unassigned/e11e8b030897d6e5b973f8fe892e0eb2 Is it possible that if something calls HBaseAdmin.move() on a daugher region that is 30 seconds old, it could move the region but leave that ZK node in a bad state? On Tue, Nov 6, 2012 at 7:13 AM, lars hofhansl <[email protected]> wrote: > Thanks Ram! > > > > > > ----- Original Message ----- > From: ramkrishna vasudevan <[email protected]> > To: [email protected] > Cc: > Sent: Tuesday, November 6, 2012 3:29 AM > Subject: Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2 > > Raised HBASE-7103 for the same. > > Regards > Ram > > On Tue, Nov 6, 2012 at 3:37 PM, ramkrishna vasudevan < > [email protected]> wrote: > > > Thanks for the logs. > > I found the reason. > > > > The following steps happen > > -> Initially the parent region P1 starts splitting. > > -> The split is going on normally. > > -> Another split starts at the same time for the same region P1. (Not > sure > > why this started). > > -> Rollback happens seeing an already existing node. > > -> This node gets deleted in rollback and nodeDeleted Event starts. > > -> In nodeDeleted event the RIT for the region P1 gets deleted. > > -> Because of this there is no region in RIT. > > -> Now the first split gets over. Here the problem is we try to transit > > the node to SPLITTING to SPLIT. But the node even does not exist. > > But we don take any action on this. We think it is successful. > > -> Even before HBASE-6854 this could have happened. Will file a JIRA for > > the same. > > > > Regards > > Ram > > > > On Tue, Nov 6, 2012 at 1:42 PM, Matt Corgan <[email protected]> wrote: > > > >> Ram, here's the master log corresponding to > http://pastebin.com/cSdMbA2a. > >> Looks like e11e8b030897d6e5b973f8fe892e0eb2 was splitting on the > >> regionserver in question (node 169), so i'm guessing that's > >> 22f8fa73d8af837410ca270f344f6bb8's mommy. > >> > >> btw - you can see my balancer kick in 45 seconds later (runs every 10 > >> minutes) here, but so far i think that's coincidence: > >> 2012-11-05 00:25:29,893 INFO org.apache.hadoop.hbase.master.HMaster: > >> BalanceSwitch=false > >> > >> I followed the trail of e11e8b030897d6e5b973f8fe892e0eb2 back to node > 169 > >> and found all this stuff about a failed split: > >> http://pastebin.com/xtXMZ388 and > >> an attempted rollback. Looks like it errors out when it goes to put a > >> node > >> in ZK but it's already there. I'm not familiar with what a good split > log > >> looks like, so i'll stop commenting for now... > >> > >> > >> On Mon, Nov 5, 2012 at 10:30 PM, ramkrishna vasudevan < > >> [email protected]> wrote: > >> > >> > The log shows that the first time the region was transitioned to > >> SPLITTING > >> > even then it was not populated in the Master's memory. > >> > > >> > On Tue, Nov 6, 2012 at 11:29 AM, ramkrishna vasudevan < > >> > [email protected]> wrote: > >> > > >> > > Could you attach the master logs at this time > >> > > 2012-11-05 00:24:55? > >> > > > >> > > Regards > >> > > Ram > >> > > > >> > > On Tue, Nov 6, 2012 at 11:15 AM, lars hofhansl <[email protected] > >> > >wrote: > >> > > > >> > >> Took a brief look through all SPLIT related commits since 0.94.0... > >> > Found > >> > >> these: > >> > >> > >> > >> HBASE-6854 * > >> > >> HBASE-6713 > >> > >> HBASE-6329 * > >> > >> > >> > >> HBASE-6088 > >> > >> > >> > >> HBASE-5986 > >> > >> HBASE-6070 * > >> > >> > >> > >> > >> > >> The ones marked with * are (IMHO) more likely to be related. > >> > >> > >> > >> -- Lars > >> > >> > >> > >> ________________________________ > >> > >> From: Matt Corgan <[email protected]> > >> > >> To: dev <[email protected]>; lars hofhansl <[email protected] > > > >> > >> Sent: Monday, November 5, 2012 9:28 PM > >> > >> Subject: Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2 > >> > >> > >> > >> Yeah - we were running .94.0 since it came out but never saw it > >> there. > >> > >> I'll keep trying to narrow it down. The only harm it's causing is > >> log > >> > >> spam and failing to move daughters to a new regionserver, which are > >> > >> definitely problems, but it's not bringing down the cluster. > >> > >> > >> > >> > >> > >> On Mon, Nov 5, 2012 at 9:17 PM, lars hofhansl <[email protected] > > > >> > >> wrote: > >> > >> > >> > >> > So it seems you can repeat this to some extend in 0.94.2, but you > >> have > >> > >> > never seen this before? > >> > >> > > >> > >> > > >> > >> > -- Lars > >> > >> > > >> > >> > > >> > >> > > >> > >> > ________________________________ > >> > >> > From: Matt Corgan <[email protected]> > >> > >> > To: dev <[email protected]> > >> > >> > Sent: Monday, November 5, 2012 9:10 PM > >> > >> > Subject: Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2 > >> > >> > > >> > >> > It happened in this new table that I have all the logs for. The > >> > region > >> > >> in > >> > >> > question this time is 6839663e4f8f79be3d7469784c21cbc2, and the > >> first > >> > >> trace > >> > >> > of this region is on the regionserver with the "Intantiated > >> > >> tableName..." > >> > >> > message > >> > >> > > >> > >> > 2012-11-05 22:24:21,162 DEBUG > >> > >> org.apache.hadoop.hbase.regionserver.HRegion: > >> > >> > Instantiated > >> > >> > > >> > >> > > >> > >> > >> > > >> > StatAreaModelLink,\x00\x00\x07\xD9\x00\x00\x00\x0C\x00\x00\x00\x004H\xC4\xB5\x00\x00\x00\x02\x00\x00\x00\x05\x00\x00\x00\x00G.l\x9B,1352172257535.6839663e4f8f79be3d74 > >> > >> > 9784c21cbc2. > >> > >> > > >> > >> > I also know this region came from a recent split, but I can't > find > >> any > >> > >> log > >> > >> > messages show the parent finishing the split that created this > >> > daughter > >> > >> > region. So my guess now is that the split is actually finishing > >> and > >> > >> > letting clients continue to write data, but something is failing > to > >> > >> print > >> > >> > the log line and correctly tell the master about the new region. > >> > >> > > >> > >> > I've noticed that these regions are showing up on the clients in > >> calls > >> > >> to > >> > >> > HTable.getRegionLocations(), so the clients continue to function, > >> but > >> > >> if I > >> > >> > call HBaseAdmin.move() i get an UnknownRegionException. > >> > >> > > >> > >> > > >> > >> > On Mon, Nov 5, 2012 at 7:07 PM, Ted Yu <[email protected]> > >> wrote: > >> > >> > > >> > >> > > I think what Matt encountered was this: > >> > >> > > https://issues.apache.org/jira/browse/HBASE-7101 > >> > >> > > > >> > >> > > On Mon, Nov 5, 2012 at 5:33 PM, Matt Corgan < > [email protected] > >> > > >> > >> wrote: > >> > >> > > > >> > >> > > > Beats me - i can't find any record of it before that. > >> > >> > > > > >> > >> > > > I'm importing data into another table now. I > disabled/enabled > >> the > >> > >> > table > >> > >> > > > first to make sure we have the original 4 region locations > >> logged > >> > >> > > > everywhere. Will report back... > >> > >> > > > > >> > >> > > > > >> > >> > > > On Mon, Nov 5, 2012 at 2:15 PM, Jean-Daniel Cryans < > >> > >> > [email protected] > >> > >> > > > >wrote: > >> > >> > > > > >> > >> > > > > Wow, and can you figure how this happened? > >> > >> > > > > > >> > >> > > > > 2012-11-05 00:24:54,538 DEBUG > >> > >> > > > > org.apache.hadoop.hbase.regionserver.HRegion: Instantiated > >> > >> > > > > > >> > >> > > > > > >> > >> > > > > >> > >> > > > >> > >> > > >> > >> > >> > > >> > ActiveListingRecord16,\x86\x07\xDC\x03\x17RealtyTrac\x0044737383,1352093084141.22f8fa73d8af837410ca270f344f6bb8. > >> > >> > > > > > >> > >> > > > > Split? Else how did the master told that RS to open a > region > >> > that > >> > >> > > doesn't > >> > >> > > > > exist? > >> > >> > > > > > >> > >> > > > > On Mon, Nov 5, 2012 at 2:02 PM, Matt Corgan < > >> > [email protected]> > >> > >> > > wrote: > >> > >> > > > > > Here's one from last night with the master logs at the > >> bottom. > >> > >> > > > > > http://pastebin.com/cSdMbA2a > >> > >> > > > > > > >> > >> > > > > > I don't see that region in the master logs for 4 days > back. > >> > I'm > >> > >> > > going > >> > >> > > > to > >> > >> > > > > > import some new data from scratch soon and will be sure > to > >> > keep > >> > >> all > >> > >> > > the > >> > >> > > > > > master logs. > >> > >> > > > > > > >> > >> > > > > > > >> > >> > > > > > On Mon, Nov 5, 2012 at 9:52 AM, Jean-Daniel Cryans < > >> > >> > > > [email protected] > >> > >> > > > > >wrote: > >> > >> > > > > > > >> > >> > > > > >> This reminds me a lot of > >> > >> > > > > https://issues.apache.org/jira/browse/HBASE-4792 > >> > >> > > > > >> > >> > >> > > > > >> What I'd like to see is the log from the first time the > >> > master > >> > >> > > > > >> receives the split message. I guess it says the region > >> > doesn't > >> > >> > exist > >> > >> > > > > >> anymore because the split was processed already in the > >> master > >> > >> but > >> > >> > > > > >> there's a failure mode similar to 4792. > >> > >> > > > > >> > >> > >> > > > > >> I saw this on another cluster last week but the logs > were > >> > >> rolling > >> > >> > > > > >> based on size so the original split message was lost. > >> > >> > > > > >> > >> > >> > > > > >> J-D > >> > >> > > > > >> > >> > >> > > > > >> On Sun, Nov 4, 2012 at 5:16 PM, Matt Corgan < > >> > >> [email protected]> > >> > >> > > > > wrote: > >> > >> > > > > >> > I've had a few successful splits today without any > >> errors. > >> > >> I'll > >> > >> > > > turn > >> > >> > > > > up > >> > >> > > > > >> > the importer speed tomorrow to start stressing the > >> cluster > >> > >> more. > >> > >> > > > > >> > > >> > >> > > > > >> > > >> > >> > > > > >> > On Sun, Nov 4, 2012 at 2:07 PM, Ted Yu < > >> > [email protected]> > >> > >> > > wrote: > >> > >> > > > > >> > > >> > >> > > > > >> >> Matt: > >> > >> > > > > >> >> Was there any region server crash before you noticed > >> the > >> > >> > repeated > >> > >> > > > log > >> > >> > > > > >> for > >> > >> > > > > >> >> bc62a8a72124a4ba3f6b9f302587903c ? > >> > >> > > > > >> >> > >> > >> > > > > >> >> I wonder if the following JIRA might be related: > >> > >> > > > > >> >> HBASE-7083 SSH#fixupDaughter should force re-assign > >> > missing > >> > >> > > > daughter > >> > >> > > > > >> >> > >> > >> > > > > >> >> Thanks > >> > >> > > > > >> >> > >> > >> > > > > >> >> On Sat, Nov 3, 2012 at 7:27 PM, Matt Corgan < > >> > >> > [email protected] > >> > >> > > > > >> > >> > > > > >> wrote: > >> > >> > > > > >> >> > >> > >> > > > > >> >> > Strangely, I don't see any record of that region in > >> the > >> > >> > master > >> > >> > > > > before > >> > >> > > > > >> >> what > >> > >> > > > > >> >> > I already pasted even though I have logs back to > >> 10/30. > >> > >> Next > >> > >> > > > time > >> > >> > > > > it > >> > >> > > > > >> >> > happens I'll gather a full log record and try to > >> debug > >> > >> while > >> > >> > > it's > >> > >> > > > > >> >> > occurring. > >> > >> > > > > >> >> > > >> > >> > > > > >> >> > > >> > >> > > > > >> >> > On Sat, Nov 3, 2012 at 7:10 PM, rajesh babu > >> > chintaguntla < > >> > >> > > > > >> >> > [email protected]> wrote: > >> > >> > > > > >> >> > > >> > >> > > > > >> >> > > Hi Matt, > >> > >> > > > > >> >> > > can you paste some more master logs of region > >> > >> > > > > >> >> > > bc62a8a72124a4ba3f6b9f30258790 before split. > >> > >> > > > > >> >> > > I think Its not problem with splitting. > >> > >> > > > > >> >> > > We are getting > >> > >> > > > > >> >> > > LOG.warn("Region " + encodedName + " not > >> found > >> > on > >> > >> > > server > >> > >> > > > " > >> > >> > > > > + > >> > >> > > > > >> >> > > serverName + > >> > >> > > > > >> >> > > "; failed processing"); > >> > >> > > > > >> >> > > this log means no entry in servers map(not fully > >> > >> assigned). > >> > >> > > > > >> >> > > Set<HRegionInfo> hris = this.servers.get(sn); > >> > >> > > > > >> >> > > HRegionInfo foundHri = null; > >> > >> > > > > >> >> > > for (HRegionInfo hri: hris) { > >> > >> > > > > >> >> > > if > >> (hri.getEncodedName().equals(encodedName)) { > >> > >> > > > > >> >> > > foundHri = hri; > >> > >> > > > > >> >> > > break; > >> > >> > > > > >> >> > > } > >> > >> > > > > >> >> > > } > >> > >> > > > > >> >> > > return foundHri; > >> > >> > > > > >> >> > > > >> > >> > > > > >> >> > > > >> > >> > > > > >> >> > > > >> > >> > > > > >> >> > > > >> > >> > > > > >> >> > > On Sun, Nov 4, 2012 at 6:07 AM, lars hofhansl < > >> > >> > > > > [email protected]> > >> > >> > > > > >> >> > wrote: > >> > >> > > > > >> >> > > > >> > >> > > > > >> >> > > > CC'ing dev list... > >> > >> > > > > >> >> > > > > >> > >> > > > > >> >> > > > Is anybody aware of any changes that went in > >> > recently > >> > >> > that > >> > >> > > > > could > >> > >> > > > > >> >> cause > >> > >> > > > > >> >> > > > this? > >> > >> > > > > >> >> > > > I looked around a bit, but could not find > >> anything > >> > >> > obvious. > >> > >> > > > > >> >> > > > > >> > >> > > > > >> >> > > > -- Lars > >> > >> > > > > >> >> > > > > >> > >> > > > > >> >> > > > > >> > >> > > > > >> >> > > > > >> > >> > > > > >> >> > > > ________________________________ > >> > >> > > > > >> >> > > > From: Matt Corgan <[email protected]> > >> > >> > > > > >> >> > > > To: user <[email protected]> > >> > >> > > > > >> >> > > > Sent: Saturday, November 3, 2012 5:27 PM > >> > >> > > > > >> >> > > > Subject: Re: infinite loop of > RS_ZK_REGION_SPLIT > >> on > >> > >> .94.2 > >> > >> > > > > >> >> > > > > >> > >> > > > > >> >> > > > I think the cluster is ok without running hbck, > >> as > >> > >> > > restarting > >> > >> > > > > the > >> > >> > > > > >> >> > > > regionserver process stops the warnings and > >> > everything > >> > >> > > looks > >> > >> > > > ok > >> > >> > > > > >> >> > > otherwise. > >> > >> > > > > >> >> > > > > >> > >> > > > > >> >> > > > here's the regionserver right after the split > >> > happens: > >> > >> > > > > >> >> > > > ------------------------ > >> > >> > > > > >> >> > > > 2012-11-01 22:45:28,726 DEBUG > >> > >> > > > > >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign: > >> > >> > > > > >> >> > > > regionserver:60020-0x13ab46479832953 Attempting > >> to > >> > >> > > transition > >> > >> > > > > node > >> > >> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c from *RS_ZK_R* > >> > >> > > > > >> >> > > > *EGION_SPLITTING to RS_ZK_REGION_SPLIT* > >> > >> > > > > >> >> > > > 2012-11-01 22:45:28,730 DEBUG > >> > >> > > > > >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign: > >> > >> > > > > >> >> > > > regionserver:60020-0x13ab46479832953 > Successfully > >> > >> > > > transitioned > >> > >> > > > > >> node > >> > >> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_ > >> > >> > > > > >> >> > > > REGION_SPLITTING to RS_ZK_REGION_SPLIT > >> > >> > > > > >> >> > > > 2012-11-01 22:45:28,730 DEBUG > >> > >> > > > > >> >> > > > > >> > org.apache.hadoop.hbase.regionserver.SplitTransaction: > >> > >> > > Still > >> > >> > > > > >> waiting > >> > >> > > > > >> >> on > >> > >> > > > > >> >> > > the > >> > >> > > > > >> >> > > > master to process the split for > >> > >> > > > > bc62a8a72124a4ba3f6b9f302587903c > >> > >> > > > > >> >> > > > 2012-11-01 22:45:28,832 DEBUG > >> > >> > > > > >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign: > >> > >> > > > > >> >> > > > regionserver:60020-0x13ab46479832953 Attempting > >> to > >> > >> > > transition > >> > >> > > > > node > >> > >> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_R > >> > >> > > > > >> >> > > > EGION_SPLIT to RS_ZK_REGION_SPLIT > >> > >> > > > > >> >> > > > 2012-11-01 22:45:28,837 DEBUG > >> > >> > > > > >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign: > >> > >> > > > > >> >> > > > regionserver:60020-0x13ab46479832953 > Successfully > >> > >> > > > transitioned > >> > >> > > > > >> node > >> > >> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_ > >> > >> > > > > >> >> > > > REGION_SPLIT to RS_ZK_REGION_SPLIT > >> > >> > > > > >> >> > > > ----------------------------- > >> > >> > > > > >> >> > > > > >> > >> > > > > >> >> > > > The "transitioned node from RS_ZK_REGION_SPLIT > to > >> > >> > > > > >> RS_ZK_REGION_SPLIT" > >> > >> > > > > >> >> > > > continues for 15 or so hours and finally > settles > >> > >> without > >> > >> > > > manual > >> > >> > > > > >> >> > > > intervention with these regionserver log > >> messages: > >> > >> > > > > >> >> > > > ----------------------- > >> > >> > > > > >> >> > > > 2012-11-02 13:55:00,906 DEBUG > >> > >> > > > > >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign: > >> > >> > > > > >> >> > > > regionserver:60020-0x13ab46479832953 Attempting > >> to > >> > >> > > transition > >> > >> > > > > >> node * > >> > >> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c* from > >> > >> RS_ZK_REGION_SPLIT > >> > >> > > to > >> > >> > > > > >> >> > > > RS_ZK_REGION_SPLIT > >> > >> > > > > >> >> > > > > >> > >> > > > > >> >> > > > 2012-11-02 13:55:00,916 DEBUG > >> > >> > > > > >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign: > >> > >> > > > > >> >> > > > regionserver:60020-0x13ab46479832953 > Successfully > >> > >> > > > transitioned > >> > >> > > > > >> node * > >> > >> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c* from > >> > >> RS_ZK_REGION_SPLIT > >> > >> > > to > >> > >> > > > > >> >> > > > RS_ZK_REGION_SPLIT > >> > >> > > > > >> >> > > > > >> > >> > > > > >> >> > > > 2012-11-02 13:55:00,916 INFO > >> > >> > > > > >> >> > > > > >> org.apache.hadoop.hbase.regionserver.SplitRequest: > >> > >> Region > >> > >> > > > > split, > >> > >> > > > > >> META > >> > >> > > > > >> >> > > > updated, and report to master. > >> > >> > > > > >> >> > > > > >> > >> > > > > >> >> > > > > >> > >> > > > > >> >> > > > >> > >> > > > > >> >> > > >> > >> > > > > >> >> > >> > >> > > > > >> > >> > >> > > > > > >> > >> > > > > >> > >> > > > >> > >> > > >> > >> > >> > > >> > Parent=ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351816858693. > >> > >> > > > > >> >> > > > *bc62a8a72124a4ba3f6b9f302587903c*., new > regions: > >> > >> > > > > >> >> > > > > >> > >> > > > > >> >> > > > > >> > >> > > > > >> >> > > > >> > >> > > > > >> >> > > >> > >> > > > > >> >> > >> > >> > > > > >> > >> > >> > > > > > >> > >> > > > > >> > >> > > > >> > >> > > >> > >> > >> > > >> > ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351824327023.22c3fa48d17aa7312ca53566c680f0fd., > >> > >> > > > > >> >> > > > > >> > >> > > > > >> >> > > > > >> > >> > > > > >> >> > > > >> > >> > > > > >> >> > > >> > >> > > > > >> >> > >> > >> > > > > >> > >> > >> > > > > > >> > >> > > > > >> > >> > > > >> > >> > > >> > >> > >> > > >> > ActiveListingRecord16,\x83\x07\xDC\x07\x11WebsiteIDX\x009024215,1351824327023.b0e0a488c711e5c7f74ee6198a9755a2.. > >> > >> > > > > >> >> > > > Split took 15hrs, 9mins, 33sec > >> > >> > > > > >> >> > > > > >> > >> > > > > >> >> > > > 2012-11-02 13:55:00,945 INFO > >> > >> > > > > >> >> > > > > >> org.apache.hadoop.hbase.regionserver.SplitRequest: > >> > >> Region > >> > >> > > > > split, > >> > >> > > > > >> META > >> > >> > > > > >> >> > > > updated, and report to master. > >> > >> > > > > >> >> > > > > >> > >> > > > > >> >> > > > > >> > >> > > > > >> >> > > > >> > >> > > > > >> >> > > >> > >> > > > > >> >> > >> > >> > > > > >> > >> > >> > > > > > >> > >> > > > > >> > >> > > > >> > >> > > >> > >> > >> > > >> > Parent=ActiveListingRecord16,\x83\x07\xDC\x06\x0EThreeWide\x00SWMRIC-11001540,1351790329631.f720436a6f8fd412d76fe3255f24e3b3., > >> > >> > > > > >> >> > > > new regions: > >> > >> > > > > >> >> > > > > >> > >> > > > > >> >> > > > > >> > >> > > > > >> >> > > > >> > >> > > > > >> >> > > >> > >> > > > > >> >> > >> > >> > > > > >> > >> > >> > > > > > >> > >> > > > > >> > >> > > > >> > >> > > >> > >> > >> > > >> > ActiveListingRecord16,\x83\x07\xDC\x06\x0EThreeWide\x00SWMRIC-11001540,1351816858693.2880cf893175d2a852947c63ee8554a3., > >> > >> > > > > >> >> > > > > >> > >> > > > > >> >> > > >> > >> > > > > >> > >> > >> > > > > >> > >> > > >> > > ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351816858693.* > >> > >> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c*.. Split took > >> > 17hrs, > >> > >> > > 14mins, > >> > >> > > > > 2sec > >> > >> > > > > >> >> > > > ---------------------- > >> > >> > > > > >> >> > > > > >> > >> > > > > >> >> > > > And the master finally logs this: > >> > >> > > > > >> >> > > > ----------------------- > >> > >> > > > > >> >> > > > 2012-11-02 13:55:00,783 WARN > >> > >> > > > > >> >> > > > > org.apache.hadoop.hbase.master.AssignmentManager: > >> > >> Region > >> > >> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c not found on > >> server > >> > >> > > > > >> >> > > > HadoopNode162.hotpads.srv,60020,1351788248279; > >> > failed > >> > >> > > > > processing > >> > >> > > > > >> >> > > > > >> > >> > > > > >> >> > > > 2012-11-02 13:55:00,783 WARN > >> > >> > > > > >> >> > > > > org.apache.hadoop.hbase.master.AssignmentManager: > >> > >> > Received > >> > >> > > > > SPLIT > >> > >> > > > > >> for > >> > >> > > > > >> >> > > region > >> > >> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c from server > >> > >> > > > > >> >> > > > HadoopNode162.hotpads.srv,60020,1351788248279 > >> but it > >> > >> > > doesn't > >> > >> > > > > exist > >> > >> > > > > >> >> > > anymore, > >> > >> > > > > >> >> > > > probably already processed its split > >> > >> > > > > >> >> > > > ------------------------ > >> > >> > > > > >> >> > > > > >> > >> > > > > >> >> > > > I can't find any evidence that the region left > >> the > >> > >> node > >> > >> > > > during > >> > >> > > > > >> that > >> > >> > > > > >> >> > time. > >> > >> > > > > >> >> > > > I'll have to catch it in action next time and > see > >> > what > >> > >> > the > >> > >> > > > > region > >> > >> > > > > >> is > >> > >> > > > > >> >> up > >> > >> > > > > >> >> > > to > >> > >> > > > > >> >> > > > during the problem period. > >> > >> > > > > >> >> > > > > >> > >> > > > > >> >> > > > What does it mean that it successfully > >> transitioned > >> > >> from > >> > >> > > > SPLIT > >> > >> > > > > to > >> > >> > > > > >> >> > SPLIT? > >> > >> > > > > >> >> > > > Is that a valid transition? > >> > >> > > > > >> >> > > > > >> > >> > > > > >> >> > > > Matt > >> > >> > > > > >> >> > > > > >> > >> > > > > >> >> > > > > >> > >> > > > > >> >> > > > > >> > >> > > > > >> >> > > > On Sat, Nov 3, 2012 at 2:55 PM, Ted Yu < > >> > >> > > [email protected]> > >> > >> > > > > >> wrote: > >> > >> > > > > >> >> > > > > >> > >> > > > > >> >> > > > > Matt: > >> > >> > > > > >> >> > > > > From the following we can see that region > >> > >> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c > >> > >> > > > > >> >> > > > > cannot be found: > >> > >> > > > > >> >> > > > > > >> > >> > > > > >> >> > > > > 2012-11-02 00:00:02,909 DEBUG > >> > >> > > > > >> >> > > > > > >> org.apache.hadoop.hbase.master.AssignmentManager: > >> > >> > > Handling > >> > >> > > > > >> >> > > > > transition=RS_ZK_REGION_SPLIT, > >> > >> > > > > >> >> > > > > > >> > >> server=HadoopNode162.hotpads.srv,60020,1351788248279, > >> > >> > > > > >> >> > > > > region=bc62a8a72124a4ba3f6b9f302587903c > >> > >> > > > > >> >> > > > > 2012-11-02 00:00:02,909 WARN > >> > >> > > > > >> >> > > > > > >> org.apache.hadoop.hbase.master.AssignmentManager: > >> > >> > Region > >> > >> > > > > >> >> > > > > bc62a8a72124a4ba3f6b9f302587903c *not found > on > >> > >> server > >> > >> > > > > >> >> > > > > > >> > >> HadoopNode162.hotpads*.srv,60020,1351788248279;failed > >> > >> > > > > >> >> > > > > processing > >> > >> > > > > >> >> > > > > 2012-11-02 00:00:02,909 WARN > >> > >> > > > > >> >> > > > > > >> org.apache.hadoop.hbase.master.AssignmentManager: > >> > >> > > Received > >> > >> > > > > SPLIT > >> > >> > > > > >> >> for > >> > >> > > > > >> >> > > > region > >> > >> > > > > >> >> > > > > bc62a8a72124a4ba3f6b9f302587903c from server > >> > >> > > > > >> >> > > > > HadoopNode162.hotpads.srv,60020,1351788248279 > >> but > >> > it > >> > >> > > > doesn't > >> > >> > > > > >> exist > >> > >> > > > > >> >> > > > anymore, > >> > >> > > > > >> >> > > > > probably already processed its split > >> > >> > > > > >> >> > > > > > >> > >> > > > > >> >> > > > > Have you run hbck to repair your cluster ? > >> > >> > > > > >> >> > > > > > >> > >> > > > > >> >> > > > > Thanks > >> > >> > > > > >> >> > > > > > >> > >> > > > > >> >> > > > > On Sat, Nov 3, 2012 at 2:29 PM, Matt Corgan < > >> > >> > > > > >> [email protected]> > >> > >> > > > > >> >> > > wrote: > >> > >> > > > > >> >> > > > > > >> > >> > > > > >> >> > > > > > Here's a sample of the master's logs from > >> > >> yesterday. > >> > >> > > > It's > >> > >> > > > > not > >> > >> > > > > >> >> > > > correlated > >> > >> > > > > >> >> > > > > > exactly with the other pastebin log, but > >> there's > >> > >> 3GB > >> > >> > of > >> > >> > > > > this > >> > >> > > > > >> from > >> > >> > > > > >> >> > > > > > yesterday: http://pastebin.com/wP2rNN1t > >> > >> > > > > >> >> > > > > > > >> > >> > > > > >> >> > > > > > I'm am pushing the cluster a bit with > >> importing > >> > >> data > >> > >> > so > >> > >> > > > > >> testing > >> > >> > > > > >> >> the > >> > >> > > > > >> >> > > > split > >> > >> > > > > >> >> > > > > > code harder than normal. The regions are > >> > 500-1GB > >> > >> > > > gzipped. > >> > >> > > > > I > >> > >> > > > > >> can > >> > >> > > > > >> >> > > look > >> > >> > > > > >> >> > > > > into > >> > >> > > > > >> >> > > > > > it more but trying to figure out what to > look > >> > for. > >> > >> > > > > >> >> > > > > > > >> > >> > > > > >> >> > > > > > Thanks Ted, > >> > >> > > > > >> >> > > > > > Matt > >> > >> > > > > >> >> > > > > > > >> > >> > > > > >> >> > > > > > > >> > >> > > > > >> >> > > > > > On Sat, Nov 3, 2012 at 2:03 PM, Ted Yu < > >> > >> > > > > [email protected]> > >> > >> > > > > >> >> > wrote: > >> > >> > > > > >> >> > > > > > > >> > >> > > > > >> >> > > > > > > Matt: > >> > >> > > > > >> >> > > > > > > This is the method which made the > logging: > >> > >> > > > > >> >> > > > > > > private static int > >> > >> > tickleNodeSplit(ZooKeeperWatcher > >> > >> > > > > zkw, > >> > >> > > > > >> >> > > > > > > HRegionInfo parent, HRegionInfo a, > >> > >> > HRegionInfo > >> > >> > > b, > >> > >> > > > > >> >> > ServerName > >> > >> > > > > >> >> > > > > > > serverName, > >> > >> > > > > >> >> > > > > > > final int znodeVersion) > >> > >> > > > > >> >> > > > > > > throws KeeperException, IOException { > >> > >> > > > > >> >> > > > > > > byte [] payload = > Writables.getBytes(a, > >> > b); > >> > >> > > > > >> >> > > > > > > return ZKAssign.transitionNode(zkw, > >> > parent, > >> > >> > > > > serverName, > >> > >> > > > > >> >> > > > > > > EventType.RS_ZK_REGION_SPLIT, > >> > >> > > > > >> >> EventType.RS_ZK_REGION_SPLIT, > >> > >> > > > > >> >> > > > > > > znodeVersion, payload); > >> > >> > > > > >> >> > > > > > > } > >> > >> > > > > >> >> > > > > > > > >> > >> > > > > >> >> > > > > > > transitionZKNode() calls > tickleNodeSplit() > >> > when > >> > >> > > waiting > >> > >> > > > > for > >> > >> > > > > >> >> > master > >> > >> > > > > >> >> > > to > >> > >> > > > > >> >> > > > > > split > >> > >> > > > > >> >> > > > > > > the region. Obviously something caused > the > >> > >> master > >> > >> > not > >> > >> > > > > able > >> > >> > > > > >> to > >> > >> > > > > >> >> > > split. > >> > >> > > > > >> >> > > > > > > > >> > >> > > > > >> >> > > > > > > How large is the region ? > >> > >> > > > > >> >> > > > > > > > >> > >> > > > > >> >> > > > > > > Can you pastebin master log for that > >> period of > >> > >> > time ? > >> > >> > > > > >> >> > > > > > > > >> > >> > > > > >> >> > > > > > > Thanks > >> > >> > > > > >> >> > > > > > > > >> > >> > > > > >> >> > > > > > > On Sat, Nov 3, 2012 at 1:54 PM, Matt > >> Corgan < > >> > >> > > > > >> >> [email protected] > >> > >> > > > > >> >> > > > >> > >> > > > > >> >> > > > > wrote: > >> > >> > > > > >> >> > > > > > > > >> > >> > > > > >> >> > > > > > > > We upgraded from .94.0 to .94.2 last > week > >> > and > >> > >> > have > >> > >> > > > > >> started to > >> > >> > > > > >> >> > > > > encounter > >> > >> > > > > >> >> > > > > > > > infinite loops of region-transition on > >> > splits. > >> > >> > I'm > >> > >> > > > not > >> > >> > > > > >> sure > >> > >> > > > > >> >> > yet > >> > >> > > > > >> >> > > if > >> > >> > > > > >> >> > > > > > it's > >> > >> > > > > >> >> > > > > > > > all splits nor if it's related to load. > >> > >> Solution > >> > >> > > so > >> > >> > > > > far > >> > >> > > > > >> has > >> > >> > > > > >> >> > been > >> > >> > > > > >> >> > > > to > >> > >> > > > > >> >> > > > > > > > restart the regionserver process. > >> > >> > > > > >> >> > > > > > > > > >> > >> > > > > >> >> > > > > > > > log snippet: > >> > >> > > > > >> >> > > > > > > > http://pastebin.com/LpienZ7B > >> > >> > > > > >> >> > > > > > > > > >> > >> > > > > >> >> > > > > > > > It's repeating these two lines: > >> > >> > > > > >> >> > > > > > > > 2012-11-02 01:35:33,312 DEBUG > >> > >> > > > > >> >> > > > > > org.apache.hadoop.hbase.zookeeper.ZKAssign: > >> > >> > > > > >> >> > > > > > > > regionserver:60020-0x13ab46479832b76 > >> > >> Attempting > >> > >> > to > >> > >> > > > > >> transition > >> > >> > > > > >> >> > > node > >> > >> > > > > >> >> > > > > > > > cf3e9bc069e1888983c06dc8e053ffcf from > >> > >> > > > > RS_ZK_REGION_SPLIT > >> > >> > > > > >> to > >> > >> > > > > >> >> > > > > > > > RS_ZK_REGION_SPLIT > >> > >> > > > > >> >> > > > > > > > 2012-11-02 01:35:33,364 DEBUG > >> > >> > > > > >> >> > > > > > org.apache.hadoop.hbase.zookeeper.ZKAssign: > >> > >> > > > > >> >> > > > > > > > regionserver:60020-0x13ab46479832b76 > >> > >> Successfully > >> > >> > > > > >> >> transitioned > >> > >> > > > > >> >> > > node > >> > >> > > > > >> >> > > > > > > > cf3e9bc069e1888983c06dc8e053ffcf from > >> > >> > > > > RS_ZK_REGION_SPLIT > >> > >> > > > > >> to > >> > >> > > > > >> >> > > > > > > > RS_ZK_REGION_SPLIT > >> > >> > > > > >> >> > > > > > > > > >> > >> > > > > >> >> > > > > > > > with the occasional: > >> > >> > > > > >> >> > > > > > > > 2012-11-02 01:35:34,476 DEBUG > >> > >> > > > > >> >> > > > > > > > > >> > >> > > > org.apache.hadoop.hbase.regionserver.SplitTransaction: > >> > >> > > > > >> Still > >> > >> > > > > >> >> > > > waiting > >> > >> > > > > >> >> > > > > on > >> > >> > > > > >> >> > > > > > > the > >> > >> > > > > >> >> > > > > > > > master to process the split for > >> > >> > > > > >> >> > cf3e9bc069e1888983c06dc8e053ffcf > >> > >> > > > > >> >> > > > > > > > > >> > >> > > > > >> >> > > > > > > > Should the region transition from > >> > >> > > RS_ZK_REGION_SPLIT > >> > >> > > > to > >> > >> > > > > >> >> itself? > >> > >> > > > > >> >> > > It > >> > >> > > > > >> >> > > > > > looks > >> > >> > > > > >> >> > > > > > > > wrong, but I'm not familiar with the > >> code at > >> > >> all. > >> > >> > > > > >> >> > > > > > > > > >> > >> > > > > >> >> > > > > > > > Thanks, > >> > >> > > > > >> >> > > > > > > > Matt > >> > >> > > > > >> >> > > > > > > > > >> > >> > > > > >> >> > > > > > > > >> > >> > > > > >> >> > > > > > > >> > >> > > > > >> >> > > > > > >> > >> > > > > >> >> > > > > >> > >> > > > > >> >> > > > >> > >> > > > > >> >> > > >> > >> > > > > >> >> > >> > >> > > > > >> > >> > >> > > > > > >> > >> > > > > >> > >> > > > >> > >> > > >> > >> > >> > > > >> > > > >> > > >> > > > > > >
