I dug back further to find the origin of e11e8b030897d6e5b973f8fe892e0eb2
to see if it had previous problems that left ZK in a bad state.  Here's the
regionserver and master logs: http://pastebin.com/qcvHjNCg from about 2
hours earlier.

* Nov 4, 22:34: region is created as daugher b of a split on node 159
* Nov 4, 22:35: moved to from node 159 to 169 by HBaseAdmin.move()
* Nov 5, 00:24: node 169 tries to split the region but gets Failed create
of ephemeral /hbase/unassigned/e11e8b030897d6e5b973f8fe892e0eb2

Is it possible that if something calls HBaseAdmin.move() on a daugher
region that is 30 seconds old, it could move the region but leave that ZK
node in a bad state?



On Tue, Nov 6, 2012 at 7:13 AM, lars hofhansl <[email protected]> wrote:

> Thanks Ram!
>
>
>
>
>
> ----- Original Message -----
> From: ramkrishna vasudevan <[email protected]>
> To: [email protected]
> Cc:
> Sent: Tuesday, November 6, 2012 3:29 AM
> Subject: Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2
>
> Raised HBASE-7103 for the same.
>
> Regards
> Ram
>
> On Tue, Nov 6, 2012 at 3:37 PM, ramkrishna vasudevan <
> [email protected]> wrote:
>
> > Thanks for the logs.
> > I found the reason.
> >
> > The following steps happen
> > -> Initially the parent region P1 starts splitting.
> > -> The split is going on normally.
> > -> Another split starts at the same time for the same region P1. (Not
> sure
> > why this started).
> > -> Rollback happens seeing an already existing node.
> > -> This node gets deleted in rollback and nodeDeleted Event starts.
> > -> In nodeDeleted event the RIT for the region P1 gets deleted.
> > -> Because of this there is no region in RIT.
> > -> Now the first split gets over.  Here the problem is we try to transit
> > the node to SPLITTING to SPLIT. But the node even does not exist.
> > But we don take any action on this.  We think it is successful.
> > -> Even before HBASE-6854 this could have happened.  Will file a JIRA for
> > the same.
> >
> > Regards
> > Ram
> >
> > On Tue, Nov 6, 2012 at 1:42 PM, Matt Corgan <[email protected]> wrote:
> >
> >> Ram, here's the master log corresponding to
> http://pastebin.com/cSdMbA2a.
> >> Looks like e11e8b030897d6e5b973f8fe892e0eb2 was splitting on the
> >> regionserver in question (node 169), so i'm guessing that's
> >> 22f8fa73d8af837410ca270f344f6bb8's mommy.
> >>
> >> btw - you can see my balancer kick in 45 seconds later (runs every 10
> >> minutes) here, but so far i think that's coincidence:
> >> 2012-11-05 00:25:29,893 INFO org.apache.hadoop.hbase.master.HMaster:
> >> BalanceSwitch=false
> >>
> >> I followed the trail of e11e8b030897d6e5b973f8fe892e0eb2 back to node
> 169
> >> and found all this stuff about a failed split:
> >> http://pastebin.com/xtXMZ388 and
> >> an attempted rollback.  Looks like it errors out when it goes to put a
> >> node
> >> in ZK but it's already there.  I'm not familiar with what a good split
> log
> >> looks like, so i'll stop commenting for now...
> >>
> >>
> >> On Mon, Nov 5, 2012 at 10:30 PM, ramkrishna vasudevan <
> >> [email protected]> wrote:
> >>
> >> > The log shows that the first time the region was transitioned to
> >> SPLITTING
> >> > even then it was not populated in the Master's memory.
> >> >
> >> > On Tue, Nov 6, 2012 at 11:29 AM, ramkrishna vasudevan <
> >> > [email protected]> wrote:
> >> >
> >> > > Could you attach the master logs at this time
> >> > > 2012-11-05 00:24:55?
> >> > >
> >> > > Regards
> >> > > Ram
> >> > >
> >> > > On Tue, Nov 6, 2012 at 11:15 AM, lars hofhansl <[email protected]
> >> > >wrote:
> >> > >
> >> > >> Took a brief look through all SPLIT related commits since 0.94.0...
> >> > Found
> >> > >> these:
> >> > >>
> >> > >> HBASE-6854 *
> >> > >> HBASE-6713
> >> > >> HBASE-6329 *
> >> > >>
> >> > >> HBASE-6088
> >> > >>
> >> > >> HBASE-5986
> >> > >> HBASE-6070 *
> >> > >>
> >> > >>
> >> > >> The ones marked with * are (IMHO) more likely to be related.
> >> > >>
> >> > >> -- Lars
> >> > >>
> >> > >> ________________________________
> >> > >> From: Matt Corgan <[email protected]>
> >> > >> To: dev <[email protected]>; lars hofhansl <[email protected]
> >
> >> > >> Sent: Monday, November 5, 2012 9:28 PM
> >> > >> Subject: Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2
> >> > >>
> >> > >> Yeah - we were running .94.0 since it came out but never saw it
> >> there.
> >> > >> I'll keep trying to narrow it down.  The only harm it's causing is
> >> log
> >> > >> spam and failing to move daughters to a new regionserver, which are
> >> > >> definitely problems, but it's not bringing down the cluster.
> >> > >>
> >> > >>
> >> > >> On Mon, Nov 5, 2012 at 9:17 PM, lars hofhansl <[email protected]
> >
> >> > >> wrote:
> >> > >>
> >> > >> > So it seems you can repeat this to some extend in 0.94.2, but you
> >> have
> >> > >> > never seen this before?
> >> > >> >
> >> > >> >
> >> > >> > -- Lars
> >> > >> >
> >> > >> >
> >> > >> >
> >> > >> > ________________________________
> >> > >> >  From: Matt Corgan <[email protected]>
> >> > >> > To: dev <[email protected]>
> >> > >> > Sent: Monday, November 5, 2012 9:10 PM
> >> > >> > Subject: Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2
> >> > >> >
> >> > >> > It happened in this new table that I have all the logs for.  The
> >> > region
> >> > >> in
> >> > >> > question this time is 6839663e4f8f79be3d7469784c21cbc2, and the
> >> first
> >> > >> trace
> >> > >> > of this region is on the regionserver with the "Intantiated
> >> > >> tableName..."
> >> > >> > message
> >> > >> >
> >> > >> > 2012-11-05 22:24:21,162 DEBUG
> >> > >> org.apache.hadoop.hbase.regionserver.HRegion:
> >> > >> > Instantiated
> >> > >> >
> >> > >> >
> >> > >>
> >> >
> >>
> StatAreaModelLink,\x00\x00\x07\xD9\x00\x00\x00\x0C\x00\x00\x00\x004H\xC4\xB5\x00\x00\x00\x02\x00\x00\x00\x05\x00\x00\x00\x00G.l\x9B,1352172257535.6839663e4f8f79be3d74
> >> > >> > 9784c21cbc2.
> >> > >> >
> >> > >> > I also know this region came from a recent split, but I can't
> find
> >> any
> >> > >> log
> >> > >> > messages show the parent finishing the split that created this
> >> > daughter
> >> > >> > region.  So my guess now is that the split is actually finishing
> >> and
> >> > >> > letting clients continue to write data, but something is failing
> to
> >> > >> print
> >> > >> > the log line and correctly tell the master about the new region.
> >> > >> >
> >> > >> > I've noticed that these regions are showing up on the clients in
> >> calls
> >> > >> to
> >> > >> > HTable.getRegionLocations(), so the clients continue to function,
> >> but
> >> > >> if I
> >> > >> > call HBaseAdmin.move() i get an UnknownRegionException.
> >> > >> >
> >> > >> >
> >> > >> > On Mon, Nov 5, 2012 at 7:07 PM, Ted Yu <[email protected]>
> >> wrote:
> >> > >> >
> >> > >> > > I think what Matt encountered was this:
> >> > >> > > https://issues.apache.org/jira/browse/HBASE-7101
> >> > >> > >
> >> > >> > > On Mon, Nov 5, 2012 at 5:33 PM, Matt Corgan <
> [email protected]
> >> >
> >> > >> wrote:
> >> > >> > >
> >> > >> > > > Beats me - i can't find any record of it before that.
> >> > >> > > >
> >> > >> > > > I'm importing data into another table now.  I
> disabled/enabled
> >> the
> >> > >> > table
> >> > >> > > > first to make sure we have the original 4 region locations
> >> logged
> >> > >> > > > everywhere.  Will report back...
> >> > >> > > >
> >> > >> > > >
> >> > >> > > > On Mon, Nov 5, 2012 at 2:15 PM, Jean-Daniel Cryans <
> >> > >> > [email protected]
> >> > >> > > > >wrote:
> >> > >> > > >
> >> > >> > > > > Wow, and can you figure how this happened?
> >> > >> > > > >
> >> > >> > > > > 2012-11-05 00:24:54,538 DEBUG
> >> > >> > > > > org.apache.hadoop.hbase.regionserver.HRegion: Instantiated
> >> > >> > > > >
> >> > >> > > > >
> >> > >> > > >
> >> > >> > >
> >> > >> >
> >> > >>
> >> >
> >>
> ActiveListingRecord16,\x86\x07\xDC\x03\x17RealtyTrac\x0044737383,1352093084141.22f8fa73d8af837410ca270f344f6bb8.
> >> > >> > > > >
> >> > >> > > > > Split? Else how did the master told that RS to open a
> region
> >> > that
> >> > >> > > doesn't
> >> > >> > > > > exist?
> >> > >> > > > >
> >> > >> > > > > On Mon, Nov 5, 2012 at 2:02 PM, Matt Corgan <
> >> > [email protected]>
> >> > >> > > wrote:
> >> > >> > > > > > Here's one from last night with the master logs at the
> >> bottom.
> >> > >> > > > > > http://pastebin.com/cSdMbA2a
> >> > >> > > > > >
> >> > >> > > > > > I don't see that region in the master logs for 4 days
> back.
> >> >  I'm
> >> > >> > > going
> >> > >> > > > to
> >> > >> > > > > > import some new data from scratch soon and will be sure
> to
> >> > keep
> >> > >> all
> >> > >> > > the
> >> > >> > > > > > master logs.
> >> > >> > > > > >
> >> > >> > > > > >
> >> > >> > > > > > On Mon, Nov 5, 2012 at 9:52 AM, Jean-Daniel Cryans <
> >> > >> > > > [email protected]
> >> > >> > > > > >wrote:
> >> > >> > > > > >
> >> > >> > > > > >> This reminds me a lot of
> >> > >> > > > > https://issues.apache.org/jira/browse/HBASE-4792
> >> > >> > > > > >>
> >> > >> > > > > >> What I'd like to see is the log from the first time the
> >> > master
> >> > >> > > > > >> receives the split message. I guess it says the region
> >> > doesn't
> >> > >> > exist
> >> > >> > > > > >> anymore because the split was processed already in the
> >> master
> >> > >> but
> >> > >> > > > > >> there's a failure mode similar to 4792.
> >> > >> > > > > >>
> >> > >> > > > > >> I saw this on another cluster last week but the logs
> were
> >> > >> rolling
> >> > >> > > > > >> based on size so the original split message was lost.
> >> > >> > > > > >>
> >> > >> > > > > >> J-D
> >> > >> > > > > >>
> >> > >> > > > > >> On Sun, Nov 4, 2012 at 5:16 PM, Matt Corgan <
> >> > >> [email protected]>
> >> > >> > > > > wrote:
> >> > >> > > > > >> > I've had a few successful splits today without any
> >> errors.
> >> > >> I'll
> >> > >> > > > turn
> >> > >> > > > > up
> >> > >> > > > > >> > the importer speed tomorrow to start stressing the
> >> cluster
> >> > >> more.
> >> > >> > > > > >> >
> >> > >> > > > > >> >
> >> > >> > > > > >> > On Sun, Nov 4, 2012 at 2:07 PM, Ted Yu <
> >> > [email protected]>
> >> > >> > > wrote:
> >> > >> > > > > >> >
> >> > >> > > > > >> >> Matt:
> >> > >> > > > > >> >> Was there any region server crash before you noticed
> >> the
> >> > >> > repeated
> >> > >> > > > log
> >> > >> > > > > >> for
> >> > >> > > > > >> >> bc62a8a72124a4ba3f6b9f302587903c ?
> >> > >> > > > > >> >>
> >> > >> > > > > >> >> I wonder if the following JIRA might be related:
> >> > >> > > > > >> >> HBASE-7083 SSH#fixupDaughter should force re-assign
> >> > missing
> >> > >> > > > daughter
> >> > >> > > > > >> >>
> >> > >> > > > > >> >> Thanks
> >> > >> > > > > >> >>
> >> > >> > > > > >> >> On Sat, Nov 3, 2012 at 7:27 PM, Matt Corgan <
> >> > >> > [email protected]
> >> > >> > > >
> >> > >> > > > > >> wrote:
> >> > >> > > > > >> >>
> >> > >> > > > > >> >> > Strangely, I don't see any record of that region in
> >> the
> >> > >> > master
> >> > >> > > > > before
> >> > >> > > > > >> >> what
> >> > >> > > > > >> >> > I already pasted even though I have logs back to
> >> 10/30.
> >> > >> Next
> >> > >> > > > time
> >> > >> > > > > it
> >> > >> > > > > >> >> > happens I'll gather a full log record and try to
> >> debug
> >> > >> while
> >> > >> > > it's
> >> > >> > > > > >> >> > occurring.
> >> > >> > > > > >> >> >
> >> > >> > > > > >> >> >
> >> > >> > > > > >> >> > On Sat, Nov 3, 2012 at 7:10 PM, rajesh babu
> >> > chintaguntla <
> >> > >> > > > > >> >> > [email protected]> wrote:
> >> > >> > > > > >> >> >
> >> > >> > > > > >> >> > > Hi Matt,
> >> > >> > > > > >> >> > > can you paste some more master logs of region
> >> > >> > > > > >> >> > > bc62a8a72124a4ba3f6b9f30258790 before split.
> >> > >> > > > > >> >> > > I think Its not problem with splitting.
> >> > >> > > > > >> >> > > We are getting
> >> > >> > > > > >> >> > >       LOG.warn("Region " + encodedName + " not
> >> found
> >> > on
> >> > >> > > server
> >> > >> > > > "
> >> > >> > > > > +
> >> > >> > > > > >> >> > > serverName +
> >> > >> > > > > >> >> > >         "; failed processing");
> >> > >> > > > > >> >> > > this log means no entry in servers map(not fully
> >> > >> assigned).
> >> > >> > > > > >> >> > >     Set<HRegionInfo> hris = this.servers.get(sn);
> >> > >> > > > > >> >> > >     HRegionInfo foundHri = null;
> >> > >> > > > > >> >> > >     for (HRegionInfo hri: hris) {
> >> > >> > > > > >> >> > >       if
> >> (hri.getEncodedName().equals(encodedName)) {
> >> > >> > > > > >> >> > >         foundHri = hri;
> >> > >> > > > > >> >> > >         break;
> >> > >> > > > > >> >> > >       }
> >> > >> > > > > >> >> > >     }
> >> > >> > > > > >> >> > >     return foundHri;
> >> > >> > > > > >> >> > >
> >> > >> > > > > >> >> > >
> >> > >> > > > > >> >> > >
> >> > >> > > > > >> >> > >
> >> > >> > > > > >> >> > > On Sun, Nov 4, 2012 at 6:07 AM, lars hofhansl <
> >> > >> > > > > [email protected]>
> >> > >> > > > > >> >> > wrote:
> >> > >> > > > > >> >> > >
> >> > >> > > > > >> >> > > > CC'ing dev list...
> >> > >> > > > > >> >> > > >
> >> > >> > > > > >> >> > > > Is anybody aware of any changes that went in
> >> > recently
> >> > >> > that
> >> > >> > > > > could
> >> > >> > > > > >> >> cause
> >> > >> > > > > >> >> > > > this?
> >> > >> > > > > >> >> > > > I looked around a bit, but could not find
> >> anything
> >> > >> > obvious.
> >> > >> > > > > >> >> > > >
> >> > >> > > > > >> >> > > > -- Lars
> >> > >> > > > > >> >> > > >
> >> > >> > > > > >> >> > > >
> >> > >> > > > > >> >> > > >
> >> > >> > > > > >> >> > > > ________________________________
> >> > >> > > > > >> >> > > >  From: Matt Corgan <[email protected]>
> >> > >> > > > > >> >> > > > To: user <[email protected]>
> >> > >> > > > > >> >> > > > Sent: Saturday, November 3, 2012 5:27 PM
> >> > >> > > > > >> >> > > > Subject: Re: infinite loop of
> RS_ZK_REGION_SPLIT
> >> on
> >> > >> .94.2
> >> > >> > > > > >> >> > > >
> >> > >> > > > > >> >> > > > I think the cluster is ok without running hbck,
> >> as
> >> > >> > > restarting
> >> > >> > > > > the
> >> > >> > > > > >> >> > > > regionserver process stops the warnings and
> >> > everything
> >> > >> > > looks
> >> > >> > > > ok
> >> > >> > > > > >> >> > > otherwise.
> >> > >> > > > > >> >> > > >
> >> > >> > > > > >> >> > > > here's the regionserver right after the split
> >> > happens:
> >> > >> > > > > >> >> > > > ------------------------
> >> > >> > > > > >> >> > > > 2012-11-01 22:45:28,726 DEBUG
> >> > >> > > > > >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> >> > >> > > > > >> >> > > > regionserver:60020-0x13ab46479832953 Attempting
> >> to
> >> > >> > > transition
> >> > >> > > > > node
> >> > >> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c from *RS_ZK_R*
> >> > >> > > > > >> >> > > > *EGION_SPLITTING to RS_ZK_REGION_SPLIT*
> >> > >> > > > > >> >> > > > 2012-11-01 22:45:28,730 DEBUG
> >> > >> > > > > >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> >> > >> > > > > >> >> > > > regionserver:60020-0x13ab46479832953
> Successfully
> >> > >> > > > transitioned
> >> > >> > > > > >> node
> >> > >> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_
> >> > >> > > > > >> >> > > > REGION_SPLITTING to RS_ZK_REGION_SPLIT
> >> > >> > > > > >> >> > > > 2012-11-01 22:45:28,730 DEBUG
> >> > >> > > > > >> >> > > >
> >> > org.apache.hadoop.hbase.regionserver.SplitTransaction:
> >> > >> > > Still
> >> > >> > > > > >> waiting
> >> > >> > > > > >> >> on
> >> > >> > > > > >> >> > > the
> >> > >> > > > > >> >> > > > master to process the split for
> >> > >> > > > > bc62a8a72124a4ba3f6b9f302587903c
> >> > >> > > > > >> >> > > > 2012-11-01 22:45:28,832 DEBUG
> >> > >> > > > > >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> >> > >> > > > > >> >> > > > regionserver:60020-0x13ab46479832953 Attempting
> >> to
> >> > >> > > transition
> >> > >> > > > > node
> >> > >> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_R
> >> > >> > > > > >> >> > > > EGION_SPLIT to RS_ZK_REGION_SPLIT
> >> > >> > > > > >> >> > > > 2012-11-01 22:45:28,837 DEBUG
> >> > >> > > > > >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> >> > >> > > > > >> >> > > > regionserver:60020-0x13ab46479832953
> Successfully
> >> > >> > > > transitioned
> >> > >> > > > > >> node
> >> > >> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_
> >> > >> > > > > >> >> > > > REGION_SPLIT to RS_ZK_REGION_SPLIT
> >> > >> > > > > >> >> > > > -----------------------------
> >> > >> > > > > >> >> > > >
> >> > >> > > > > >> >> > > > The "transitioned node from RS_ZK_REGION_SPLIT
> to
> >> > >> > > > > >> RS_ZK_REGION_SPLIT"
> >> > >> > > > > >> >> > > > continues for 15 or so hours and finally
> settles
> >> > >> without
> >> > >> > > > manual
> >> > >> > > > > >> >> > > > intervention with these regionserver log
> >> messages:
> >> > >> > > > > >> >> > > > -----------------------
> >> > >> > > > > >> >> > > > 2012-11-02 13:55:00,906 DEBUG
> >> > >> > > > > >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> >> > >> > > > > >> >> > > > regionserver:60020-0x13ab46479832953 Attempting
> >> to
> >> > >> > > transition
> >> > >> > > > > >> node *
> >> > >> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c* from
> >> > >> RS_ZK_REGION_SPLIT
> >> > >> > > to
> >> > >> > > > > >> >> > > > RS_ZK_REGION_SPLIT
> >> > >> > > > > >> >> > > >
> >> > >> > > > > >> >> > > > 2012-11-02 13:55:00,916 DEBUG
> >> > >> > > > > >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> >> > >> > > > > >> >> > > > regionserver:60020-0x13ab46479832953
> Successfully
> >> > >> > > > transitioned
> >> > >> > > > > >> node *
> >> > >> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c* from
> >> > >> RS_ZK_REGION_SPLIT
> >> > >> > > to
> >> > >> > > > > >> >> > > > RS_ZK_REGION_SPLIT
> >> > >> > > > > >> >> > > >
> >> > >> > > > > >> >> > > > 2012-11-02 13:55:00,916 INFO
> >> > >> > > > > >> >> > > >
> >> org.apache.hadoop.hbase.regionserver.SplitRequest:
> >> > >> Region
> >> > >> > > > > split,
> >> > >> > > > > >> META
> >> > >> > > > > >> >> > > > updated, and report to master.
> >> > >> > > > > >> >> > > >
> >> > >> > > > > >> >> > > >
> >> > >> > > > > >> >> > >
> >> > >> > > > > >> >> >
> >> > >> > > > > >> >>
> >> > >> > > > > >>
> >> > >> > > > >
> >> > >> > > >
> >> > >> > >
> >> > >> >
> >> > >>
> >> >
> >>
> Parent=ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351816858693.
> >> > >> > > > > >> >> > > > *bc62a8a72124a4ba3f6b9f302587903c*., new
> regions:
> >> > >> > > > > >> >> > > >
> >> > >> > > > > >> >> > > >
> >> > >> > > > > >> >> > >
> >> > >> > > > > >> >> >
> >> > >> > > > > >> >>
> >> > >> > > > > >>
> >> > >> > > > >
> >> > >> > > >
> >> > >> > >
> >> > >> >
> >> > >>
> >> >
> >>
> ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351824327023.22c3fa48d17aa7312ca53566c680f0fd.,
> >> > >> > > > > >> >> > > >
> >> > >> > > > > >> >> > > >
> >> > >> > > > > >> >> > >
> >> > >> > > > > >> >> >
> >> > >> > > > > >> >>
> >> > >> > > > > >>
> >> > >> > > > >
> >> > >> > > >
> >> > >> > >
> >> > >> >
> >> > >>
> >> >
> >>
> ActiveListingRecord16,\x83\x07\xDC\x07\x11WebsiteIDX\x009024215,1351824327023.b0e0a488c711e5c7f74ee6198a9755a2..
> >> > >> > > > > >> >> > > > Split took 15hrs, 9mins, 33sec
> >> > >> > > > > >> >> > > >
> >> > >> > > > > >> >> > > > 2012-11-02 13:55:00,945 INFO
> >> > >> > > > > >> >> > > >
> >> org.apache.hadoop.hbase.regionserver.SplitRequest:
> >> > >> Region
> >> > >> > > > > split,
> >> > >> > > > > >> META
> >> > >> > > > > >> >> > > > updated, and report to master.
> >> > >> > > > > >> >> > > >
> >> > >> > > > > >> >> > > >
> >> > >> > > > > >> >> > >
> >> > >> > > > > >> >> >
> >> > >> > > > > >> >>
> >> > >> > > > > >>
> >> > >> > > > >
> >> > >> > > >
> >> > >> > >
> >> > >> >
> >> > >>
> >> >
> >>
> Parent=ActiveListingRecord16,\x83\x07\xDC\x06\x0EThreeWide\x00SWMRIC-11001540,1351790329631.f720436a6f8fd412d76fe3255f24e3b3.,
> >> > >> > > > > >> >> > > > new regions:
> >> > >> > > > > >> >> > > >
> >> > >> > > > > >> >> > > >
> >> > >> > > > > >> >> > >
> >> > >> > > > > >> >> >
> >> > >> > > > > >> >>
> >> > >> > > > > >>
> >> > >> > > > >
> >> > >> > > >
> >> > >> > >
> >> > >> >
> >> > >>
> >> >
> >>
> ActiveListingRecord16,\x83\x07\xDC\x06\x0EThreeWide\x00SWMRIC-11001540,1351816858693.2880cf893175d2a852947c63ee8554a3.,
> >> > >> > > > > >> >> > > >
> >> > >> > > > > >> >> >
> >> > >> > > > > >>
> >> > >> > > >
> >> > >> >
> >> >
> ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351816858693.*
> >> > >> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c*.. Split took
> >> > 17hrs,
> >> > >> > > 14mins,
> >> > >> > > > > 2sec
> >> > >> > > > > >> >> > > > ----------------------
> >> > >> > > > > >> >> > > >
> >> > >> > > > > >> >> > > > And the master finally logs this:
> >> > >> > > > > >> >> > > > -----------------------
> >> > >> > > > > >> >> > > > 2012-11-02 13:55:00,783 WARN
> >> > >> > > > > >> >> > > >
> org.apache.hadoop.hbase.master.AssignmentManager:
> >> > >> Region
> >> > >> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c not found on
> >> server
> >> > >> > > > > >> >> > > > HadoopNode162.hotpads.srv,60020,1351788248279;
> >> > failed
> >> > >> > > > > processing
> >> > >> > > > > >> >> > > >
> >> > >> > > > > >> >> > > > 2012-11-02 13:55:00,783 WARN
> >> > >> > > > > >> >> > > >
> org.apache.hadoop.hbase.master.AssignmentManager:
> >> > >> > Received
> >> > >> > > > > SPLIT
> >> > >> > > > > >> for
> >> > >> > > > > >> >> > > region
> >> > >> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c from server
> >> > >> > > > > >> >> > > > HadoopNode162.hotpads.srv,60020,1351788248279
> >> but it
> >> > >> > > doesn't
> >> > >> > > > > exist
> >> > >> > > > > >> >> > > anymore,
> >> > >> > > > > >> >> > > > probably already processed its split
> >> > >> > > > > >> >> > > > ------------------------
> >> > >> > > > > >> >> > > >
> >> > >> > > > > >> >> > > > I can't find any evidence that the region left
> >> the
> >> > >> node
> >> > >> > > > during
> >> > >> > > > > >> that
> >> > >> > > > > >> >> > time.
> >> > >> > > > > >> >> > > > I'll have to catch it in action next time and
> see
> >> > what
> >> > >> > the
> >> > >> > > > > region
> >> > >> > > > > >> is
> >> > >> > > > > >> >> up
> >> > >> > > > > >> >> > > to
> >> > >> > > > > >> >> > > > during the problem period.
> >> > >> > > > > >> >> > > >
> >> > >> > > > > >> >> > > > What does it mean that it successfully
> >> transitioned
> >> > >> from
> >> > >> > > > SPLIT
> >> > >> > > > > to
> >> > >> > > > > >> >> > SPLIT?
> >> > >> > > > > >> >> > > > Is that a valid transition?
> >> > >> > > > > >> >> > > >
> >> > >> > > > > >> >> > > > Matt
> >> > >> > > > > >> >> > > >
> >> > >> > > > > >> >> > > >
> >> > >> > > > > >> >> > > >
> >> > >> > > > > >> >> > > > On Sat, Nov 3, 2012 at 2:55 PM, Ted Yu <
> >> > >> > > [email protected]>
> >> > >> > > > > >> wrote:
> >> > >> > > > > >> >> > > >
> >> > >> > > > > >> >> > > > > Matt:
> >> > >> > > > > >> >> > > > > From the following we can see that region
> >> > >> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c
> >> > >> > > > > >> >> > > > > cannot be found:
> >> > >> > > > > >> >> > > > >
> >> > >> > > > > >> >> > > > > 2012-11-02 00:00:02,909 DEBUG
> >> > >> > > > > >> >> > > > >
> >> org.apache.hadoop.hbase.master.AssignmentManager:
> >> > >> > > Handling
> >> > >> > > > > >> >> > > > > transition=RS_ZK_REGION_SPLIT,
> >> > >> > > > > >> >> > > > >
> >> > >> server=HadoopNode162.hotpads.srv,60020,1351788248279,
> >> > >> > > > > >> >> > > > > region=bc62a8a72124a4ba3f6b9f302587903c
> >> > >> > > > > >> >> > > > > 2012-11-02 00:00:02,909 WARN
> >> > >> > > > > >> >> > > > >
> >> org.apache.hadoop.hbase.master.AssignmentManager:
> >> > >> > Region
> >> > >> > > > > >> >> > > > > bc62a8a72124a4ba3f6b9f302587903c *not found
> on
> >> > >> server
> >> > >> > > > > >> >> > > > >
> >> > >> HadoopNode162.hotpads*.srv,60020,1351788248279;failed
> >> > >> > > > > >> >> > > > > processing
> >> > >> > > > > >> >> > > > > 2012-11-02 00:00:02,909 WARN
> >> > >> > > > > >> >> > > > >
> >> org.apache.hadoop.hbase.master.AssignmentManager:
> >> > >> > > Received
> >> > >> > > > > SPLIT
> >> > >> > > > > >> >> for
> >> > >> > > > > >> >> > > > region
> >> > >> > > > > >> >> > > > > bc62a8a72124a4ba3f6b9f302587903c from server
> >> > >> > > > > >> >> > > > > HadoopNode162.hotpads.srv,60020,1351788248279
> >> but
> >> > it
> >> > >> > > > doesn't
> >> > >> > > > > >> exist
> >> > >> > > > > >> >> > > > anymore,
> >> > >> > > > > >> >> > > > > probably already processed its split
> >> > >> > > > > >> >> > > > >
> >> > >> > > > > >> >> > > > > Have you run hbck to repair your cluster ?
> >> > >> > > > > >> >> > > > >
> >> > >> > > > > >> >> > > > > Thanks
> >> > >> > > > > >> >> > > > >
> >> > >> > > > > >> >> > > > > On Sat, Nov 3, 2012 at 2:29 PM, Matt Corgan <
> >> > >> > > > > >> [email protected]>
> >> > >> > > > > >> >> > > wrote:
> >> > >> > > > > >> >> > > > >
> >> > >> > > > > >> >> > > > > > Here's a sample of the master's logs from
> >> > >> yesterday.
> >> > >> > > >  It's
> >> > >> > > > > not
> >> > >> > > > > >> >> > > > correlated
> >> > >> > > > > >> >> > > > > > exactly with the other pastebin log, but
> >> there's
> >> > >> 3GB
> >> > >> > of
> >> > >> > > > > this
> >> > >> > > > > >> from
> >> > >> > > > > >> >> > > > > > yesterday: http://pastebin.com/wP2rNN1t
> >> > >> > > > > >> >> > > > > >
> >> > >> > > > > >> >> > > > > > I'm am pushing the cluster a bit with
> >> importing
> >> > >> data
> >> > >> > so
> >> > >> > > > > >> testing
> >> > >> > > > > >> >> the
> >> > >> > > > > >> >> > > > split
> >> > >> > > > > >> >> > > > > > code harder than normal.  The regions are
> >> > 500-1GB
> >> > >> > > > gzipped.
> >> > >> > > > >  I
> >> > >> > > > > >> can
> >> > >> > > > > >> >> > > look
> >> > >> > > > > >> >> > > > > into
> >> > >> > > > > >> >> > > > > > it more but trying to figure out what to
> look
> >> > for.
> >> > >> > > > > >> >> > > > > >
> >> > >> > > > > >> >> > > > > > Thanks Ted,
> >> > >> > > > > >> >> > > > > > Matt
> >> > >> > > > > >> >> > > > > >
> >> > >> > > > > >> >> > > > > >
> >> > >> > > > > >> >> > > > > > On Sat, Nov 3, 2012 at 2:03 PM, Ted Yu <
> >> > >> > > > > [email protected]>
> >> > >> > > > > >> >> > wrote:
> >> > >> > > > > >> >> > > > > >
> >> > >> > > > > >> >> > > > > > > Matt:
> >> > >> > > > > >> >> > > > > > > This is the method which made the
> logging:
> >> > >> > > > > >> >> > > > > > >   private static int
> >> > >> > tickleNodeSplit(ZooKeeperWatcher
> >> > >> > > > > zkw,
> >> > >> > > > > >> >> > > > > > >       HRegionInfo parent, HRegionInfo a,
> >> > >> > HRegionInfo
> >> > >> > > b,
> >> > >> > > > > >> >> > ServerName
> >> > >> > > > > >> >> > > > > > > serverName,
> >> > >> > > > > >> >> > > > > > >       final int znodeVersion)
> >> > >> > > > > >> >> > > > > > >   throws KeeperException, IOException {
> >> > >> > > > > >> >> > > > > > >     byte [] payload =
> Writables.getBytes(a,
> >> > b);
> >> > >> > > > > >> >> > > > > > >     return ZKAssign.transitionNode(zkw,
> >> > parent,
> >> > >> > > > > serverName,
> >> > >> > > > > >> >> > > > > > >       EventType.RS_ZK_REGION_SPLIT,
> >> > >> > > > > >> >> EventType.RS_ZK_REGION_SPLIT,
> >> > >> > > > > >> >> > > > > > >       znodeVersion, payload);
> >> > >> > > > > >> >> > > > > > >   }
> >> > >> > > > > >> >> > > > > > >
> >> > >> > > > > >> >> > > > > > > transitionZKNode() calls
> tickleNodeSplit()
> >> > when
> >> > >> > > waiting
> >> > >> > > > > for
> >> > >> > > > > >> >> > master
> >> > >> > > > > >> >> > > to
> >> > >> > > > > >> >> > > > > > split
> >> > >> > > > > >> >> > > > > > > the region. Obviously something caused
> the
> >> > >> master
> >> > >> > not
> >> > >> > > > > able
> >> > >> > > > > >> to
> >> > >> > > > > >> >> > > split.
> >> > >> > > > > >> >> > > > > > >
> >> > >> > > > > >> >> > > > > > > How large is the region ?
> >> > >> > > > > >> >> > > > > > >
> >> > >> > > > > >> >> > > > > > > Can you pastebin master log for that
> >> period of
> >> > >> > time ?
> >> > >> > > > > >> >> > > > > > >
> >> > >> > > > > >> >> > > > > > > Thanks
> >> > >> > > > > >> >> > > > > > >
> >> > >> > > > > >> >> > > > > > > On Sat, Nov 3, 2012 at 1:54 PM, Matt
> >> Corgan <
> >> > >> > > > > >> >> [email protected]
> >> > >> > > > > >> >> > >
> >> > >> > > > > >> >> > > > > wrote:
> >> > >> > > > > >> >> > > > > > >
> >> > >> > > > > >> >> > > > > > > > We upgraded from .94.0 to .94.2 last
> week
> >> > and
> >> > >> > have
> >> > >> > > > > >> started to
> >> > >> > > > > >> >> > > > > encounter
> >> > >> > > > > >> >> > > > > > > > infinite loops of region-transition on
> >> > splits.
> >> > >> > I'm
> >> > >> > > > not
> >> > >> > > > > >> sure
> >> > >> > > > > >> >> > yet
> >> > >> > > > > >> >> > > if
> >> > >> > > > > >> >> > > > > > it's
> >> > >> > > > > >> >> > > > > > > > all splits nor if it's related to load.
> >> > >> Solution
> >> > >> > > so
> >> > >> > > > > far
> >> > >> > > > > >> has
> >> > >> > > > > >> >> > been
> >> > >> > > > > >> >> > > > to
> >> > >> > > > > >> >> > > > > > > > restart the regionserver process.
> >> > >> > > > > >> >> > > > > > > >
> >> > >> > > > > >> >> > > > > > > > log snippet:
> >> > >> > > > > >> >> > > > > > > > http://pastebin.com/LpienZ7B
> >> > >> > > > > >> >> > > > > > > >
> >> > >> > > > > >> >> > > > > > > > It's repeating these two lines:
> >> > >> > > > > >> >> > > > > > > > 2012-11-02 01:35:33,312 DEBUG
> >> > >> > > > > >> >> > > > > > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> >> > >> > > > > >> >> > > > > > > > regionserver:60020-0x13ab46479832b76
> >> > >> Attempting
> >> > >> > to
> >> > >> > > > > >> transition
> >> > >> > > > > >> >> > > node
> >> > >> > > > > >> >> > > > > > > > cf3e9bc069e1888983c06dc8e053ffcf from
> >> > >> > > > > RS_ZK_REGION_SPLIT
> >> > >> > > > > >> to
> >> > >> > > > > >> >> > > > > > > > RS_ZK_REGION_SPLIT
> >> > >> > > > > >> >> > > > > > > > 2012-11-02 01:35:33,364 DEBUG
> >> > >> > > > > >> >> > > > > > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> >> > >> > > > > >> >> > > > > > > > regionserver:60020-0x13ab46479832b76
> >> > >> Successfully
> >> > >> > > > > >> >> transitioned
> >> > >> > > > > >> >> > > node
> >> > >> > > > > >> >> > > > > > > > cf3e9bc069e1888983c06dc8e053ffcf from
> >> > >> > > > > RS_ZK_REGION_SPLIT
> >> > >> > > > > >> to
> >> > >> > > > > >> >> > > > > > > > RS_ZK_REGION_SPLIT
> >> > >> > > > > >> >> > > > > > > >
> >> > >> > > > > >> >> > > > > > > > with the occasional:
> >> > >> > > > > >> >> > > > > > > > 2012-11-02 01:35:34,476 DEBUG
> >> > >> > > > > >> >> > > > > > > >
> >> > >> > > > org.apache.hadoop.hbase.regionserver.SplitTransaction:
> >> > >> > > > > >> Still
> >> > >> > > > > >> >> > > > waiting
> >> > >> > > > > >> >> > > > > on
> >> > >> > > > > >> >> > > > > > > the
> >> > >> > > > > >> >> > > > > > > > master to process the split for
> >> > >> > > > > >> >> > cf3e9bc069e1888983c06dc8e053ffcf
> >> > >> > > > > >> >> > > > > > > >
> >> > >> > > > > >> >> > > > > > > > Should the region transition from
> >> > >> > > RS_ZK_REGION_SPLIT
> >> > >> > > > to
> >> > >> > > > > >> >> itself?
> >> > >> > > > > >> >> > >  It
> >> > >> > > > > >> >> > > > > > looks
> >> > >> > > > > >> >> > > > > > > > wrong, but I'm not familiar with the
> >> code at
> >> > >> all.
> >> > >> > > > > >> >> > > > > > > >
> >> > >> > > > > >> >> > > > > > > > Thanks,
> >> > >> > > > > >> >> > > > > > > > Matt
> >> > >> > > > > >> >> > > > > > > >
> >> > >> > > > > >> >> > > > > > >
> >> > >> > > > > >> >> > > > > >
> >> > >> > > > > >> >> > > > >
> >> > >> > > > > >> >> > > >
> >> > >> > > > > >> >> > >
> >> > >> > > > > >> >> >
> >> > >> > > > > >> >>
> >> > >> > > > > >>
> >> > >> > > > >
> >> > >> > > >
> >> > >> > >
> >> > >> >
> >> > >>
> >> > >
> >> > >
> >> >
> >>
> >
> >
>
>

Reply via email to