Strangely, I don't see any record of that region in the master before what
I already pasted even though I have logs back to 10/30.  Next time it
happens I'll gather a full log record and try to debug while it's occurring.


On Sat, Nov 3, 2012 at 7:10 PM, rajesh babu chintaguntla <
[email protected]> wrote:

> Hi Matt,
> can you paste some more master logs of region
> bc62a8a72124a4ba3f6b9f30258790 before split.
> I think Its not problem with splitting.
> We are getting
>       LOG.warn("Region " + encodedName + " not found on server " +
> serverName +
>         "; failed processing");
> this log means no entry in servers map(not fully assigned).
>     Set<HRegionInfo> hris = this.servers.get(sn);
>     HRegionInfo foundHri = null;
>     for (HRegionInfo hri: hris) {
>       if (hri.getEncodedName().equals(encodedName)) {
>         foundHri = hri;
>         break;
>       }
>     }
>     return foundHri;
>
>
>
>
> On Sun, Nov 4, 2012 at 6:07 AM, lars hofhansl <[email protected]> wrote:
>
> > CC'ing dev list...
> >
> > Is anybody aware of any changes that went in recently that could cause
> > this?
> > I looked around a bit, but could not find anything obvious.
> >
> > -- Lars
> >
> >
> >
> > ________________________________
> >  From: Matt Corgan <[email protected]>
> > To: user <[email protected]>
> > Sent: Saturday, November 3, 2012 5:27 PM
> > Subject: Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2
> >
> > I think the cluster is ok without running hbck, as restarting the
> > regionserver process stops the warnings and everything looks ok
> otherwise.
> >
> > here's the regionserver right after the split happens:
> > ------------------------
> > 2012-11-01 22:45:28,726 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > regionserver:60020-0x13ab46479832953 Attempting to transition node
> > bc62a8a72124a4ba3f6b9f302587903c from *RS_ZK_R*
> > *EGION_SPLITTING to RS_ZK_REGION_SPLIT*
> > 2012-11-01 22:45:28,730 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > regionserver:60020-0x13ab46479832953 Successfully transitioned node
> > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_
> > REGION_SPLITTING to RS_ZK_REGION_SPLIT
> > 2012-11-01 22:45:28,730 DEBUG
> > org.apache.hadoop.hbase.regionserver.SplitTransaction: Still waiting on
> the
> > master to process the split for bc62a8a72124a4ba3f6b9f302587903c
> > 2012-11-01 22:45:28,832 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > regionserver:60020-0x13ab46479832953 Attempting to transition node
> > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_R
> > EGION_SPLIT to RS_ZK_REGION_SPLIT
> > 2012-11-01 22:45:28,837 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > regionserver:60020-0x13ab46479832953 Successfully transitioned node
> > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_
> > REGION_SPLIT to RS_ZK_REGION_SPLIT
> > -----------------------------
> >
> > The "transitioned node from RS_ZK_REGION_SPLIT to RS_ZK_REGION_SPLIT"
> > continues for 15 or so hours and finally settles without manual
> > intervention with these regionserver log messages:
> > -----------------------
> > 2012-11-02 13:55:00,906 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > regionserver:60020-0x13ab46479832953 Attempting to transition node *
> > bc62a8a72124a4ba3f6b9f302587903c* from RS_ZK_REGION_SPLIT to
> > RS_ZK_REGION_SPLIT
> >
> > 2012-11-02 13:55:00,916 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > regionserver:60020-0x13ab46479832953 Successfully transitioned node *
> > bc62a8a72124a4ba3f6b9f302587903c* from RS_ZK_REGION_SPLIT to
> > RS_ZK_REGION_SPLIT
> >
> > 2012-11-02 13:55:00,916 INFO
> > org.apache.hadoop.hbase.regionserver.SplitRequest: Region split, META
> > updated, and report to master.
> >
> >
> Parent=ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351816858693.
> > *bc62a8a72124a4ba3f6b9f302587903c*., new regions:
> >
> >
> ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351824327023.22c3fa48d17aa7312ca53566c680f0fd.,
> >
> >
> ActiveListingRecord16,\x83\x07\xDC\x07\x11WebsiteIDX\x009024215,1351824327023.b0e0a488c711e5c7f74ee6198a9755a2..
> > Split took 15hrs, 9mins, 33sec
> >
> > 2012-11-02 13:55:00,945 INFO
> > org.apache.hadoop.hbase.regionserver.SplitRequest: Region split, META
> > updated, and report to master.
> >
> >
> Parent=ActiveListingRecord16,\x83\x07\xDC\x06\x0EThreeWide\x00SWMRIC-11001540,1351790329631.f720436a6f8fd412d76fe3255f24e3b3.,
> > new regions:
> >
> >
> ActiveListingRecord16,\x83\x07\xDC\x06\x0EThreeWide\x00SWMRIC-11001540,1351816858693.2880cf893175d2a852947c63ee8554a3.,
> > ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351816858693.*
> > bc62a8a72124a4ba3f6b9f302587903c*.. Split took 17hrs, 14mins, 2sec
> > ----------------------
> >
> > And the master finally logs this:
> > -----------------------
> > 2012-11-02 13:55:00,783 WARN
> > org.apache.hadoop.hbase.master.AssignmentManager: Region
> > bc62a8a72124a4ba3f6b9f302587903c not found on server
> > HadoopNode162.hotpads.srv,60020,1351788248279; failed processing
> >
> > 2012-11-02 13:55:00,783 WARN
> > org.apache.hadoop.hbase.master.AssignmentManager: Received SPLIT for
> region
> > bc62a8a72124a4ba3f6b9f302587903c from server
> > HadoopNode162.hotpads.srv,60020,1351788248279 but it doesn't exist
> anymore,
> > probably already processed its split
> > ------------------------
> >
> > I can't find any evidence that the region left the node during that time.
> > I'll have to catch it in action next time and see what the region is up
> to
> > during the problem period.
> >
> > What does it mean that it successfully transitioned from SPLIT to SPLIT?
> > Is that a valid transition?
> >
> > Matt
> >
> >
> >
> > On Sat, Nov 3, 2012 at 2:55 PM, Ted Yu <[email protected]> wrote:
> >
> > > Matt:
> > > From the following we can see that region
> > bc62a8a72124a4ba3f6b9f302587903c
> > > cannot be found:
> > >
> > > 2012-11-02 00:00:02,909 DEBUG
> > > org.apache.hadoop.hbase.master.AssignmentManager: Handling
> > > transition=RS_ZK_REGION_SPLIT,
> > > server=HadoopNode162.hotpads.srv,60020,1351788248279,
> > > region=bc62a8a72124a4ba3f6b9f302587903c
> > > 2012-11-02 00:00:02,909 WARN
> > > org.apache.hadoop.hbase.master.AssignmentManager: Region
> > > bc62a8a72124a4ba3f6b9f302587903c *not found on server
> > > HadoopNode162.hotpads*.srv,60020,1351788248279;failed
> > > processing
> > > 2012-11-02 00:00:02,909 WARN
> > > org.apache.hadoop.hbase.master.AssignmentManager: Received SPLIT for
> > region
> > > bc62a8a72124a4ba3f6b9f302587903c from server
> > > HadoopNode162.hotpads.srv,60020,1351788248279 but it doesn't exist
> > anymore,
> > > probably already processed its split
> > >
> > > Have you run hbck to repair your cluster ?
> > >
> > > Thanks
> > >
> > > On Sat, Nov 3, 2012 at 2:29 PM, Matt Corgan <[email protected]>
> wrote:
> > >
> > > > Here's a sample of the master's logs from yesterday.  It's not
> > correlated
> > > > exactly with the other pastebin log, but there's 3GB of this from
> > > > yesterday: http://pastebin.com/wP2rNN1t
> > > >
> > > > I'm am pushing the cluster a bit with importing data so testing the
> > split
> > > > code harder than normal.  The regions are 500-1GB gzipped.  I can
> look
> > > into
> > > > it more but trying to figure out what to look for.
> > > >
> > > > Thanks Ted,
> > > > Matt
> > > >
> > > >
> > > > On Sat, Nov 3, 2012 at 2:03 PM, Ted Yu <[email protected]> wrote:
> > > >
> > > > > Matt:
> > > > > This is the method which made the logging:
> > > > >   private static int tickleNodeSplit(ZooKeeperWatcher zkw,
> > > > >       HRegionInfo parent, HRegionInfo a, HRegionInfo b, ServerName
> > > > > serverName,
> > > > >       final int znodeVersion)
> > > > >   throws KeeperException, IOException {
> > > > >     byte [] payload = Writables.getBytes(a, b);
> > > > >     return ZKAssign.transitionNode(zkw, parent, serverName,
> > > > >       EventType.RS_ZK_REGION_SPLIT, EventType.RS_ZK_REGION_SPLIT,
> > > > >       znodeVersion, payload);
> > > > >   }
> > > > >
> > > > > transitionZKNode() calls tickleNodeSplit() when waiting for master
> to
> > > > split
> > > > > the region. Obviously something caused the master not able to
> split.
> > > > >
> > > > > How large is the region ?
> > > > >
> > > > > Can you pastebin master log for that period of time ?
> > > > >
> > > > > Thanks
> > > > >
> > > > > On Sat, Nov 3, 2012 at 1:54 PM, Matt Corgan <[email protected]>
> > > wrote:
> > > > >
> > > > > > We upgraded from .94.0 to .94.2 last week and have started to
> > > encounter
> > > > > > infinite loops of region-transition on splits.  I'm not sure yet
> if
> > > > it's
> > > > > > all splits nor if it's related to load.  Solution so far has been
> > to
> > > > > > restart the regionserver process.
> > > > > >
> > > > > > log snippet:
> > > > > > http://pastebin.com/LpienZ7B
> > > > > >
> > > > > > It's repeating these two lines:
> > > > > > 2012-11-02 01:35:33,312 DEBUG
> > > > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > > > > > regionserver:60020-0x13ab46479832b76 Attempting to transition
> node
> > > > > > cf3e9bc069e1888983c06dc8e053ffcf from RS_ZK_REGION_SPLIT to
> > > > > > RS_ZK_REGION_SPLIT
> > > > > > 2012-11-02 01:35:33,364 DEBUG
> > > > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > > > > > regionserver:60020-0x13ab46479832b76 Successfully transitioned
> node
> > > > > > cf3e9bc069e1888983c06dc8e053ffcf from RS_ZK_REGION_SPLIT to
> > > > > > RS_ZK_REGION_SPLIT
> > > > > >
> > > > > > with the occasional:
> > > > > > 2012-11-02 01:35:34,476 DEBUG
> > > > > > org.apache.hadoop.hbase.regionserver.SplitTransaction: Still
> > waiting
> > > on
> > > > > the
> > > > > > master to process the split for cf3e9bc069e1888983c06dc8e053ffcf
> > > > > >
> > > > > > Should the region transition from RS_ZK_REGION_SPLIT to itself?
>  It
> > > > looks
> > > > > > wrong, but I'm not familiar with the code at all.
> > > > > >
> > > > > > Thanks,
> > > > > > Matt
> > > > > >
> > > > >
> > > >
> > >
> >
>

Reply via email to