Forgot to answer Ted's question - I don't remember exactly what we did to the cluster. I don't think there were any regionserver crashes, but there were probably controlled restarts, new servers added to the running cluster, and some regions moved by our manual balancer job.
On Sun, Nov 4, 2012 at 5:16 PM, Matt Corgan <[email protected]> wrote: > I've had a few successful splits today without any errors. I'll turn up > the importer speed tomorrow to start stressing the cluster more. > > > > On Sun, Nov 4, 2012 at 2:07 PM, Ted Yu <[email protected]> wrote: > >> Matt: >> Was there any region server crash before you noticed the repeated log for >> bc62a8a72124a4ba3f6b9f302587903c ? >> >> I wonder if the following JIRA might be related: >> HBASE-7083 SSH#fixupDaughter should force re-assign missing daughter >> >> Thanks >> >> On Sat, Nov 3, 2012 at 7:27 PM, Matt Corgan <[email protected]> wrote: >> >> > Strangely, I don't see any record of that region in the master before >> what >> > I already pasted even though I have logs back to 10/30. Next time it >> > happens I'll gather a full log record and try to debug while it's >> > occurring. >> > >> > >> > On Sat, Nov 3, 2012 at 7:10 PM, rajesh babu chintaguntla < >> > [email protected]> wrote: >> > >> > > Hi Matt, >> > > can you paste some more master logs of region >> > > bc62a8a72124a4ba3f6b9f30258790 before split. >> > > I think Its not problem with splitting. >> > > We are getting >> > > LOG.warn("Region " + encodedName + " not found on server " + >> > > serverName + >> > > "; failed processing"); >> > > this log means no entry in servers map(not fully assigned). >> > > Set<HRegionInfo> hris = this.servers.get(sn); >> > > HRegionInfo foundHri = null; >> > > for (HRegionInfo hri: hris) { >> > > if (hri.getEncodedName().equals(encodedName)) { >> > > foundHri = hri; >> > > break; >> > > } >> > > } >> > > return foundHri; >> > > >> > > >> > > >> > > >> > > On Sun, Nov 4, 2012 at 6:07 AM, lars hofhansl <[email protected]> >> > wrote: >> > > >> > > > CC'ing dev list... >> > > > >> > > > Is anybody aware of any changes that went in recently that could >> cause >> > > > this? >> > > > I looked around a bit, but could not find anything obvious. >> > > > >> > > > -- Lars >> > > > >> > > > >> > > > >> > > > ________________________________ >> > > > From: Matt Corgan <[email protected]> >> > > > To: user <[email protected]> >> > > > Sent: Saturday, November 3, 2012 5:27 PM >> > > > Subject: Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2 >> > > > >> > > > I think the cluster is ok without running hbck, as restarting the >> > > > regionserver process stops the warnings and everything looks ok >> > > otherwise. >> > > > >> > > > here's the regionserver right after the split happens: >> > > > ------------------------ >> > > > 2012-11-01 22:45:28,726 DEBUG >> > org.apache.hadoop.hbase.zookeeper.ZKAssign: >> > > > regionserver:60020-0x13ab46479832953 Attempting to transition node >> > > > bc62a8a72124a4ba3f6b9f302587903c from *RS_ZK_R* >> > > > *EGION_SPLITTING to RS_ZK_REGION_SPLIT* >> > > > 2012-11-01 22:45:28,730 DEBUG >> > org.apache.hadoop.hbase.zookeeper.ZKAssign: >> > > > regionserver:60020-0x13ab46479832953 Successfully transitioned node >> > > > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_ >> > > > REGION_SPLITTING to RS_ZK_REGION_SPLIT >> > > > 2012-11-01 22:45:28,730 DEBUG >> > > > org.apache.hadoop.hbase.regionserver.SplitTransaction: Still >> waiting on >> > > the >> > > > master to process the split for bc62a8a72124a4ba3f6b9f302587903c >> > > > 2012-11-01 22:45:28,832 DEBUG >> > org.apache.hadoop.hbase.zookeeper.ZKAssign: >> > > > regionserver:60020-0x13ab46479832953 Attempting to transition node >> > > > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_R >> > > > EGION_SPLIT to RS_ZK_REGION_SPLIT >> > > > 2012-11-01 22:45:28,837 DEBUG >> > org.apache.hadoop.hbase.zookeeper.ZKAssign: >> > > > regionserver:60020-0x13ab46479832953 Successfully transitioned node >> > > > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_ >> > > > REGION_SPLIT to RS_ZK_REGION_SPLIT >> > > > ----------------------------- >> > > > >> > > > The "transitioned node from RS_ZK_REGION_SPLIT to >> RS_ZK_REGION_SPLIT" >> > > > continues for 15 or so hours and finally settles without manual >> > > > intervention with these regionserver log messages: >> > > > ----------------------- >> > > > 2012-11-02 13:55:00,906 DEBUG >> > org.apache.hadoop.hbase.zookeeper.ZKAssign: >> > > > regionserver:60020-0x13ab46479832953 Attempting to transition node * >> > > > bc62a8a72124a4ba3f6b9f302587903c* from RS_ZK_REGION_SPLIT to >> > > > RS_ZK_REGION_SPLIT >> > > > >> > > > 2012-11-02 13:55:00,916 DEBUG >> > org.apache.hadoop.hbase.zookeeper.ZKAssign: >> > > > regionserver:60020-0x13ab46479832953 Successfully transitioned node >> * >> > > > bc62a8a72124a4ba3f6b9f302587903c* from RS_ZK_REGION_SPLIT to >> > > > RS_ZK_REGION_SPLIT >> > > > >> > > > 2012-11-02 13:55:00,916 INFO >> > > > org.apache.hadoop.hbase.regionserver.SplitRequest: Region split, >> META >> > > > updated, and report to master. >> > > > >> > > > >> > > >> > >> Parent=ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351816858693. >> > > > *bc62a8a72124a4ba3f6b9f302587903c*., new regions: >> > > > >> > > > >> > > >> > >> ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351824327023.22c3fa48d17aa7312ca53566c680f0fd., >> > > > >> > > > >> > > >> > >> ActiveListingRecord16,\x83\x07\xDC\x07\x11WebsiteIDX\x009024215,1351824327023.b0e0a488c711e5c7f74ee6198a9755a2.. >> > > > Split took 15hrs, 9mins, 33sec >> > > > >> > > > 2012-11-02 13:55:00,945 INFO >> > > > org.apache.hadoop.hbase.regionserver.SplitRequest: Region split, >> META >> > > > updated, and report to master. >> > > > >> > > > >> > > >> > >> Parent=ActiveListingRecord16,\x83\x07\xDC\x06\x0EThreeWide\x00SWMRIC-11001540,1351790329631.f720436a6f8fd412d76fe3255f24e3b3., >> > > > new regions: >> > > > >> > > > >> > > >> > >> ActiveListingRecord16,\x83\x07\xDC\x06\x0EThreeWide\x00SWMRIC-11001540,1351816858693.2880cf893175d2a852947c63ee8554a3., >> > > > >> > ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351816858693.* >> > > > bc62a8a72124a4ba3f6b9f302587903c*.. Split took 17hrs, 14mins, 2sec >> > > > ---------------------- >> > > > >> > > > And the master finally logs this: >> > > > ----------------------- >> > > > 2012-11-02 13:55:00,783 WARN >> > > > org.apache.hadoop.hbase.master.AssignmentManager: Region >> > > > bc62a8a72124a4ba3f6b9f302587903c not found on server >> > > > HadoopNode162.hotpads.srv,60020,1351788248279; failed processing >> > > > >> > > > 2012-11-02 13:55:00,783 WARN >> > > > org.apache.hadoop.hbase.master.AssignmentManager: Received SPLIT for >> > > region >> > > > bc62a8a72124a4ba3f6b9f302587903c from server >> > > > HadoopNode162.hotpads.srv,60020,1351788248279 but it doesn't exist >> > > anymore, >> > > > probably already processed its split >> > > > ------------------------ >> > > > >> > > > I can't find any evidence that the region left the node during that >> > time. >> > > > I'll have to catch it in action next time and see what the region >> is up >> > > to >> > > > during the problem period. >> > > > >> > > > What does it mean that it successfully transitioned from SPLIT to >> > SPLIT? >> > > > Is that a valid transition? >> > > > >> > > > Matt >> > > > >> > > > >> > > > >> > > > On Sat, Nov 3, 2012 at 2:55 PM, Ted Yu <[email protected]> wrote: >> > > > >> > > > > Matt: >> > > > > From the following we can see that region >> > > > bc62a8a72124a4ba3f6b9f302587903c >> > > > > cannot be found: >> > > > > >> > > > > 2012-11-02 00:00:02,909 DEBUG >> > > > > org.apache.hadoop.hbase.master.AssignmentManager: Handling >> > > > > transition=RS_ZK_REGION_SPLIT, >> > > > > server=HadoopNode162.hotpads.srv,60020,1351788248279, >> > > > > region=bc62a8a72124a4ba3f6b9f302587903c >> > > > > 2012-11-02 00:00:02,909 WARN >> > > > > org.apache.hadoop.hbase.master.AssignmentManager: Region >> > > > > bc62a8a72124a4ba3f6b9f302587903c *not found on server >> > > > > HadoopNode162.hotpads*.srv,60020,1351788248279;failed >> > > > > processing >> > > > > 2012-11-02 00:00:02,909 WARN >> > > > > org.apache.hadoop.hbase.master.AssignmentManager: Received SPLIT >> for >> > > > region >> > > > > bc62a8a72124a4ba3f6b9f302587903c from server >> > > > > HadoopNode162.hotpads.srv,60020,1351788248279 but it doesn't exist >> > > > anymore, >> > > > > probably already processed its split >> > > > > >> > > > > Have you run hbck to repair your cluster ? >> > > > > >> > > > > Thanks >> > > > > >> > > > > On Sat, Nov 3, 2012 at 2:29 PM, Matt Corgan <[email protected]> >> > > wrote: >> > > > > >> > > > > > Here's a sample of the master's logs from yesterday. It's not >> > > > correlated >> > > > > > exactly with the other pastebin log, but there's 3GB of this >> from >> > > > > > yesterday: http://pastebin.com/wP2rNN1t >> > > > > > >> > > > > > I'm am pushing the cluster a bit with importing data so testing >> the >> > > > split >> > > > > > code harder than normal. The regions are 500-1GB gzipped. I >> can >> > > look >> > > > > into >> > > > > > it more but trying to figure out what to look for. >> > > > > > >> > > > > > Thanks Ted, >> > > > > > Matt >> > > > > > >> > > > > > >> > > > > > On Sat, Nov 3, 2012 at 2:03 PM, Ted Yu <[email protected]> >> > wrote: >> > > > > > >> > > > > > > Matt: >> > > > > > > This is the method which made the logging: >> > > > > > > private static int tickleNodeSplit(ZooKeeperWatcher zkw, >> > > > > > > HRegionInfo parent, HRegionInfo a, HRegionInfo b, >> > ServerName >> > > > > > > serverName, >> > > > > > > final int znodeVersion) >> > > > > > > throws KeeperException, IOException { >> > > > > > > byte [] payload = Writables.getBytes(a, b); >> > > > > > > return ZKAssign.transitionNode(zkw, parent, serverName, >> > > > > > > EventType.RS_ZK_REGION_SPLIT, >> EventType.RS_ZK_REGION_SPLIT, >> > > > > > > znodeVersion, payload); >> > > > > > > } >> > > > > > > >> > > > > > > transitionZKNode() calls tickleNodeSplit() when waiting for >> > master >> > > to >> > > > > > split >> > > > > > > the region. Obviously something caused the master not able to >> > > split. >> > > > > > > >> > > > > > > How large is the region ? >> > > > > > > >> > > > > > > Can you pastebin master log for that period of time ? >> > > > > > > >> > > > > > > Thanks >> > > > > > > >> > > > > > > On Sat, Nov 3, 2012 at 1:54 PM, Matt Corgan < >> [email protected] >> > > >> > > > > wrote: >> > > > > > > >> > > > > > > > We upgraded from .94.0 to .94.2 last week and have started >> to >> > > > > encounter >> > > > > > > > infinite loops of region-transition on splits. I'm not sure >> > yet >> > > if >> > > > > > it's >> > > > > > > > all splits nor if it's related to load. Solution so far has >> > been >> > > > to >> > > > > > > > restart the regionserver process. >> > > > > > > > >> > > > > > > > log snippet: >> > > > > > > > http://pastebin.com/LpienZ7B >> > > > > > > > >> > > > > > > > It's repeating these two lines: >> > > > > > > > 2012-11-02 01:35:33,312 DEBUG >> > > > > > org.apache.hadoop.hbase.zookeeper.ZKAssign: >> > > > > > > > regionserver:60020-0x13ab46479832b76 Attempting to >> transition >> > > node >> > > > > > > > cf3e9bc069e1888983c06dc8e053ffcf from RS_ZK_REGION_SPLIT to >> > > > > > > > RS_ZK_REGION_SPLIT >> > > > > > > > 2012-11-02 01:35:33,364 DEBUG >> > > > > > org.apache.hadoop.hbase.zookeeper.ZKAssign: >> > > > > > > > regionserver:60020-0x13ab46479832b76 Successfully >> transitioned >> > > node >> > > > > > > > cf3e9bc069e1888983c06dc8e053ffcf from RS_ZK_REGION_SPLIT to >> > > > > > > > RS_ZK_REGION_SPLIT >> > > > > > > > >> > > > > > > > with the occasional: >> > > > > > > > 2012-11-02 01:35:34,476 DEBUG >> > > > > > > > org.apache.hadoop.hbase.regionserver.SplitTransaction: Still >> > > > waiting >> > > > > on >> > > > > > > the >> > > > > > > > master to process the split for >> > cf3e9bc069e1888983c06dc8e053ffcf >> > > > > > > > >> > > > > > > > Should the region transition from RS_ZK_REGION_SPLIT to >> itself? >> > > It >> > > > > > looks >> > > > > > > > wrong, but I'm not familiar with the code at all. >> > > > > > > > >> > > > > > > > Thanks, >> > > > > > > > Matt >> > > > > > > > >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > >> > >> > >
