Wow, and can you figure how this happened? 2012-11-05 00:24:54,538 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Instantiated ActiveListingRecord16,\x86\x07\xDC\x03\x17RealtyTrac\x0044737383,1352093084141.22f8fa73d8af837410ca270f344f6bb8.
Split? Else how did the master told that RS to open a region that doesn't exist? On Mon, Nov 5, 2012 at 2:02 PM, Matt Corgan <[email protected]> wrote: > Here's one from last night with the master logs at the bottom. > http://pastebin.com/cSdMbA2a > > I don't see that region in the master logs for 4 days back. I'm going to > import some new data from scratch soon and will be sure to keep all the > master logs. > > > On Mon, Nov 5, 2012 at 9:52 AM, Jean-Daniel Cryans <[email protected]>wrote: > >> This reminds me a lot of https://issues.apache.org/jira/browse/HBASE-4792 >> >> What I'd like to see is the log from the first time the master >> receives the split message. I guess it says the region doesn't exist >> anymore because the split was processed already in the master but >> there's a failure mode similar to 4792. >> >> I saw this on another cluster last week but the logs were rolling >> based on size so the original split message was lost. >> >> J-D >> >> On Sun, Nov 4, 2012 at 5:16 PM, Matt Corgan <[email protected]> wrote: >> > I've had a few successful splits today without any errors. I'll turn up >> > the importer speed tomorrow to start stressing the cluster more. >> > >> > >> > On Sun, Nov 4, 2012 at 2:07 PM, Ted Yu <[email protected]> wrote: >> > >> >> Matt: >> >> Was there any region server crash before you noticed the repeated log >> for >> >> bc62a8a72124a4ba3f6b9f302587903c ? >> >> >> >> I wonder if the following JIRA might be related: >> >> HBASE-7083 SSH#fixupDaughter should force re-assign missing daughter >> >> >> >> Thanks >> >> >> >> On Sat, Nov 3, 2012 at 7:27 PM, Matt Corgan <[email protected]> >> wrote: >> >> >> >> > Strangely, I don't see any record of that region in the master before >> >> what >> >> > I already pasted even though I have logs back to 10/30. Next time it >> >> > happens I'll gather a full log record and try to debug while it's >> >> > occurring. >> >> > >> >> > >> >> > On Sat, Nov 3, 2012 at 7:10 PM, rajesh babu chintaguntla < >> >> > [email protected]> wrote: >> >> > >> >> > > Hi Matt, >> >> > > can you paste some more master logs of region >> >> > > bc62a8a72124a4ba3f6b9f30258790 before split. >> >> > > I think Its not problem with splitting. >> >> > > We are getting >> >> > > LOG.warn("Region " + encodedName + " not found on server " + >> >> > > serverName + >> >> > > "; failed processing"); >> >> > > this log means no entry in servers map(not fully assigned). >> >> > > Set<HRegionInfo> hris = this.servers.get(sn); >> >> > > HRegionInfo foundHri = null; >> >> > > for (HRegionInfo hri: hris) { >> >> > > if (hri.getEncodedName().equals(encodedName)) { >> >> > > foundHri = hri; >> >> > > break; >> >> > > } >> >> > > } >> >> > > return foundHri; >> >> > > >> >> > > >> >> > > >> >> > > >> >> > > On Sun, Nov 4, 2012 at 6:07 AM, lars hofhansl <[email protected]> >> >> > wrote: >> >> > > >> >> > > > CC'ing dev list... >> >> > > > >> >> > > > Is anybody aware of any changes that went in recently that could >> >> cause >> >> > > > this? >> >> > > > I looked around a bit, but could not find anything obvious. >> >> > > > >> >> > > > -- Lars >> >> > > > >> >> > > > >> >> > > > >> >> > > > ________________________________ >> >> > > > From: Matt Corgan <[email protected]> >> >> > > > To: user <[email protected]> >> >> > > > Sent: Saturday, November 3, 2012 5:27 PM >> >> > > > Subject: Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2 >> >> > > > >> >> > > > I think the cluster is ok without running hbck, as restarting the >> >> > > > regionserver process stops the warnings and everything looks ok >> >> > > otherwise. >> >> > > > >> >> > > > here's the regionserver right after the split happens: >> >> > > > ------------------------ >> >> > > > 2012-11-01 22:45:28,726 DEBUG >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign: >> >> > > > regionserver:60020-0x13ab46479832953 Attempting to transition node >> >> > > > bc62a8a72124a4ba3f6b9f302587903c from *RS_ZK_R* >> >> > > > *EGION_SPLITTING to RS_ZK_REGION_SPLIT* >> >> > > > 2012-11-01 22:45:28,730 DEBUG >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign: >> >> > > > regionserver:60020-0x13ab46479832953 Successfully transitioned >> node >> >> > > > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_ >> >> > > > REGION_SPLITTING to RS_ZK_REGION_SPLIT >> >> > > > 2012-11-01 22:45:28,730 DEBUG >> >> > > > org.apache.hadoop.hbase.regionserver.SplitTransaction: Still >> waiting >> >> on >> >> > > the >> >> > > > master to process the split for bc62a8a72124a4ba3f6b9f302587903c >> >> > > > 2012-11-01 22:45:28,832 DEBUG >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign: >> >> > > > regionserver:60020-0x13ab46479832953 Attempting to transition node >> >> > > > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_R >> >> > > > EGION_SPLIT to RS_ZK_REGION_SPLIT >> >> > > > 2012-11-01 22:45:28,837 DEBUG >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign: >> >> > > > regionserver:60020-0x13ab46479832953 Successfully transitioned >> node >> >> > > > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_ >> >> > > > REGION_SPLIT to RS_ZK_REGION_SPLIT >> >> > > > ----------------------------- >> >> > > > >> >> > > > The "transitioned node from RS_ZK_REGION_SPLIT to >> RS_ZK_REGION_SPLIT" >> >> > > > continues for 15 or so hours and finally settles without manual >> >> > > > intervention with these regionserver log messages: >> >> > > > ----------------------- >> >> > > > 2012-11-02 13:55:00,906 DEBUG >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign: >> >> > > > regionserver:60020-0x13ab46479832953 Attempting to transition >> node * >> >> > > > bc62a8a72124a4ba3f6b9f302587903c* from RS_ZK_REGION_SPLIT to >> >> > > > RS_ZK_REGION_SPLIT >> >> > > > >> >> > > > 2012-11-02 13:55:00,916 DEBUG >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign: >> >> > > > regionserver:60020-0x13ab46479832953 Successfully transitioned >> node * >> >> > > > bc62a8a72124a4ba3f6b9f302587903c* from RS_ZK_REGION_SPLIT to >> >> > > > RS_ZK_REGION_SPLIT >> >> > > > >> >> > > > 2012-11-02 13:55:00,916 INFO >> >> > > > org.apache.hadoop.hbase.regionserver.SplitRequest: Region split, >> META >> >> > > > updated, and report to master. >> >> > > > >> >> > > > >> >> > > >> >> > >> >> >> Parent=ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351816858693. >> >> > > > *bc62a8a72124a4ba3f6b9f302587903c*., new regions: >> >> > > > >> >> > > > >> >> > > >> >> > >> >> >> ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351824327023.22c3fa48d17aa7312ca53566c680f0fd., >> >> > > > >> >> > > > >> >> > > >> >> > >> >> >> ActiveListingRecord16,\x83\x07\xDC\x07\x11WebsiteIDX\x009024215,1351824327023.b0e0a488c711e5c7f74ee6198a9755a2.. >> >> > > > Split took 15hrs, 9mins, 33sec >> >> > > > >> >> > > > 2012-11-02 13:55:00,945 INFO >> >> > > > org.apache.hadoop.hbase.regionserver.SplitRequest: Region split, >> META >> >> > > > updated, and report to master. >> >> > > > >> >> > > > >> >> > > >> >> > >> >> >> Parent=ActiveListingRecord16,\x83\x07\xDC\x06\x0EThreeWide\x00SWMRIC-11001540,1351790329631.f720436a6f8fd412d76fe3255f24e3b3., >> >> > > > new regions: >> >> > > > >> >> > > > >> >> > > >> >> > >> >> >> ActiveListingRecord16,\x83\x07\xDC\x06\x0EThreeWide\x00SWMRIC-11001540,1351816858693.2880cf893175d2a852947c63ee8554a3., >> >> > > > >> >> > >> ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351816858693.* >> >> > > > bc62a8a72124a4ba3f6b9f302587903c*.. Split took 17hrs, 14mins, 2sec >> >> > > > ---------------------- >> >> > > > >> >> > > > And the master finally logs this: >> >> > > > ----------------------- >> >> > > > 2012-11-02 13:55:00,783 WARN >> >> > > > org.apache.hadoop.hbase.master.AssignmentManager: Region >> >> > > > bc62a8a72124a4ba3f6b9f302587903c not found on server >> >> > > > HadoopNode162.hotpads.srv,60020,1351788248279; failed processing >> >> > > > >> >> > > > 2012-11-02 13:55:00,783 WARN >> >> > > > org.apache.hadoop.hbase.master.AssignmentManager: Received SPLIT >> for >> >> > > region >> >> > > > bc62a8a72124a4ba3f6b9f302587903c from server >> >> > > > HadoopNode162.hotpads.srv,60020,1351788248279 but it doesn't exist >> >> > > anymore, >> >> > > > probably already processed its split >> >> > > > ------------------------ >> >> > > > >> >> > > > I can't find any evidence that the region left the node during >> that >> >> > time. >> >> > > > I'll have to catch it in action next time and see what the region >> is >> >> up >> >> > > to >> >> > > > during the problem period. >> >> > > > >> >> > > > What does it mean that it successfully transitioned from SPLIT to >> >> > SPLIT? >> >> > > > Is that a valid transition? >> >> > > > >> >> > > > Matt >> >> > > > >> >> > > > >> >> > > > >> >> > > > On Sat, Nov 3, 2012 at 2:55 PM, Ted Yu <[email protected]> >> wrote: >> >> > > > >> >> > > > > Matt: >> >> > > > > From the following we can see that region >> >> > > > bc62a8a72124a4ba3f6b9f302587903c >> >> > > > > cannot be found: >> >> > > > > >> >> > > > > 2012-11-02 00:00:02,909 DEBUG >> >> > > > > org.apache.hadoop.hbase.master.AssignmentManager: Handling >> >> > > > > transition=RS_ZK_REGION_SPLIT, >> >> > > > > server=HadoopNode162.hotpads.srv,60020,1351788248279, >> >> > > > > region=bc62a8a72124a4ba3f6b9f302587903c >> >> > > > > 2012-11-02 00:00:02,909 WARN >> >> > > > > org.apache.hadoop.hbase.master.AssignmentManager: Region >> >> > > > > bc62a8a72124a4ba3f6b9f302587903c *not found on server >> >> > > > > HadoopNode162.hotpads*.srv,60020,1351788248279;failed >> >> > > > > processing >> >> > > > > 2012-11-02 00:00:02,909 WARN >> >> > > > > org.apache.hadoop.hbase.master.AssignmentManager: Received SPLIT >> >> for >> >> > > > region >> >> > > > > bc62a8a72124a4ba3f6b9f302587903c from server >> >> > > > > HadoopNode162.hotpads.srv,60020,1351788248279 but it doesn't >> exist >> >> > > > anymore, >> >> > > > > probably already processed its split >> >> > > > > >> >> > > > > Have you run hbck to repair your cluster ? >> >> > > > > >> >> > > > > Thanks >> >> > > > > >> >> > > > > On Sat, Nov 3, 2012 at 2:29 PM, Matt Corgan < >> [email protected]> >> >> > > wrote: >> >> > > > > >> >> > > > > > Here's a sample of the master's logs from yesterday. It's not >> >> > > > correlated >> >> > > > > > exactly with the other pastebin log, but there's 3GB of this >> from >> >> > > > > > yesterday: http://pastebin.com/wP2rNN1t >> >> > > > > > >> >> > > > > > I'm am pushing the cluster a bit with importing data so >> testing >> >> the >> >> > > > split >> >> > > > > > code harder than normal. The regions are 500-1GB gzipped. I >> can >> >> > > look >> >> > > > > into >> >> > > > > > it more but trying to figure out what to look for. >> >> > > > > > >> >> > > > > > Thanks Ted, >> >> > > > > > Matt >> >> > > > > > >> >> > > > > > >> >> > > > > > On Sat, Nov 3, 2012 at 2:03 PM, Ted Yu <[email protected]> >> >> > wrote: >> >> > > > > > >> >> > > > > > > Matt: >> >> > > > > > > This is the method which made the logging: >> >> > > > > > > private static int tickleNodeSplit(ZooKeeperWatcher zkw, >> >> > > > > > > HRegionInfo parent, HRegionInfo a, HRegionInfo b, >> >> > ServerName >> >> > > > > > > serverName, >> >> > > > > > > final int znodeVersion) >> >> > > > > > > throws KeeperException, IOException { >> >> > > > > > > byte [] payload = Writables.getBytes(a, b); >> >> > > > > > > return ZKAssign.transitionNode(zkw, parent, serverName, >> >> > > > > > > EventType.RS_ZK_REGION_SPLIT, >> >> EventType.RS_ZK_REGION_SPLIT, >> >> > > > > > > znodeVersion, payload); >> >> > > > > > > } >> >> > > > > > > >> >> > > > > > > transitionZKNode() calls tickleNodeSplit() when waiting for >> >> > master >> >> > > to >> >> > > > > > split >> >> > > > > > > the region. Obviously something caused the master not able >> to >> >> > > split. >> >> > > > > > > >> >> > > > > > > How large is the region ? >> >> > > > > > > >> >> > > > > > > Can you pastebin master log for that period of time ? >> >> > > > > > > >> >> > > > > > > Thanks >> >> > > > > > > >> >> > > > > > > On Sat, Nov 3, 2012 at 1:54 PM, Matt Corgan < >> >> [email protected] >> >> > > >> >> > > > > wrote: >> >> > > > > > > >> >> > > > > > > > We upgraded from .94.0 to .94.2 last week and have >> started to >> >> > > > > encounter >> >> > > > > > > > infinite loops of region-transition on splits. I'm not >> sure >> >> > yet >> >> > > if >> >> > > > > > it's >> >> > > > > > > > all splits nor if it's related to load. Solution so far >> has >> >> > been >> >> > > > to >> >> > > > > > > > restart the regionserver process. >> >> > > > > > > > >> >> > > > > > > > log snippet: >> >> > > > > > > > http://pastebin.com/LpienZ7B >> >> > > > > > > > >> >> > > > > > > > It's repeating these two lines: >> >> > > > > > > > 2012-11-02 01:35:33,312 DEBUG >> >> > > > > > org.apache.hadoop.hbase.zookeeper.ZKAssign: >> >> > > > > > > > regionserver:60020-0x13ab46479832b76 Attempting to >> transition >> >> > > node >> >> > > > > > > > cf3e9bc069e1888983c06dc8e053ffcf from RS_ZK_REGION_SPLIT >> to >> >> > > > > > > > RS_ZK_REGION_SPLIT >> >> > > > > > > > 2012-11-02 01:35:33,364 DEBUG >> >> > > > > > org.apache.hadoop.hbase.zookeeper.ZKAssign: >> >> > > > > > > > regionserver:60020-0x13ab46479832b76 Successfully >> >> transitioned >> >> > > node >> >> > > > > > > > cf3e9bc069e1888983c06dc8e053ffcf from RS_ZK_REGION_SPLIT >> to >> >> > > > > > > > RS_ZK_REGION_SPLIT >> >> > > > > > > > >> >> > > > > > > > with the occasional: >> >> > > > > > > > 2012-11-02 01:35:34,476 DEBUG >> >> > > > > > > > org.apache.hadoop.hbase.regionserver.SplitTransaction: >> Still >> >> > > > waiting >> >> > > > > on >> >> > > > > > > the >> >> > > > > > > > master to process the split for >> >> > cf3e9bc069e1888983c06dc8e053ffcf >> >> > > > > > > > >> >> > > > > > > > Should the region transition from RS_ZK_REGION_SPLIT to >> >> itself? >> >> > > It >> >> > > > > > looks >> >> > > > > > > > wrong, but I'm not familiar with the code at all. >> >> > > > > > > > >> >> > > > > > > > Thanks, >> >> > > > > > > > Matt >> >> > > > > > > > >> >> > > > > > > >> >> > > > > > >> >> > > > > >> >> > > > >> >> > > >> >> > >> >> >>
