[ https://issues.apache.org/jira/browse/HBASE-4511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13123377#comment-13123377 ]
ramkrishna.s.vasudevan commented on HBASE-4511:
-----------------------------------------------

@Gao This problem occurred in a test case. Can we reproduce it in a real environment? It would be great if we could reproduce it, so that we are clear about the actual problem.

> There is data loss when master failovers
> ----------------------------------------
>
> Key: HBASE-4511
> URL: https://issues.apache.org/jira/browse/HBASE-4511
> Project: HBase
> Issue Type: Bug
> Components: master
> Affects Versions: 0.92.0
> Reporter: gaojinchao
> Priority: Critical
> Fix For: 0.92.0
>
> Attachments: org.apache.hadoop.hbase.master.TestMasterFailover-output.rar
>
>
> It goes like this:
> The master crashed, and at the same time the RS hosting .META. is crashing, but the RS doesn't exit.
> The master starts up again and finds all living RSs.
> The master fails to verify .META., because this RS is crashing.
> The master reassigns .META., but it doesn't split the HLog.
> So some .META. data is lost.
> Here are the logs from the failed failover test case.
> //It says that we want to kill an RS:
> 2011-09-28 19:54:45,694 INFO [Thread-988] regionserver.HRegionServer(1443): STOPPED: Killing for unit test
> 2011-09-28 19:54:45,694 INFO [Thread-988] master.TestMasterFailover(1007): RS 192.168.2.102,54385,1317264874629 killed
> //The RS didn't crash.
> 2011-09-28 19:54:51,763 INFO [Master:0;192.168.2.102,54557,1317264885720] master.HMaster(458): Registering server found up in zk: 192.168.2.102,54385,1317264874629
> 2011-09-28 19:54:51,763 INFO [Master:0;192.168.2.102,54557,1317264885720] master.ServerManager(232): Registering server=192.168.2.102,54385,1317264874629
> 2011-09-28 19:54:51,770 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(491): master:54557-0x132b31adbb30005 Unable to get data of znode /hbase/unassigned/1028785192 because node does not exist (not an error)
> 2011-09-28 19:54:51,771 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
> //.META. verification failed and .META. was reassigned, so all the regions recorded in .META. are lost.
> 2011-09-28 19:54:51,773 INFO [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
> 2011-09-28 19:54:51,773 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
> 2011-09-28 19:54:52,274 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
> 2011-09-28 19:54:52,277 INFO [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
> 2011-09-28 19:54:52,277 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
> 2011-09-28 19:54:52,778 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
> 2011-09-28 19:54:52,782 INFO [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
> 2011-09-28 19:54:52,782 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
> 2011-09-28 19:54:52,782 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKAssign(264): master:54557-0x132b31adbb30005 Creating (or updating) unassigned node for 1028785192 with OFFLINE state
> 2011-09-28 19:54:52,825 DEBUG [Thread-988-EventThread] zookeeper.ZooKeeperWatcher(233): master:54557-0x132b31adbb30005 Received ZooKeeper Event, type=NodeCreated, state=SyncConnected, path=/hbase/unassigned/1028785192
> //It says that the master treats this as a clean cluster startup.
> 2011-09-28 19:54:52,889 INFO [Master:0;192.168.2.102,54557,1317264885720] master.AssignmentManager(383): Clean cluster startup. Assigning userregions
> 2011-09-28 19:54:52,889 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKAssign(494): master:54557-0x132b31adbb30005 Deleting any existing unassigned nodes
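The ordering problem in the report can be modeled as a small Java sketch. In the sketch, the restarted master fails to verify .META. on a server that is stopping but still registered, and then reassigns .META. without first splitting that server's HLog. The `Cluster` interface and its `verifyMeta`, `splitLog` and `assignMeta` methods are hypothetical stand-ins, not HBase APIs. The "safe" variant only illustrates the ordering the report implies (split the old host's HLog before reassigning .META.). It is not the actual HBASE-4511 fix.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical model of the failover step described in HBASE-4511.
 * None of these types are HBase APIs; they only illustrate call ordering.
 */
public class MetaFailoverSketch {

    /** Hypothetical view of the cluster as seen by the restarted master. */
    interface Cluster {
        boolean verifyMeta(String server); // false if the server can't serve .META.
        void splitLog(String server);      // recover edits from the server's HLog
        void assignMeta();                 // open .META. on some live server
    }

    /** Reported behavior: reassign .META. without splitting the old host's HLog. */
    static void recoverMetaBuggy(Cluster c, String metaServer) {
        if (!c.verifyMeta(metaServer)) {
            c.assignMeta(); // unflushed .META. edits in the old HLog are never replayed
        }
    }

    /** Implied safer ordering: split the old host's HLog, then reassign .META. */
    static void recoverMetaSafe(Cluster c, String metaServer) {
        if (!c.verifyMeta(metaServer)) {
            c.splitLog(metaServer);
            c.assignMeta();
        }
    }

    /** Test double that records calls; verification always fails, like the stopping RS in the logs. */
    static class RecordingCluster implements Cluster {
        final List<String> calls = new ArrayList<>();
        public boolean verifyMeta(String server) { calls.add("verify " + server); return false; }
        public void splitLog(String server) { calls.add("split " + server); }
        public void assignMeta() { calls.add("assign"); }
    }

    public static void main(String[] args) {
        String rs = "192.168.2.102,54385,1317264874629";
        RecordingCluster buggy = new RecordingCluster();
        recoverMetaBuggy(buggy, rs);
        System.out.println(buggy.calls); // no "split" entry: the HLog is skipped
        RecordingCluster safe = new RecordingCluster();
        recoverMetaSafe(safe, rs);
        System.out.println(safe.calls);
    }
}
```

Running `main` prints the recorded call order for each variant. Only the safe variant calls `splitLog` before `assignMeta`.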