[ https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13190894#comment-13190894 ]
stack commented on HBASE-5179:
------------------------------

Here is some review of v17.

logDirExists should be called getLogDir or getLogDirIfExists?

Why is the below done up in ServerManager rather than down in ServerShutdownHandler? I'd think it'd make more sense therein? Perhaps even inside MetaServerShutdownHandler or whatever it's called? Not a biggie, just asking.

Can we say more about what these are: "+ * @param onlineServers onlined servers when master starts". Are these servers that have checked in between master start and the call to processFailover? Could other servers come in between the making of the list we pass into processFailover and the running of the processFailover code? Should this be a 'live' list? Or, I see that we are actually getting rid of the 'live' list of online servers, replacing it with this static one in these lines:

{code}
-    } else if (!serverManager.isServerOnline(regionLocation.getServerName())) {
+    } else if (!onlineServers.contains(regionLocation.getServerName())) {
{code}

Why do we do this?

Could this block of code be done out in a nice coherent method rather than inline here in HMaster:

{code}
+    // Check zk for regionservers that are up but didn't register
+    for (String sn : this.regionServerTracker.getOnlineServerNames()) {
+      if (!this.serverManager.isServerOnline(sn)) {
+        HServerInfo serverInfo = HServerInfo.getServerInfo(sn);
+        if (serverInfo != null) {
+          HServerInfo existingServer = serverManager
+              .haveServerWithSameHostAndPortAlready(serverInfo.getHostnamePort());
+          if (existingServer == null) {
+            // Not registered; add it.
+            LOG.info("Registering server found up in zk but who has not yet "
+                + "reported in: " + sn);
+            // We set serverLoad with one region, it could differentiate with
+            // regionserver which is started just now
+            HServerLoad serverLoad = new HServerLoad();
+            serverLoad.setNumberOfRegions(1);
+            serverInfo.setLoad(serverLoad);
+            this.serverManager.recordNewServer(serverInfo, true, null);
+          }
+        } else {
+          LOG.warn("Server " + sn
+              + " found up in zk, but is not a correct server name");
+        }
+      }
+    }
{code}

Can we say more about why we are doing this? And how do we know that it did not just start? Because if it has more than 0 regions, it must have been already running?

{code}
+    if (rootServerLoad != null && rootServerLoad.getNumberOfRegions() > 0) {
+      // If rootServer is online && not start just now, we expire it
+      this.serverManager.expireServer(rootServerInfo);
+    }
{code}

It looks like the processing of the meta server duplicates code from the processing of the root server. Can we have the duplicated code out in a method instead? Is that possible? Then pass in args for whether root or meta to process?

Regarding the hard-coded wait of 1000ms when looking for the dir to go away, see '13.4.1.2.2. Sleeps in tests' in http://hbase.apache.org/book.html#hbase.tests; it seems like we should wait less than 1000ms going by the reference guide.

DeadServer class looks much better.

I can't write a test for a patch on 0.90, and this v17 won't work for trunk at all. The trunk patch will be very different, which is a problem because then I have no confidence my trunk unit test applies to this patch that is to be applied on the 0.90 branch.
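To make the extract-a-method suggestion concrete, here is a hedged sketch of what pulling the zk check out of HMaster into one coherent method might look like. The types below (SketchServerManager, registerServersUpInZk) are simplified stand-ins invented for illustration, not the real 0.90 ServerManager/RegionServerTracker API; the real method would also parse an HServerInfo from the name, check haveServerWithSameHostAndPortAlready, and seed the one-region HServerLoad before recording.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Simplified stand-in for HBase's ServerManager (hypothetical, for illustration only).
class SketchServerManager {
    final Set<String> online = new HashSet<>();
    final List<String> recorded = new ArrayList<>();
    boolean isServerOnline(String sn) { return online.contains(sn); }
    void recordNewServer(String sn) { recorded.add(sn); }
}

public class ZkCheckSketch {
    /**
     * Registers servers found up in zk that have not yet reported in to the
     * master, pulled out of HMaster into its own method as the review suggests.
     */
    static void registerServersUpInZk(Iterable<String> zkServerNames,
                                      SketchServerManager serverManager) {
        for (String sn : zkServerNames) {
            if (!serverManager.isServerOnline(sn)) {
                // Real patch: build HServerInfo, set a one-region HServerLoad
                // to distinguish it from a freshly started regionserver, then record.
                serverManager.recordNewServer(sn);
            }
        }
    }

    public static void main(String[] args) {
        SketchServerManager sm = new SketchServerManager();
        sm.online.add("rs-a,60020,1");
        registerServersUpInZk(Arrays.asList("rs-a,60020,1", "rs-b,60020,2"), sm);
        System.out.println(sm.recorded); // only the server that had not reported in
    }
}
```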
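Likewise, the suggested dedup of the root/meta handling could be a single method parameterized by which catalog server is being processed. Again a sketch under stated assumptions: Server is a made-up stand-in for HServerInfo plus its HServerLoad, and expireIfAlreadyRunning is a hypothetical name, not the real patch.

```java
import java.util.ArrayList;
import java.util.List;

public class CatalogExpireSketch {
    // Hypothetical stand-in for HServerInfo carrying its HServerLoad region count.
    static class Server {
        final String name;
        final int numberOfRegions;
        Server(String name, int numberOfRegions) {
            this.name = name;
            this.numberOfRegions = numberOfRegions;
        }
    }

    static final List<String> expired = new ArrayList<>();

    /**
     * Shared logic for both root and meta: if the catalog server is online and
     * carrying regions (i.e. it did not just start), expire it so its logs are
     * split before its regions are reassigned.
     */
    static void expireIfAlreadyRunning(Server server, String which) {
        if (server != null && server.numberOfRegions > 0) {
            expired.add(which + ":" + server.name);
        }
    }

    public static void main(String[] args) {
        expireIfAlreadyRunning(new Server("rs-a,60020,1", 3), "root"); // expired
        expireIfAlreadyRunning(new Server("rs-b,60020,2", 0), "meta"); // just started, skipped
        System.out.println(expired);
    }
}
```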
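On the 1000ms sleep: the reference guide's advice amounts to polling the condition on a short interval under an overall timeout, rather than one long fixed sleep. A generic helper along those lines might look like this (a sketch, not an HBase utility; it assumes Java 8+ for BooleanSupplier):

```java
import java.util.function.BooleanSupplier;

public class WaitForSketch {
    /**
     * Polls condition every intervalMs until it is true or timeoutMs elapses.
     * Returns true if the condition became true within the timeout.
     */
    static boolean waitFor(long timeoutMs, long intervalMs, BooleanSupplier condition)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (System.currentTimeMillis() < deadline) {
            if (condition.getAsBoolean()) {
                return true;
            }
            Thread.sleep(intervalMs); // short sleep, e.g. 10ms, rather than 1000ms
        }
        return condition.getAsBoolean();
    }

    public static void main(String[] args) throws InterruptedException {
        long start = System.nanoTime();
        // e.g. waiting for a log dir to go away: poll every 10ms, give up after 5s
        boolean gone = waitFor(5000, 10, () -> System.nanoTime() - start > 50_000_000L);
        System.out.println(gone);
    }
}
```

The test returns as soon as the condition flips, so the common case costs one short interval instead of a full second per check.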
> Concurrent processing of processFaileOver and ServerShutdownHandler may cause
> region to be assigned before log splitting is completed, causing data loss
> --------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-5179
>                 URL: https://issues.apache.org/jira/browse/HBASE-5179
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.2
>            Reporter: chunhui shen
>            Assignee: chunhui shen
>            Priority: Critical
>             Fix For: 0.92.0, 0.94.0, 0.90.6
>
>         Attachments: 5179-90.txt, 5179-90v10.patch, 5179-90v11.patch,
> 5179-90v12.patch, 5179-90v13.txt, 5179-90v14.patch, 5179-90v15.patch,
> 5179-90v16.patch, 5179-90v17.txt, 5179-90v2.patch, 5179-90v3.patch,
> 5179-90v4.patch, 5179-90v5.patch, 5179-90v6.patch, 5179-90v7.patch,
> 5179-90v8.patch, 5179-90v9.patch, 5179-92v17.patch, 5179-v11-92.txt,
> 5179-v11.txt, 5179-v2.txt, 5179-v3.txt, 5179-v4.txt, Errorlog,
> hbase-5179.patch, hbase-5179v10.patch, hbase-5179v12.patch,
> hbase-5179v17.patch, hbase-5179v5.patch, hbase-5179v6.patch,
> hbase-5179v7.patch, hbase-5179v8.patch, hbase-5179v9.patch
>
>
> If the master's failover processing and ServerShutdownHandler's processing
> happen concurrently, the following case may arise:
> 1. The master completes splitLogAfterStartup().
> 2. RegionserverA restarts, and ServerShutdownHandler starts processing it.
> 3. The master starts rebuildUserRegions, and RegionserverA is considered a
> dead server.
> 4. The master starts to assign RegionserverA's regions because it is a dead
> server per step 3.
> However, when doing step 4 (assigning regions), ServerShutdownHandler may
> still be splitting the log; therefore, it may cause data loss.

--
This message is automatically generated by JIRA.