[ https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13190894#comment-13190894 ]
stack commented on HBASE-5179:
------------------------------

Here is some review of v17.

logDirExists should be called getLogDir or getLogDirIfExists?

Why is the below done up in ServerManager rather than down in ServerShutdownHandler? I'd think it'd make more sense therein? Perhaps even inside MetaServerShutdownHandler or whatever it's called? Not a biggie, just asking.

Can we say more about what these are: "+ * @param onlineServers onlined servers when master starts". Are these servers that have checked in between master start and the call to processFailover? Could other servers come in between the making of the list we pass into processFailover and the running of the processFailover code? Should this be a 'live' list? Or, I see that we are actually getting rid of the 'live' list of online servers, replacing it with this static one in these lines:

{code}
-    } else if (!serverManager.isServerOnline(regionLocation.getServerName())) {
+    } else if (!onlineServers.contains(regionLocation.getServerName())) {
{code}

Why do we do this?

Could this block of code be done out in a nice coherent method rather than inline here in HMaster:

{code}
+    // Check zk for regionservers that are up but didn't register
+    for (String sn : this.regionServerTracker.getOnlineServerNames()) {
+      if (!this.serverManager.isServerOnline(sn)) {
+        HServerInfo serverInfo = HServerInfo.getServerInfo(sn);
+        if (serverInfo != null) {
+          HServerInfo existingServer = serverManager
+              .haveServerWithSameHostAndPortAlready(serverInfo.getHostnamePort());
+          if (existingServer == null) {
+            // Not registered; add it.
+            LOG.info("Registering server found up in zk but who has not yet "
+                + "reported in: " + sn);
+            // We set serverLoad with one region, it could differentiate with
+            // regionserver which is started just now
+            HServerLoad serverLoad = new HServerLoad();
+            serverLoad.setNumberOfRegions(1);
+            serverInfo.setLoad(serverLoad);
+            this.serverManager.recordNewServer(serverInfo, true, null);
+          }
+        } else {
+          LOG.warn("Server " + sn
+              + " found up in zk, but is not a correct server name");
+        }
+      }
+    }
{code}

Can we say more about why we are doing this? And how do we know that it did not just start? Because if it has more than 0 regions, it must have been already running?

{code}
+    if (rootServerLoad != null && rootServerLoad.getNumberOfRegions() > 0) {
+      // If rootServer is online && not start just now, we expire it
+      this.serverManager.expireServer(rootServerInfo);
+    }
{code}

It looks like the processing of the meta server duplicates code from the processing of the root server. Can we have the duplicated code out in a method instead? Is that possible? Then pass in args for whether root or meta to process?

Regarding the hard-coded wait of 1000ms when looking for the dir to go away, see '13.4.1.2.2. Sleeps in tests' in http://hbase.apache.org/book.html#hbase.tests; it seems like we should wait less than 1000ms going by the reference guide.

DeadServer class looks much better.

I can't write a test for a patch on 0.90, and this v17 won't work for trunk at all. The trunk patch will be very different, which is a problem because then I have no confidence my trunk unit test applies to this patch that is to be applied on the 0.90 branch.
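To make the extract-a-method suggestion concrete, here is a hedged sketch of what pulling the zk check out of HMaster into one coherent method might look like. The types below (SketchServerManager, registerServersUpInZk) are simplified stand-ins invented for illustration, not the real 0.90 ServerManager/RegionServerTracker API; the real method would also parse an HServerInfo from the name, check haveServerWithSameHostAndPortAlready, and seed the one-region HServerLoad before recording.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Simplified stand-in for HBase's ServerManager (hypothetical, for illustration only).
class SketchServerManager {
    final Set<String> online = new HashSet<>();
    final List<String> recorded = new ArrayList<>();
    boolean isServerOnline(String sn) { return online.contains(sn); }
    void recordNewServer(String sn) { recorded.add(sn); }
}

public class ZkCheckSketch {
    /**
     * Registers servers found up in zk that have not yet reported in to the
     * master, pulled out of HMaster into its own method as the review suggests.
     */
    static void registerServersUpInZk(Iterable<String> zkServerNames,
                                      SketchServerManager serverManager) {
        for (String sn : zkServerNames) {
            if (!serverManager.isServerOnline(sn)) {
                // Real patch: build HServerInfo, set a one-region HServerLoad
                // to distinguish it from a freshly started regionserver, then record.
                serverManager.recordNewServer(sn);
            }
        }
    }

    public static void main(String[] args) {
        SketchServerManager sm = new SketchServerManager();
        sm.online.add("rs-a,60020,1");
        registerServersUpInZk(Arrays.asList("rs-a,60020,1", "rs-b,60020,2"), sm);
        System.out.println(sm.recorded); // only the server that had not reported in
    }
}
```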
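Likewise, the suggested dedup of the root/meta handling could be a single method parameterized by which catalog server is being processed. Again a sketch under stated assumptions: Server is a made-up stand-in for HServerInfo plus its HServerLoad, and expireIfAlreadyRunning is a hypothetical name, not the real patch.

```java
import java.util.ArrayList;
import java.util.List;

public class CatalogExpireSketch {
    // Hypothetical stand-in for HServerInfo carrying its HServerLoad region count.
    static class Server {
        final String name;
        final int numberOfRegions;
        Server(String name, int numberOfRegions) {
            this.name = name;
            this.numberOfRegions = numberOfRegions;
        }
    }

    static final List<String> expired = new ArrayList<>();

    /**
     * Shared logic for both root and meta: if the catalog server is online and
     * carrying regions (i.e. it did not just start), expire it so its logs are
     * split before its regions are reassigned.
     */
    static void expireIfAlreadyRunning(Server server, String which) {
        if (server != null && server.numberOfRegions > 0) {
            expired.add(which + ":" + server.name);
        }
    }

    public static void main(String[] args) {
        expireIfAlreadyRunning(new Server("rs-a,60020,1", 3), "root"); // expired
        expireIfAlreadyRunning(new Server("rs-b,60020,2", 0), "meta"); // just started, skipped
        System.out.println(expired);
    }
}
```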
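On the 1000ms sleep: the reference guide's advice amounts to polling the condition on a short interval under an overall timeout, rather than one long fixed sleep. A generic helper along those lines might look like this (a sketch, not an HBase utility; it assumes Java 8+ for BooleanSupplier):

```java
import java.util.function.BooleanSupplier;

public class WaitForSketch {
    /**
     * Polls condition every intervalMs until it is true or timeoutMs elapses.
     * Returns true if the condition became true within the timeout.
     */
    static boolean waitFor(long timeoutMs, long intervalMs, BooleanSupplier condition)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (System.currentTimeMillis() < deadline) {
            if (condition.getAsBoolean()) {
                return true;
            }
            Thread.sleep(intervalMs); // short sleep, e.g. 10ms, rather than 1000ms
        }
        return condition.getAsBoolean();
    }

    public static void main(String[] args) throws InterruptedException {
        long start = System.nanoTime();
        // e.g. waiting for a log dir to go away: poll every 10ms, give up after 5s
        boolean gone = waitFor(5000, 10, () -> System.nanoTime() - start > 50_000_000L);
        System.out.println(gone);
    }
}
```

The test returns as soon as the condition flips, so the common case costs one short interval instead of a full second per check.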
> Concurrent processing of processFaileOver and ServerShutdownHandler may cause
> region to be assigned before log splitting is completed, causing data loss
> --------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-5179
>                 URL: https://issues.apache.org/jira/browse/HBASE-5179
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.2
>            Reporter: chunhui shen
>            Assignee: chunhui shen
>            Priority: Critical
>             Fix For: 0.92.0, 0.94.0, 0.90.6
>
>         Attachments: 5179-90.txt, 5179-90v10.patch, 5179-90v11.patch,
> 5179-90v12.patch, 5179-90v13.txt, 5179-90v14.patch, 5179-90v15.patch,
> 5179-90v16.patch, 5179-90v17.txt, 5179-90v2.patch, 5179-90v3.patch,
> 5179-90v4.patch, 5179-90v5.patch, 5179-90v6.patch, 5179-90v7.patch,
> 5179-90v8.patch, 5179-90v9.patch, 5179-92v17.patch, 5179-v11-92.txt,
> 5179-v11.txt, 5179-v2.txt, 5179-v3.txt, 5179-v4.txt, Errorlog,
> hbase-5179.patch, hbase-5179v10.patch, hbase-5179v12.patch,
> hbase-5179v17.patch, hbase-5179v5.patch, hbase-5179v6.patch,
> hbase-5179v7.patch, hbase-5179v8.patch, hbase-5179v9.patch
>
>
> If the master's failover processing and ServerShutdownHandler's processing
> happen concurrently, the following case may arise:
> 1. The master completes splitLogAfterStartup().
> 2. RegionserverA restarts, and ServerShutdownHandler starts processing it.
> 3. The master starts rebuildUserRegions, and RegionserverA is considered a
> dead server.
> 4. The master starts to assign RegionserverA's regions because it is a dead
> server per step 3.
> However, when doing step 4 (assigning regions), ServerShutdownHandler may
> still be splitting the log; therefore, it may cause data loss.

--
This message is automatically generated by JIRA.