[
https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15207733#comment-15207733
]
Allan Yang commented on HBASE-7006:
-----------------------------------
I may have found a bug in this implementation.
{code:java}
@@ -283,6 +319,9 @@ public class SplitLogManager extends ZooKeeperListener {
}
}
waitForSplittingCompletion(batch, status);
+ // remove recovering regions from ZK
+ this.removeRecoveringRegionsFromZK(serverNames);
+
if (batch.done != batch.installed) {
batch.isDead = true;
SplitLogCounters.tot_mgr_log_split_batch_err.incrementAndGet();
@@ -409,6 +448,171 @@ public class SplitLogManager extends ZooKeeperListener {
return count;
}
{code}
In your logic, you wait for the split batch task to complete, but you remove
the recovering regions from ZK before checking whether every job in the batch
finished without error. Only after that removal do you check whether the batch
is done and, if it is not, resubmit the task in LogReplayHandler.
That is a big problem: you clear a region's recovering status in ZK before the
split-and-replay task is actually done. Although the split task will be
resubmitted, it will skip the regions that are no longer in recovering state.
That means some edits may not have been replayed before the region becomes
readable again, which means data loss.
Could you look into this problem for me?
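For reference, here is a minimal sketch of the reordering I have in mind
(method and field names are taken from the patch hunk above; this is untested
and only illustrates the idea, not a real patch):
{code:java}
waitForSplittingCompletion(batch, status);
if (batch.done != batch.installed) {
  // Some split tasks failed: leave the regions in recovering state in ZK
  // so that the task resubmitted by LogReplayHandler still replays them.
  batch.isDead = true;
  SplitLogCounters.tot_mgr_log_split_batch_err.incrementAndGet();
} else {
  // Clear the recovering state only once the whole batch has finished
  // successfully; otherwise a retry would skip these regions.
  this.removeRecoveringRegionsFromZK(serverNames);
}
{code}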
> [MTTR] Improve Region Server Recovery Time - Distributed Log Replay
> -------------------------------------------------------------------
>
> Key: HBASE-7006
> URL: https://issues.apache.org/jira/browse/HBASE-7006
> Project: HBase
> Issue Type: New Feature
> Components: MTTR
> Reporter: stack
> Assignee: Jeffrey Zhong
> Priority: Critical
> Fix For: 0.98.0, 0.95.1
>
> Attachments: 7006-addendum-3.txt, LogSplitting Comparison.pdf,
> ProposaltoimprovelogsplittingprocessregardingtoHBASE-7006-v2.pdf,
> hbase-7006-addendum.patch, hbase-7006-combined-v1.patch,
> hbase-7006-combined-v4.patch, hbase-7006-combined-v5.patch,
> hbase-7006-combined-v6.patch, hbase-7006-combined-v7.patch,
> hbase-7006-combined-v8.patch, hbase-7006-combined-v9.patch,
> hbase-7006-combined.patch
>
>
> Just saw an interesting issue where a cluster went down hard and 30 nodes had
> 1700 WALs to replay. Replay took almost an hour. It looks like it could run
> faster; much of the time is spent zk'ing and nn'ing.
> Putting in 0.96 so it gets a look at least. Can always punt.