[
https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15207733#comment-15207733
]
Allan Yang commented on HBASE-7006:
-----------------------------------
I may have found a bug in this implementation.
{code:java}
@@ -283,6 +319,9 @@ public class SplitLogManager extends ZooKeeperListener {
}
}
waitForSplittingCompletion(batch, status);
+ // remove recovering regions from ZK
+ this.removeRecoveringRegionsFromZK(serverNames);
+
if (batch.done != batch.installed) {
batch.isDead = true;
SplitLogCounters.tot_mgr_log_split_batch_err.incrementAndGet();
@@ -409,6 +448,171 @@ public class SplitLogManager extends ZooKeeperListener {
return count;
}
{code}
In your logic, you wait for the split batch task to complete, but you remove
the recovering regions from ZK before checking whether every job in the batch
finished without error. Only after that removal do you check whether the batch
is done and, if it is not, resubmit the task in LogReplayHandler.
That is a big problem: you clear a region's recovering status in ZK before the
split-and-replay task is actually done. Although the split task will be
resubmitted, it will skip the regions that are no longer in recovering state.
That means some edits may not have been replayed before the region becomes
readable again, which means data loss.
Could you look into this problem for me?
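For reference, here is a minimal sketch of the reordering I have in mind
(method and field names are taken from the patch hunk above; this is untested
and only illustrates the idea, not a real patch):
{code:java}
waitForSplittingCompletion(batch, status);
if (batch.done != batch.installed) {
  // Some split tasks failed: leave the regions in recovering state in ZK
  // so that the task resubmitted by LogReplayHandler still replays them.
  batch.isDead = true;
  SplitLogCounters.tot_mgr_log_split_batch_err.incrementAndGet();
} else {
  // Clear the recovering state only once the whole batch has finished
  // successfully; otherwise a retry would skip these regions.
  this.removeRecoveringRegionsFromZK(serverNames);
}
{code}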
> [MTTR] Improve Region Server Recovery Time - Distributed Log Replay
> -------------------------------------------------------------------
>
> Key: HBASE-7006
> URL: https://issues.apache.org/jira/browse/HBASE-7006
> Project: HBase
> Issue Type: New Feature
> Components: MTTR
> Reporter: stack
> Assignee: Jeffrey Zhong
> Priority: Critical
> Fix For: 0.98.0, 0.95.1
>
> Attachments: 7006-addendum-3.txt, LogSplitting Comparison.pdf,
> ProposaltoimprovelogsplittingprocessregardingtoHBASE-7006-v2.pdf,
> hbase-7006-addendum.patch, hbase-7006-combined-v1.patch,
> hbase-7006-combined-v4.patch, hbase-7006-combined-v5.patch,
> hbase-7006-combined-v6.patch, hbase-7006-combined-v7.patch,
> hbase-7006-combined-v8.patch, hbase-7006-combined-v9.patch,
> hbase-7006-combined.patch
>
>
> Just saw an interesting issue where a cluster went down hard and 30 nodes had
> 1700 WALs to replay. Replay took almost an hour. It looks like it could run
> faster; much of the time is spent zk'ing and nn'ing.
> Putting in 0.96 so it gets a look at least. Can always punt.