[
https://issues.apache.org/jira/browse/HBASE-13895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14611546#comment-14611546
]
stack commented on HBASE-13895:
-------------------------------
Ok. Added the missing patch and the addendum that fixes the failing
TestAssignmentManagerOnCluster tests. Agree with the fix for the UT (I love unit tests).
For branch-1+ I applied the addendum and checked that I got the whole patch this time.
On branch-2, I applied the original patch plus a version of the master addendum. I
made master the same as the branch-1s. The master addendum makes the logic different. Why,
[~enis]? I'll addendum master if that is intended. I am talking about this hunk in
the master addendum patch:
{code}
@@ -891,12 +891,16 @@ public class AssignmentManager {
           LOG.warn("Server " + server + " region CLOSE RPC returned false for " +
             region.getRegionNameAsString());
         } catch (Throwable t) {
+          long sleepTime = 0;
+          Configuration conf = this.server.getConfiguration();
           if (t instanceof RemoteException) {
             t = ((RemoteException)t).unwrapRemoteException();
           }
-          if (t instanceof NotServingRegionException
+          if (t instanceof RegionServerAbortedException
               || t instanceof RegionServerStoppedException
               || t instanceof ServerNotRunningYetException) {
+
+          } else if (t instanceof NotServingRegionException) {
             LOG.debug("Offline " + region.getRegionNameAsString()
               + ", it's not any more on " + server, t);
             regionStates.updateRegionState(region, State.OFFLINE);
{code}
whereas in the original patch we have this (it sets a sleepTime...)
{code}
@@ -1866,11 +1867,19 @@ public class AssignmentManager extends ZooKeeperListener {
           LOG.warn("Server " + server + " region CLOSE RPC returned false for " +
             region.getRegionNameAsString());
         } catch (Throwable t) {
+          long sleepTime = 0;
+          Configuration conf = this.server.getConfiguration();
           if (t instanceof RemoteException) {
             t = ((RemoteException)t).unwrapRemoteException();
           }
           boolean logRetries = true;
-          if (t instanceof NotServingRegionException
+          if (t instanceof RegionServerAbortedException) {
+            // RS is aborting, we cannot offline the region since the region may need to do WAL
+            // recovery. Until we see the RS expiration, we should retry.
+            sleepTime = 1 + conf.getInt(RpcClient.FAILED_SERVER_EXPIRY_KEY,
+              RpcClient.FAILED_SERVER_EXPIRY_DEFAULT);
+
+          } else if (t instanceof NotServingRegionException
               || t instanceof RegionServerStoppedException
               || t instanceof ServerNotRunningYetException) {
{code}
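To spell out what I read the original-patch hunk as doing, here is a rough standalone sketch. It is not the AssignmentManager code: the exception classes are local stand-ins, the Map stands in for Configuration, the key string and the 2000 ms default are my assumptions about RpcClient.FAILED_SERVER_EXPIRY_KEY and FAILED_SERVER_EXPIRY_DEFAULT, and the OFFLINE_REGION outcome is inferred from the context lines of the master hunk rather than shown in this one.

```java
import java.util.HashMap;
import java.util.Map;

public class CloseRetrySketch {
  // Local stand-ins for the HBase exception types named in the hunk (not the real classes).
  static class NotServingRegionException extends Exception {}
  static class RegionServerAbortedException extends Exception {}
  static class RegionServerStoppedException extends Exception {}
  static class ServerNotRunningYetException extends Exception {}

  // Assumed values; in the patch these come from RpcClient constants.
  static final String FAILED_SERVER_EXPIRY_KEY = "hbase.ipc.client.failed.servers.expire";
  static final int FAILED_SERVER_EXPIRY_DEFAULT = 2000;

  enum Action { RETRY_AFTER_SLEEP, OFFLINE_REGION, OTHER }

  // Aborting RS: keep retrying, because the region may still need WAL recovery.
  // The other three exception types fall through to offlining the region.
  static Action classify(Throwable t) {
    if (t instanceof RegionServerAbortedException) {
      return Action.RETRY_AFTER_SLEEP;
    } else if (t instanceof NotServingRegionException
        || t instanceof RegionServerStoppedException
        || t instanceof ServerNotRunningYetException) {
      return Action.OFFLINE_REGION;
    }
    return Action.OTHER;
  }

  // Mirrors "sleepTime = 1 + conf.getInt(KEY, DEFAULT)" for the aborted case, 0 otherwise.
  static long sleepTimeFor(Throwable t, Map<String, Integer> conf) {
    if (t instanceof RegionServerAbortedException) {
      return 1 + conf.getOrDefault(FAILED_SERVER_EXPIRY_KEY, FAILED_SERVER_EXPIRY_DEFAULT);
    }
    return 0;
  }

  public static void main(String[] args) {
    Map<String, Integer> conf = new HashMap<>();
    Throwable aborted = new RegionServerAbortedException();
    Throwable notServing = new NotServingRegionException();
    System.out.println(classify(aborted) + " sleep=" + sleepTimeFor(aborted, conf));
    System.out.println(classify(notServing) + " sleep=" + sleepTimeFor(notServing, conf));
  }
}
```

The point of the separate first branch is that an aborted server is never offlined straight away; the master addendum hunk folds it in with the stopped and not-running cases and sets no sleep time, which is the difference I am asking about.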
Thanks for catching my misapply.
> DATALOSS: Region assigned before WAL replay when abort
> ------------------------------------------------------
>
> Key: HBASE-13895
> URL: https://issues.apache.org/jira/browse/HBASE-13895
> Project: HBase
> Issue Type: Bug
> Affects Versions: 1.2.0
> Reporter: stack
> Assignee: stack
> Priority: Critical
> Fix For: 2.0.0, 1.2.0, 1.1.2, 1.3.0
>
> Attachments: 13895.master.patch, hbase-13895_addendum-master.patch,
> hbase-13895_addendum.patch, hbase-13895_v1-branch-1.1.patch
>
>
> Opening a place holder till finish analysis.
> I have dataloss running ITBLL at 3B (testing HBASE-13877). Most obvious
> culprit is the double-assignment that I can see.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)