[ https://issues.apache.org/jira/browse/HBASE-18525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16115670#comment-16115670 ]
Umesh Agashe commented on HBASE-18525:
--------------------------------------
Hi [~tedyu],
I think that with the current code UnassignProcedure is expected to fail, so in
the short term the test can be changed accordingly.
The test instantiates SocketTimeoutRsExecutor with maxServerRetries of 3. For
AssignProcedure, the number of retries is determined by
AssignmentManager.getAssignMaxAttempts(), but UnassignProcedure is marked as
failed on its first failure. The long-term fix is to give the unassign
operation a max-attempts limit, as assign already has.
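A minimal sketch of the long-term idea, to make the difference concrete. This is not the actual HBase code: the class UnassignRetrySketch, the CloseRegionCall interface and the unassignWithRetries method are hypothetical names made up for illustration, and a real fix would have to live inside the procedure framework rather than in a plain loop.

```java
import java.io.IOException;
import java.net.SocketTimeoutException;

public class UnassignRetrySketch {

    /** Hypothetical stand-in for the RPC that asks a RegionServer to close a region. */
    interface CloseRegionCall {
        void call() throws IOException;
    }

    /**
     * Tries the call up to maxAttempts times and returns the number of attempts used.
     * Rethrows the last IOException once all attempts are exhausted.
     */
    static int unassignWithRetries(CloseRegionCall rpc, int maxAttempts) throws IOException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                rpc.call();
                return attempt; // success: report how many attempts it took
            } catch (IOException e) {
                last = e; // remember the failure and try again if attempts remain
            }
        }
        throw last;
    }

    public static void main(String[] args) throws IOException {
        // Simulated server that times out twice and then accepts the request.
        int[] failuresLeft = {2};
        CloseRegionCall flaky = () -> {
            if (failuresLeft[0] > 0) {
                failuresLeft[0]--;
                throw new SocketTimeoutException("simulated timeout");
            }
        };
        System.out.println("attempts used: " + unassignWithRetries(flaky, 3));
    }
}
```

With maxAttempts of 1 the sketch behaves like the current unassign path as I understand it (the first failure is final); with a larger value, transient socket timeouts are tolerated in the same way they are for assign.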
Before HBASE-18491, we simply assumed that when communication with the RS
fails, ServerCrashProcedure will take care of reassigning the region, so the
unassign operation can be considered successful. That assumption is one cause
of the failure of the other test re-enabled in HBASE-18491.
Let me know your thoughts.
Thanks, Umesh
> TestAssignmentManager#testSocketTimeout fails in master branch
> --------------------------------------------------------------
>
> Key: HBASE-18525
> URL: https://issues.apache.org/jira/browse/HBASE-18525
> Project: HBase
> Issue Type: Bug
> Reporter: Ted Yu
> Assignee: Ted Yu
> Attachments: 18525.v1.txt
>
>
> Toward the end of the test output, I saw:
> {code}
> 2017-08-05 03:30:16,591 INFO [Time-limited test] assignment.TestAssignmentManager(446): ExecutionException
> java.util.concurrent.ExecutionException: org.apache.hadoop.hbase.master.procedure.ServerCrashException: ServerCrashProcedure pid=3, server=localhost,103,1
> at org.apache.hadoop.hbase.master.procedure.ProcedureSyncWait$ProcedureFuture.get(ProcedureSyncWait.java:104)
> at org.apache.hadoop.hbase.master.procedure.ProcedureSyncWait$ProcedureFuture.get(ProcedureSyncWait.java:62)
> at org.apache.hadoop.hbase.master.assignment.TestAssignmentManager.waitOnFuture(TestAssignmentManager.java:444)
> at org.apache.hadoop.hbase.master.assignment.TestAssignmentManager.testSocketTimeout(TestAssignmentManager.java:255)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
> at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
> at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
> at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
> at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
> at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
> at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.hadoop.hbase.master.procedure.ServerCrashException: ServerCrashProcedure pid=3, server=localhost,103,1
> at org.apache.hadoop.hbase.master.assignment.UnassignProcedure.updateTransition(UnassignProcedure.java:169)
> at org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.execute(RegionTransitionProcedure.java:274)
> at org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.execute(RegionTransitionProcedure.java:57)
> at org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:847)
> at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1440)
> at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1209)
> at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$800(ProcedureExecutor.java:79)
> at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1719)
> {code}
> This test failure seems to happen after HBASE-18491 was checked in.
> Looking at the change in UnassignProcedure, it seems we should handle the two
> conditions differently:
> {code}
> if (serverCrashed.get() || !isServerOnline(env, regionNode)) {
> {code}
> With attached patch, TestAssignmentManager#testSocketTimeout and
> TestServerCrashProcedure#testRecoveryAndDoubleExecutionOnRsWithMeta pass.