[ https://issues.apache.org/jira/browse/HBASE-20137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16388528#comment-16388528 ]

Mike Drob commented on HBASE-20137:
-----------------------------------

{code:title=RegionTransitionProcedure.java}
-      remoteCallFailed(env, targetServer,
-          new FailedRemoteDispatchException(this + " to " + targetServer));
+      remoteCallFailed(env, targetServer, targetServer == null?
+          new FailedRemoteDispatchException():
+          new FailedRemoteDispatchException(targetServer.toShortString()));
{code}
When is targetServer null? If the remote peer is down we should still know who 
we were trying to reach, and it would be good to report why we failed. Logging 
that we were trying to talk to a null server is likely useless for operators 
because they have nowhere to look to corroborate or do RCA on why _that_ server 
went down.
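
Roughly the shape I'd expect instead -- a sketch only, not the actual patch; 
`lastKnownLocation` is a made-up placeholder (e.g. the region's last known 
location pulled from its RegionStateNode), not an existing field:
{code:title=RegionTransitionProcedure.java (sketch)}
// Sketch: even when targetServer resolves to null, say which server we were trying
// to reach so operators have something to go look at. lastKnownLocation is hypothetical.
final String who = targetServer != null
    ? targetServer.toShortString()
    : "unresolved server (last known location " + lastKnownLocation + ")";
remoteCallFailed(env, targetServer,
    new FailedRemoteDispatchException("Failed dispatch of " + this + " to " + who));
{code}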

{code:title=UnassignProcedure.java}
+        // Return false so this procedure stays in suspended state. It will be woken up by
+        // ServerCrashProcedure when it notices this RIT and calls this method again but with
+        // a SCPException -- see above.
+        // TODO: Add a SCP as a new subprocedure that we now come to depend on.
+        return false;
{code}
Do we need this or not? It's unclear what will currently trigger the SCP. Is 
adding it here an optimization for MTTR but not required for correctness?
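
If we do decide the SCP dependency should be explicit, something like the below 
is what I'd picture -- purely a sketch, and `isProcessingServer(...)` is a 
placeholder name for whatever "is crash handling already in flight for this 
server" check actually exists, not a real API I'm vouching for:
{code:title=UnassignProcedure.java (sketch)}
// Sketch, not the patch: make sure something is scheduled to wake us before we suspend.
// isProcessingServer(...) is a hypothetical placeholder for the real check.
final ServerName crashedServer = regionNode.getRegionLocation();
if (!env.getAssignmentManager().isProcessingServer(crashedServer)) {
  // No SCP in flight for this server: expire it so one gets scheduled; that SCP is what
  // later calls remoteCallFailed again with a ServerCrashException and wakes us up.
  env.getMasterServices().getServerManager().expireServer(crashedServer);
}
return false; // stay suspended until the SCP wakes this procedure
{code}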

I saw something similar on a real cluster recently running a slightly older 
branch-2, where a region server went down and the master kept failing to do the 
unassign. The patch looks good overall, but I don't understand how we get out of 
the loop.
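
To make what I mean by "the loop" concrete: per the description below, the 
expire can be rejected when shutdown is already in progress, so no 
ServerCrashProcedure ever wakes the suspended procedure. A sketch of the 
behaviour I'd expect, assuming a hypothetical boolean-returning expireServer() 
(the real method may well be void):
{code:title=UnassignProcedure.java (sketch)}
// Sketch only: if the expire is rejected (e.g. cluster shutdown already in progress),
// nothing will wake this procedure, so don't suspend -- let it make progress instead.
final boolean expired =
    env.getMasterServices().getServerManager().expireServer(regionNode.getRegionLocation());
if (!expired) {
  LOG.warn("Expire of " + regionNode.getRegionLocation() + " rejected; not suspending " + this);
  return true; // handled -- do not stay suspended waiting for an SCP that will never come
}
return false; // suspend; the SCP triggered by the expire will wake us with a ServerCrashException
{code}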

Address checkstyle?

> TestRSGroups is flakey
> ----------------------
>
>                 Key: HBASE-20137
>                 URL: https://issues.apache.org/jira/browse/HBASE-20137
>             Project: HBase
>          Issue Type: Bug
>          Components: flakey
>    Affects Versions: 2.0.0-beta-2
>            Reporter: stack
>            Assignee: stack
>            Priority: Major
>             Fix For: 2.0.0
>
>         Attachments: HBASE-20137.branch-2.001.patch, 
> HBASE-20137.branch-2.002.patch, HBASE-20137.branch-2.003.patch, 
> HBASE-20137.branch-2.003.patch
>
>
> It was the single test that failed the hbase-2 nightlies in #440 at the 
> hadoop2 stage.
> The failure manifests as a timeout. It actually has an interesting cause that 
> calls into question some of the clauses in 
> UnassignProcedure#remoteCallFailed.
> We are running a disabletable concurrent with a shutdown. pid=309 is the 
> disable. pid=311 is the interesting one. The below is a little hard to read 
> -- the exception 'message' is the current procedure as a String... hard to 
> parse, fixing -- but we are trying to unassign as part of the disabletable. 
> Our RPC fails because the server we are trying to RPC to is 
> currently being processed as crashed (pid=308 is a servercrashprocedure for 
> this server). As part of the processing of the failed RPC we will expire the 
> server -- if we can't RPC to it, it must be gone. The current procedure is 
> then suspended until it gets woken up by the servercrashprocedure triggered 
> by the expire... only in this case we are shutting down, so the expire is 
> ignored... The current procedure is left in its suspended state. This prevents 
> the Master from going down. So we time out.
> 2018-03-05 11:29:22,507 INFO  [PEWorker-13] 
> assignment.RegionTransitionProcedure(213): Dispatch pid=311, ppid=309, 
> state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure 
> table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043, 
> server=1cfd208ff882,40584,1520249102524; rit=CLOSING, 
> location=1cfd208ff882,40584,1520249102524
> 2018-03-05 11:29:22,508 WARN  [PEWorker-13] 
> assignment.RegionTransitionProcedure(187): Remote call failed pid=311, 
> ppid=309, state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure 
> table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043, 
> server=1cfd208ff882,40584,1520249102524; rit=CLOSING, 
> location=1cfd208ff882,40584,1520249102524; exception=pid=311, ppid=309, 
> state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure 
> table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043, 
> server=1cfd208ff882,40584,1520249102524 to 1cfd208ff882,40584,1520249102524
> 2018-03-05 11:29:22,508 WARN  [PEWorker-13] 
> assignment.UnassignProcedure(276): Expiring server pid=311, ppid=309, 
> state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure 
> table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043, 
> server=1cfd208ff882,40584,1520249102524; rit=CLOSING, 
> location=1cfd208ff882,40584,1520249102524, 
> exception=org.apache.hadoop.hbase.master.assignment.FailedRemoteDispatchException:
>  pid=311, ppid=309, state=RUNNABLE:REGION_TRANSITION_DISPATCH; 
> UnassignProcedure table=Group_ns:testKillRS, 
> region=de7534c208a06502537cd95c248b3043, 
> server=1cfd208ff882,40584,1520249102524 to 1cfd208ff882,40584,1520249102524
> 2018-03-05 11:29:22,508 WARN  [PEWorker-13] master.ServerManager(580): 
> Expiration of 1cfd208ff882,40584,1520249102524 but server shutdown already in 
> progress
> I need to cater for the case where the server expire is rejected.


