[jira] [Commented] (HBASE-20137) TestRSGroups is flakey

stack (JIRA) Tue, 06 Mar 2018 13:40:48 -0800

    [ 
https://issues.apache.org/jira/browse/HBASE-20137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16388571#comment-16388571
 ]


stack commented on HBASE-20137:
-------------------------------

bq. when is target server null? if the remote peer is down we should still know 
who we were trying to reach, and would be good to report why we failed. logging 
that we were trying to talk to a null server is likely useless for operators 
because they have nowhere to look to corroborate or do RCA on why that server 
went down.

Understood. When we log, we generally log all procedure context including the 
server we are trying to talk to, always, intentionally, so we can trace through 
all steps.

If target server is null at this stage, something is seriously wrong. Tracing 
across the life of the procedure will hopefully turn up where we went awry (it's 
null in the test, where we mock loads of context and a non-null value would be 
awkward to insert; see the TestAssignProcedure failure above for full context. 
It used to just print as 'null'... could have just done the same. Gave it little 
mind because this is actually redundant info).

Including target server in the FailedRemoteDispatchException message is 
redundant (the old message, which included the procedure context again, was 
super redundant and confusing). Should probably just do w/o target server in 
the message of this FailedRemoteDispatchException filler exception.

This FailedRemoteDispatchException is an oddball. It's a marker. It denotes the 
case where we failed to queue the unassign close rpc. The only reason we'd fail 
at this point is that there is no entry for the server we are trying to contact 
-- it's crashed most likely (I tried to backfill a 'cause' but there was none to 
insert. Refactoring to insert an exception would probably make sense one day... 
but would take work to manage the ripple given we are suspended at this point 
and are trying to exit the execution promptly...).
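
To illustrate the marker idea, here is a minimal sketch. It is not the HBase 
class: DispatchFailedMarker, classify and the returned strings are all 
hypothetical names made up for the example. The point is only that the handler 
branches on the exception type, so the message can stay short and there is no 
cause to carry.

```java
import java.io.IOException;

class MarkerExceptionSketch {
  /** Hypothetical marker: carries no cause, only a short message. */
  static class DispatchFailedMarker extends IOException {
    DispatchFailedMarker(String message) {
      super(message);
    }
  }

  /** Hypothetical handler that branches on the exception type alone. */
  static String classify(IOException e) {
    if (e instanceof DispatchFailedMarker) {
      // Nothing was queued, so the target is presumed gone.
      return "TARGET_PRESUMED_GONE";
    }
    // Anything else is treated as an ordinary remote failure.
    return "REMOTE_CALL_FAILED";
  }

  public static void main(String[] args) {
    System.out.println(classify(new DispatchFailedMarker("failed to queue close rpc")));
    System.out.println(classify(new IOException("connection reset")));
  }
}
```

The caller already logs the full procedure context, so repeating it (or the 
target server) in the marker's message adds nothing.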

bq.  Is adding it here an optimization for MTTR but not required for 
correctness?

Right. Something to consider. Old comment IIRC. Would love to spend time on 
AMv2. Currently am in minimal, evidence-based changes mode.

bq. ...where region server went down and master kept failing to do unassign

That sounds related. AMv2 depends on ServerCrashProcedure running to clean up 
ongoing assigns and unassigns that will never succeed (because server went 
away... there is NO timeout as yet). The fix here is for a corner case where 
the SCP doesn't get scheduled... because we are shutting down. We had no 
handling for this scenario.
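
A toy model of that corner case, as a sketch only. ToyServerManager, Outcome 
and onRemoteCallFailed are hypothetical and are not the HBase API; in 
particular, having expireServer report back whether crash processing was 
scheduled is an assumption made for the example, and what the patch actually 
does is not shown here. The idea is that the caller only suspends when 
something will later wake it.

```java
import java.util.HashSet;
import java.util.Set;

class ExpireRejectedSketch {
  /** Hypothetical stand-in for a server manager; not the HBase ServerManager. */
  static class ToyServerManager {
    private final boolean shuttingDown;
    private final Set<String> crashProcessingScheduled = new HashSet<>();

    ToyServerManager(boolean shuttingDown) {
      this.shuttingDown = shuttingDown;
    }

    /** Returns true only if crash processing was actually scheduled. */
    boolean expireServer(String serverName) {
      if (shuttingDown) {
        return false; // expire rejected: nobody will wake a suspended caller
      }
      crashProcessingScheduled.add(serverName);
      return true;
    }
  }

  enum Outcome { SUSPEND_AND_WAIT_FOR_CRASH_PROCESSING, DO_NOT_SUSPEND }

  /** Suspend only when the expire was accepted; otherwise do not wait. */
  static Outcome onRemoteCallFailed(ToyServerManager sm, String serverName) {
    if (sm.expireServer(serverName)) {
      return Outcome.SUSPEND_AND_WAIT_FOR_CRASH_PROCESSING;
    }
    return Outcome.DO_NOT_SUSPEND;
  }

  public static void main(String[] args) {
    String server = "1cfd208ff882,40584,1520249102524";
    System.out.println(onRemoteCallFailed(new ToyServerManager(false), server));
    System.out.println(onRemoteCallFailed(new ToyServerManager(true), server));
  }
}
```

Without the second branch, the shutting-down case would suspend and wait for a 
wake-up that never comes, which is the hang seen in the test.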

I looked at the checkstyle and punted on it because it meant changing a bunch 
of lines.... let me do it in follow-up.

Appreciate the review. Thanks [~mdrob]

> TestRSGroups is flakey
> ----------------------
>
>                 Key: HBASE-20137
>                 URL: https://issues.apache.org/jira/browse/HBASE-20137
>             Project: HBase
>          Issue Type: Bug
>          Components: flakey
>    Affects Versions: 2.0.0-beta-2
>            Reporter: stack
>            Assignee: stack
>            Priority: Major
>             Fix For: 2.0.0
>
>         Attachments: HBASE-20137.branch-2.001.patch, 
> HBASE-20137.branch-2.002.patch, HBASE-20137.branch-2.003.patch, 
> HBASE-20137.branch-2.003.patch
>
>
> It was the single test that failed the hbase-2 nightlies in #440 at the 
> hadoop2 stage.
> The failure manifests as a timeout. It actually has an interesting cause 
> calling into question some of the clauses in 
> UnassignProcedure#remoteCallFailed.
> We are running a disabletable concurrent with a shutdown. pid=309 is the 
> disable. pid=311 is the interesting one. The below is a little hard to read 
> -- the exception 'message' is the current procedure as a String... hard 
> to parse, fixing -- but we are trying to unassign as part of the 
> disabletable. Our RPC fails because the server we are trying to rpc to is 
> currently being processed as crashed (pid=308 is a servercrashprocedure for 
> this server). As part of the processing of the failed RPC we will expire the 
> server -- if we can't RPC to it, it must be gone. The current procedure is 
> then suspended until it gets woken up by the servercrashprocedure triggered 
> by the expire.... only in this case we are shutting down so the expire is 
> ignored... The current procedure is left in its suspend state. This prevents 
> the Master going down. So we time out.
> 2018-03-05 11:29:22,507 INFO  [PEWorker-13] 
> assignment.RegionTransitionProcedure(213): Dispatch pid=311, ppid=309, 
> state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure 
> table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043, 
> server=1cfd208ff882,40584,1520249102524; rit=CLOSING, 
> location=1cfd208ff882,40584,1520249102524
> 2018-03-05 11:29:22,508 WARN  [PEWorker-13] 
> assignment.RegionTransitionProcedure(187): Remote call failed pid=311, 
> ppid=309, state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure 
> table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043, 
> server=1cfd208ff882,40584,1520249102524; rit=CLOSING, 
> location=1cfd208ff882,40584,1520249102524; exception=pid=311, ppid=309, 
> state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure 
> table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043, 
> server=1cfd208ff882,40584,1520249102524 to 1cfd208ff882,40584,1520249102524
> 2018-03-05 11:29:22,508 WARN  [PEWorker-13] 
> assignment.UnassignProcedure(276): Expiring server pid=311, ppid=309, 
> state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure 
> table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043, 
> server=1cfd208ff882,40584,1520249102524; rit=CLOSING, 
> location=1cfd208ff882,40584,1520249102524, 
> exception=org.apache.hadoop.hbase.master.assignment.FailedRemoteDispatchException:
>  pid=311, ppid=309, state=RUNNABLE:REGION_TRANSITION_DISPATCH; 
> UnassignProcedure table=Group_ns:testKillRS, 
> region=de7534c208a06502537cd95c248b3043, 
> server=1cfd208ff882,40584,1520249102524 to 1cfd208ff882,40584,1520249102524
> 2018-03-05 11:29:22,508 WARN  [PEWorker-13] master.ServerManager(580): 
> Expiration of 1cfd208ff882,40584,1520249102524 but server shutdown already in 
> progress
> I need to cater for case where the expire server is rejected.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
