[
https://issues.apache.org/jira/browse/HBASE-20137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16388571#comment-16388571
]
stack commented on HBASE-20137:
-------------------------------
bq. when is target server null? if the remote peer is down we should still know
who we were trying to reach, and would be good to report why we failed. logging
that we were trying to talk to a null server is likely useless for operators
because they have nowhere to look to corroborate or do RCA on why that server
went down.
Understood. When we log, we generally log all procedure context including the
server we are trying to talk to, always, intentionally, so we can trace through
all steps.
If target server is null at this stage something is seriously wrong. Tracing
across the life of the procedure will hopefully turn up where we went awry. (It's
null in the test, where we mock loads of context and a non-null would be awkward
to insert; see the TestAssignProcedure failure above for full context. It used
to just print as 'null'... could have just done the same. Gave it little mind
because this is actually redundant info.)
Including target server in the FailedRemoteDispatchException message is
redundant (the old message, which included the procedure context again, was super
redundant and confusing). Should probably just do w/o the target server in the
message of this filler FailedRemoteDispatchException.
This FailedRemoteDispatchException is an oddball. It's a marker. It denotes the
case where we failed to queue the unassign close rpc. The only reason we'd fail
at this point is that there is no entry for the server we are trying to contact
-- it's crashed most likely (I tried to backfill a 'cause' but there is none to
insert. Refactoring to insert an exception would probably make sense one day...
but would take work to manage the ripple given we are suspended at this point
and are trying to exit the execution promptly...).
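To illustrate the marker idea, here is a rough sketch only. The names
DispatchMarkerException and Dispatcher are made up for the example and are not
the real HBase classes; the per-server queue map is an assumption about the
general shape, not a description of the actual dispatcher.

```java
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical sketch; none of these names are the real HBase classes.
class MarkerDispatchSketch {

  // Marker exception: carries no cause and only a short fixed message.
  static class DispatchMarkerException extends Exception {
    DispatchMarkerException() {
      super("failed to queue remote call; no dispatch entry for target server");
    }
  }

  // One queue of pending remote operations per known server.
  static class Dispatcher {
    private final Map<String, Queue<String>> queues = new ConcurrentHashMap<>();

    void addServer(String server) {
      queues.putIfAbsent(server, new ConcurrentLinkedQueue<>());
    }

    void removeServer(String server) {
      queues.remove(server);
    }

    // Returns false when there is no entry for the server (e.g. it crashed).
    boolean queue(String server, String operation) {
      Queue<String> q = queues.get(server);
      if (q == null) {
        return false;
      }
      q.add(operation);
      return true;
    }
  }

  // Queues a close for the region, or throws the marker if that is impossible.
  static void dispatchClose(Dispatcher d, String server, String region)
      throws DispatchMarkerException {
    if (!d.queue(server, "close " + region)) {
      throw new DispatchMarkerException();
    }
  }

  public static void main(String[] args) {
    Dispatcher d = new Dispatcher();
    d.addServer("rs-a");
    d.removeServer("rs-a"); // simulate the server going away
    try {
      dispatchClose(d, "rs-a", "region-1");
      System.out.println("queued");
    } catch (DispatchMarkerException e) {
      System.out.println("marker: " + e.getMessage());
    }
  }
}
```

The caller only needs to know that queueing failed; the procedure context is
already in the surrounding log line, so the marker's message stays short.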
bq. Is adding it here an optimization for MTTR but not required for
correctness?
Right. Something to consider. Old comment IIRC. Would love to spend time on
AMv2. Currently am in minimal, evidence-based changes mode.
bq. ...where region server went down and master kept failing to do unassign
That sounds related. AMv2 depends on ServerCrashProcedure running to clean up
ongoing assigns and unassigns that will never succeed (because server went
away... there is NO timeout as yet). The fix here is for a corner case where
the SCP doesn't get scheduled... because we are shutting down. We had no
handling for this scenario.
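A rough sketch of the shape of that corner case follows. It is not the patch
itself: ServerTracker, its boolean-returning expire(), and the Outcome enum are
all hypothetical names invented for the illustration. The assumption is that
the caller can learn whether the expire was accepted and only suspends when a
crash procedure will actually come along to wake it.

```java
// Hypothetical sketch; none of these names are the real HBase classes.
class ExpireRejectedSketch {

  enum Outcome {
    SUSPEND_UNTIL_CRASH_PROCEDURE, // a crash procedure was scheduled and will wake us
    FINISH_WITHOUT_WAITING         // nobody will wake us, so do not suspend
  }

  // Stand-in for whatever component decides whether to expire a server.
  static class ServerTracker {
    private volatile boolean shuttingDown;

    void beginShutdown() {
      shuttingDown = true;
    }

    // Returns true if crash handling was scheduled, false if the expire
    // was rejected (in this sketch: because shutdown is in progress).
    boolean expire(String server) {
      return !shuttingDown;
    }
  }

  // Decides what a procedure does after its remote call failed.
  static Outcome onRemoteCallFailed(ServerTracker tracker, String server) {
    if (tracker.expire(server)) {
      return Outcome.SUSPEND_UNTIL_CRASH_PROCEDURE;
    }
    // Expire rejected: suspending here would leave the procedure stuck forever.
    return Outcome.FINISH_WITHOUT_WAITING;
  }

  public static void main(String[] args) {
    ServerTracker tracker = new ServerTracker();
    System.out.println(onRemoteCallFailed(tracker, "rs-a"));
    tracker.beginShutdown();
    System.out.println(onRemoteCallFailed(tracker, "rs-a"));
  }
}
```

The point is only that a rejected expire has to be visible to the caller;
otherwise the procedure suspends with nothing left to wake it.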
I looked at the checkstyle and punted on it because it meant changing a bunch
of lines.... let me do it in follow-up.
Appreciate the review. Thanks [~mdrob]
> TestRSGroups is flakey
> ----------------------
>
> Key: HBASE-20137
> URL: https://issues.apache.org/jira/browse/HBASE-20137
> Project: HBase
> Issue Type: Bug
> Components: flakey
> Affects Versions: 2.0.0-beta-2
> Reporter: stack
> Assignee: stack
> Priority: Major
> Fix For: 2.0.0
>
> Attachments: HBASE-20137.branch-2.001.patch,
> HBASE-20137.branch-2.002.patch, HBASE-20137.branch-2.003.patch,
> HBASE-20137.branch-2.003.patch
>
>
> It was the single test that failed the hbase-2 nightlies in #440 at the
> hadoop2 stage.
> The failure manifests as a timeout. It actually has an interesting cause
> calling into question some of the clauses in
> UnassignProcedure#remoteCallFailed.
> We are running a disabletable concurrent with a shutdown. pid=309 is the
> disable. pid=311 is the interesting one. The below is a little hard to read
> -- the exception 'message' is the current procedure as a String... hard
> to parse, fixing -- but we are trying to unassign as part of the
> disabletable. Our RPC fails because the server we are trying to rpc to is
> currently being processed as crashed (pid=308 is a servercrashprocedure for
> this server). As part of the processing of the failed RPC we will expire the
> server -- if we can't RPC to it, it must be gone. The current procedure is
> then suspended until it gets woken up by the servercrashprocedure triggered
> by the expire.... only in this case we are shutting down so the expire is
> ignored... The current procedure is left in its suspend state. This prevents
> the Master going down. So we time out.
> 2018-03-05 11:29:22,507 INFO [PEWorker-13]
> assignment.RegionTransitionProcedure(213): Dispatch pid=311, ppid=309,
> state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure
> table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043,
> server=1cfd208ff882,40584,1520249102524; rit=CLOSING,
> location=1cfd208ff882,40584,1520249102524
> 2018-03-05 11:29:22,508 WARN [PEWorker-13]
> assignment.RegionTransitionProcedure(187): Remote call failed pid=311,
> ppid=309, state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure
> table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043,
> server=1cfd208ff882,40584,1520249102524; rit=CLOSING,
> location=1cfd208ff882,40584,1520249102524; exception=pid=311, ppid=309,
> state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure
> table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043,
> server=1cfd208ff882,40584,1520249102524 to 1cfd208ff882,40584,1520249102524
> 2018-03-05 11:29:22,508 WARN [PEWorker-13]
> assignment.UnassignProcedure(276): Expiring server pid=311, ppid=309,
> state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure
> table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043,
> server=1cfd208ff882,40584,1520249102524; rit=CLOSING,
> location=1cfd208ff882,40584,1520249102524,
> exception=org.apache.hadoop.hbase.master.assignment.FailedRemoteDispatchException:
> pid=311, ppid=309, state=RUNNABLE:REGION_TRANSITION_DISPATCH;
> UnassignProcedure table=Group_ns:testKillRS,
> region=de7534c208a06502537cd95c248b3043,
> server=1cfd208ff882,40584,1520249102524 to 1cfd208ff882,40584,1520249102524
> 2018-03-05 11:29:22,508 WARN [PEWorker-13] master.ServerManager(580):
> Expiration of 1cfd208ff882,40584,1520249102524 but server shutdown already in
> progress
> I need to cater for the case where the server expire is rejected.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)