[
https://issues.apache.org/jira/browse/HBASE-20796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16525903#comment-16525903
]
stack commented on HBASE-20796:
-------------------------------
In short, the issue is that handleFailure is called twice: once by SCP cleaning
up stuck procedures, and then much later, after the assign restarts and succeeds
against another server, when the RPC to the crashed server throws an exception
because the remote server went down and broke the connection. Only one of these
calls should be 'handled', and it's easy to tell which. As can be seen in the
script above, the second call shows up with wonky state: i.e. the region is
OPEN though the procedure was still at the OPENING phase.
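To make the distinction concrete, here is a minimal standalone sketch of such a
guard (stand-in types and names, not the actual AssignProcedure code): honor
the failure only while we are still OPENING against the server that failed.
{code}
// Minimal sketch with stand-in names; not the actual HBase code.
public class StaleFailureGuard {
  enum State { OFFLINE, OPENING, OPEN }

  // Honor a remote-call failure only if the procedure is still OPENING
  // against the server that failed; anything else is the stale second call
  // (SCP already handled the crash and the assign has moved on).
  static boolean shouldHandle(State state, String dispatchedTo, String failed) {
    return state == State.OPENING && dispatchedTo.equals(failed);
  }

  public static void main(String[] args) {
    // First call, from SCP: still OPENING on the crashed server -> handle.
    System.out.println(shouldHandle(State.OPENING, "vd0410", "vd0410")); // true
    // Second call, from the broken RPC: region already OPEN elsewhere -> eat.
    System.out.println(shouldHandle(State.OPEN, "vc0614", "vd0410")); // false
  }
}
{code}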
Ideally the SCP call to handleFailure would break any outstanding RPCs by
interrupting them, but Procedures and the RPC Dispatch are intentionally
disconnected and run at their own speeds. Let me file a follow-on to add an
interrupt to the RemoteDispatch interface; there is none currently.
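Purely as illustration of what that follow-on might add (all names
hypothetical; nothing like this exists in the dispatcher today), an
interruptible dispatch could track in-flight calls by procedure id:
{code}
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch only; the real dispatcher has no interrupt today.
class InterruptibleDispatch {
  private final ExecutorService pool = Executors.newCachedThreadPool();
  private final ConcurrentMap<Long, Future<?>> inFlight =
      new ConcurrentHashMap<>();

  void dispatch(long procId, Runnable rpc) {
    inFlight.put(procId, pool.submit(rpc));
  }

  // The proposed addition: break the outstanding RPC for a procedure so a
  // caller (e.g. SCP) can kill the doomed call now rather than letting it
  // fail on its own minutes later.
  void interrupt(long procId) {
    Future<?> f = inFlight.remove(procId);
    if (f != null) {
      f.cancel(true); // interrupts the thread running the RPC
    }
  }
}
{code}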
Meantime, here's a patch that will eat and log handleFailures that come in
with unexpected state. This is only needed on the assign side. The unassign
failure handling (remoteCallFailed) works differently, always deferring to SCP;
if a stuck RPC happens on unassign, we already log and eat it:
LOG.info("Ignoring; waiting on ServerCrashProcedure", exception);
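The assign-side handling in the patch amounts to the same eat-and-log shape. A
rough standalone sketch (simplified names, not the committed patch):
{code}
import java.util.logging.Level;
import java.util.logging.Logger;

// Rough sketch with simplified names; not the committed patch.
class AssignFailureSketch {
  private static final Logger LOG = Logger.getLogger("AssignFailureSketch");

  enum State { OFFLINE, OPENING, OPEN }

  void handleFailure(State state, Exception exception) {
    if (state != State.OPENING) {
      // Unexpected state: this is the stale second failure. Eat and log it
      // rather than rescheduling an assign that already succeeded.
      LOG.log(Level.INFO, "Ignoring stale remote failure; state=" + state,
          exception);
      return;
    }
    // Expected state: the legitimate failure; reschedule the assign.
    reassign();
  }

  private void reassign() {
    LOG.info("Rescheduling assign");
  }
}
{code}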
While in here reading code, I noticed we set state to OFFLINE before we go to
the AM to ask it to undo the OPENING. Seems wrong. I changed the order, which
means some code path now triggers as it looks like it was supposed to, but I
think it could have side effects so this part is 2.1.0 only.
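For what the reorder changes, a toy sketch with stand-in names (the real code
goes through the AM and the region-state bookkeeping):
{code}
// Toy sketch with stand-in names; not the real AssignmentManager API.
class OrderingSketch {
  enum State { OFFLINE, OPENING, OPEN }

  static State state = State.OPENING;

  // Old order (looks wrong): state is cleared before the undo runs, so any
  // cleanup keyed on state == OPENING is silently skipped.
  static void oldOrder() {
    state = State.OFFLINE;
    undoOpening();
  }

  // New order (the 2.1.0-only change): undo first, while OPENING is still
  // visible, then go OFFLINE.
  static void newOrder() {
    undoOpening();
    state = State.OFFLINE;
  }

  static void undoOpening() {
    if (state == State.OPENING) {
      System.out.println("undoing OPENING bookkeeping");
    }
  }

  public static void main(String[] args) {
    oldOrder();             // prints nothing: cleanup path skipped
    state = State.OPENING;
    newOrder();             // prints "undoing OPENING bookkeeping"
  }
}
{code}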
> STUCK RIT though region successfully assigned
> ---------------------------------------------
>
> Key: HBASE-20796
> URL: https://issues.apache.org/jira/browse/HBASE-20796
> Project: HBase
> Issue Type: Bug
> Components: amv2
> Reporter: stack
> Priority: Major
> Fix For: 2.0.0
>
>
> This is a good one. We keep logging messages like this:
> {code}
> 2018-06-26 12:32:24,859 WARN
> org.apache.hadoop.hbase.master.assignment.AssignmentManager: STUCK
> Region-In-Transition rit=OPENING,
> location=vd0410.X.Y.com,22101,1529611445046,
> table=IntegrationTestBigLinkedList_20180525080406,
> region=e10b35d49528e2453a04c7038e3393d7
> {code}
> ...though the region is successfully assigned.
> Story:
> * Dispatch an assign: 2018-06-26 12:31:27,390 INFO
> org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure: Dispatch
> pid=370829, ppid=370391, state=RUNNABLE:REGION_TRANSITION_DISPATCH;
> AssignProcedure table=IntegrationTestBigLinkedList_20180612114844,
> region=f69ccf7d9178ce166b515e0e2ef019d2; rit=OPENING,
> location=vd0410.X.Y.Z,22101,1529611445046
> * It gets stuck: 2018-06-26 12:32:29,860 WARN
> org.apache.hadoop.hbase.master.assignment.AssignmentManager: STUCK
> Region-In-Transition rit=OPENING, location=vd0410.X.Y.Z,22101,1529611445046,
> table=IntegrationTestBigLinkedList_20180612114844,
> region=f69ccf7d9178ce166b515e0e2ef019d2 (Because the server was killed)
> * We stay STUCK for a while.
> * The Master notices that the server has crashed and starts an SCP.
> * SCP kills ongoing assign: 2018-06-26 12:32:54,809 INFO
> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure: pid=371105
> found RIT pid=370829, ppid=370391, state=RUNNABLE:REGION_TRANSITION_DISPATCH;
> AssignProcedure table=IntegrationTestBigLinkedList_20180612114844,
> region=f69ccf7d9178ce166b515e0e2ef019d2; rit=OPENING,
> location=vd0410.X.Y.Z,22101,1529611445046
> * The kill brings on a retry: 2018-06-26 12:32:54,810 WARN
> org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure: Remote
> call failed pid=370829, ppid=370391,
> state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure
> table=IntegrationTestBigLinkedList_20180612114844,
> region=f69ccf7d9178ce166b515e0e2ef019d2; rit=OPENING,
> location=vd0410.X.Y.Z,22101,1529611445046; exception=ServerCrashProcedure
> pid=371105, server=vd0410.X.Y.Z,22101,1529611445046
> * Which eventually succeeds; the region is successfully deployed to a new
> server:
> 2018-06-26 12:32:55,429 INFO
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Finished pid=370829,
> ppid=370391, state=SUCCESS; AssignProcedure
> table=IntegrationTestBigLinkedList_20180612114844,
> region=f69ccf7d9178ce166b515e0e2ef019d2 in 1mins, 35.379sec
> * But then, it looks like the RPC was still ongoing and it broke in the
> following way:
> 2018-06-26 12:33:06,378 WARN
> org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure: Remote
> call failed pid=370829, ppid=370391, state=SUCCESS; AssignProcedure
> table=IntegrationTestBigLinkedList_20180612114844,
> region=f69ccf7d9178ce166b515e0e2ef019d2; rit=OPEN,
> location=vc0614.halxg.cloudera.com,22101,1529611443424; exception=Call to
> vd0410.X.Y.Z/10.10.10.10:22101 failed on local exception:
> org.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException:
> syscall:read(..) failed: Connection reset by peer (Notice how the region
> state is OPEN and the procedure state is 'SUCCESS'.)
> * Then it says: 2018-06-26 12:33:06,380 INFO
> org.apache.hadoop.hbase.master.assignment.AssignProcedure: Retry=1 of max=10;
> pid=370829, ppid=370391, state=SUCCESS; AssignProcedure
> table=IntegrationTestBigLinkedList_20180612114844,
> region=f69ccf7d9178ce166b515e0e2ef019d2; rit=OPEN,
> location=vc0614.X.Y.Z,22101,1529611443424
> * And finally... 2018-06-26 12:34:10,727 WARN
> org.apache.hadoop.hbase.master.assignment.AssignmentManager: STUCK
> Region-In-Transition rit=OFFLINE, location=null,
> table=IntegrationTestBigLinkedList_20180612114844,
> region=f69ccf7d9178ce166b515e0e2ef019d2
> A restart of the Master got rid of the STUCK complaints.
> This is interesting because the stuck RPC and the successful reassign are
> both riding on the same pid.