[
https://issues.apache.org/jira/browse/HBASE-20796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16528413#comment-16528413
]
stack commented on HBASE-20796:
-------------------------------
Putting the patch aside for now.
* I see this failure on an internal cluster a bunch, but not in my usual runs.
Turns out that cluster has a three-minute rpc timeout where the default is one minute.
This makes it more likely that the clean-up of rpcs arrives
AFTER the procedure has been hijacked by ServerCrashProcedure, thus
manufacturing this case. Took a while to figure out the difference.
* Once I figured out how to recreate this state, I messed around on the cluster repro'ing.
This patch as-is is not enough. It runs but we still have the Procedure showing
as STUCK OPEN. The AM still has a RegionStateNode in OPEN state. The
RegionStateNode is OPEN after the SCP is finished but we need to exit this
current AP that failed its RPC. I messed with various variants but only
produced more intricacies: why is the RegionStateNode around if the SCP
successfully assigned? How did SCP get this AP's region lock?
* I've been working on a test but it's awkward reproducing this scenario.
Had to put this aside for now. Will be back.
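
To make the ordering problem concrete, here is a minimal sketch in plain Java. It is
a toy model, not HBase code and not the attached patch: the class LateRpcGuard, the
AssignAttempt type and its methods are all hypothetical names made up for
illustration. It only shows the idea of dropping a failure callback that belongs to
an rpc sent before the procedure was retargeted or finished. The real interplay
with the RegionStateNode and the region lock is more involved, as described above.

```java
import java.util.Objects;

/** Hypothetical toy model (NOT HBase code): filtering a late rpc failure. */
public class LateRpcGuard {
    enum State { DISPATCHED, FINISHED }

    /** Stand-in for one assign procedure; all names here are assumptions. */
    static final class AssignAttempt {
        private State state = State.DISPATCHED;
        private String targetServer; // server the current dispatch was sent to
        private int retries = 0;

        AssignAttempt(String targetServer) {
            this.targetServer = targetServer;
        }

        /** Models a crash handler taking over and moving the assign elsewhere. */
        synchronized void retarget(String newServer) {
            targetServer = newServer;
            state = State.DISPATCHED;
        }

        /** Models the procedure completing successfully. */
        synchronized void markFinished() {
            state = State.FINISHED;
        }

        /** Returns true if the failure was acted on, false if dropped as stale. */
        synchronized boolean remoteCallFailed(String failedServer) {
            if (state == State.FINISHED) {
                return false; // procedure already done; nothing to retry
            }
            if (!Objects.equals(failedServer, targetServer)) {
                return false; // failure is from an older dispatch to another server
            }
            retries++;
            return true;
        }

        synchronized int retries() {
            return retries;
        }
    }

    public static void main(String[] args) {
        AssignAttempt a = new AssignAttempt("vd0410"); // dispatch to the doomed server
        a.retarget("vc0614");                          // crash handling reassigns
        a.markFinished();                              // assign succeeds on new server
        boolean acted = a.remoteCallFailed("vd0410");  // old rpc fails late
        System.out.println("late failure acted on: " + acted
            + ", retries=" + a.retries());
    }
}
```

In this toy model the late failure from the old server is ignored, so no retry is
counted against a procedure that has already finished.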
> STUCK RIT though region successfully assigned
> ---------------------------------------------
>
> Key: HBASE-20796
> URL: https://issues.apache.org/jira/browse/HBASE-20796
> Project: HBase
> Issue Type: Bug
> Components: amv2
> Reporter: stack
> Assignee: stack
> Priority: Major
> Fix For: 3.0.0, 2.1.0, 2.0.2
>
> Attachments: HBASE-20796.branch-2.0.001.patch
>
>
> This is a good one. We keep logging messages like this:
> {code}
> 2018-06-26 12:32:24,859 WARN
> org.apache.hadoop.hbase.master.assignment.AssignmentManager: STUCK
> Region-In-Transition rit=OPENING,
> location=vd0410.X.Y.com,22101,1529611445046,
> table=IntegrationTestBigLinkedList_20180525080406,
> region=e10b35d49528e2453a04c7038e3393d7
> {code}
> ...though the region is successfully assigned.
> Story:
> * Dispatch an assign 2018-06-26 12:31:27,390 INFO
> org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure: Dispatch
> pid=370829, ppid=370391, state=RUNNABLE:REGION_TRANSITION_DISPATCH;
> AssignProcedure table=IntegrationTestBigLinkedList_20180612114844,
> region=f69ccf7d9178ce166b515e0e2ef019d2; rit=OPENING,
> location=vd0410.X.Y.Z,22101,1529611445046
> * It gets stuck 2018-06-26 12:32:29,860 WARN
> org.apache.hadoop.hbase.master.assignment.AssignmentManager: STUCK
> Region-In-Transition rit=OPENING, location=vd0410.X.Y.Z,22101,1529611445046,
> table=IntegrationTestBigLinkedList_20180612114844,
> region=f69ccf7d9178ce166b515e0e2ef019d2 (Because the server was killed)
> * We stay STUCK for a while.
> * The Master notices the server has crashed and starts a SCP.
> * SCP kills ongoing assign: 2018-06-26 12:32:54,809 INFO
> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure: pid=371105
> found RIT pid=370829, ppid=370391, state=RUNNABLE:REGION_TRANSITION_DISPATCH;
> AssignProcedure table=IntegrationTestBigLinkedList_20180612114844,
> region=f69ccf7d9178ce166b515e0e2ef019d2; rit=OPENING,
> location=vd0410.X.Y.Z,22101,1529611445046
> * The kill brings on a retry ... 2018-06-26 12:32:54,810 WARN
> org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure: Remote
> call failed pid=370829, ppid=370391,
> state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure
> table=IntegrationTestBigLinkedList_20180612114844,
> region=f69ccf7d9178ce166b515e0e2ef019d2; rit=OPENING,
> location=vd0410.X.Y.Z,22101,1529611445046; exception=ServerCrashProcedure
> pid=371105, server=vd0410.X.Y.Z,22101,1529611445046
> * Which eventually succeeds..... Successfully deployed to new server
> 2018-06-26 12:32:55,429 INFO
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Finished pid=370829,
> ppid=370391, state=SUCCESS; AssignProcedure
> table=IntegrationTestBigLinkedList_20180612114844,
> region=f69ccf7d9178ce166b515e0e2ef019d2 in 1mins, 35.379sec
> * But then, it looks like the RPC was ongoing and it broke in the following way
> 2018-06-26 12:33:06,378 WARN
> org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure: Remote
> call failed pid=370829, ppid=370391, state=SUCCESS; AssignProcedure
> table=IntegrationTestBigLinkedList_20180612114844,
> region=f69ccf7d9178ce166b515e0e2ef019d2; rit=OPEN,
> location=vc0614.halxg.cloudera.com,22101,1529611443424; exception=Call to
> vd0410.X.Y.Z/10.10.10.10:22101 failed on local exception:
> org.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException:
> syscall:read(..) failed: Connection reset by peer (Notice how state for
> region is OPEN and 'SUCCESS').
> * Then says 2018-06-26 12:33:06,380 INFO
> org.apache.hadoop.hbase.master.assignment.AssignProcedure: Retry=1 of max=10;
> pid=370829, ppid=370391, state=SUCCESS; AssignProcedure
> table=IntegrationTestBigLinkedList_20180612114844,
> region=f69ccf7d9178ce166b515e0e2ef019d2; rit=OPEN,
> location=vc0614.X.Y.Z,22101,1529611443424
> * And finally... 2018-06-26 12:34:10,727 WARN
> org.apache.hadoop.hbase.master.assignment.AssignmentManager: STUCK
> Region-In-Transition rit=OFFLINE, location=null,
> table=IntegrationTestBigLinkedList_20180612114844,
> region=f69ccf7d9178ce166b515e0e2ef019d2
> Restart of Master got rid of the STUCK complaints.
> This is interesting because the stuck rpc and the successful reassign are all
> riding on the same pid.