[ https://issues.apache.org/jira/browse/HBASE-20796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Duo Zhang resolved HBASE-20796. ------------------------------- Assignee: (was: Michael Stack) Resolution: Cannot Reproduce We have bunch of fix around this area. Resolve as can not reproduce for now. Feel free to reopen if it is still a problem. Thanks. > STUCK RIT though region successfully assigned (hung RPC) > -------------------------------------------------------- > > Key: HBASE-20796 > URL: https://issues.apache.org/jira/browse/HBASE-20796 > Project: HBase > Issue Type: Bug > Components: amv2 > Reporter: Michael Stack > Priority: Major > Attachments: 0001-Test.patch, HBASE-20796.branch-2.0.001.patch > > > This is a good one. We keep logging messages like this: > {code} > 2018-06-26 12:32:24,859 WARN > org.apache.hadoop.hbase.master.assignment.AssignmentManager: STUCK > Region-In-Transition rit=OPENING, > location=vd0410.X.Y.com,22101,1529611445046, > table=IntegrationTestBigLinkedList_20180525080406, > region=e10b35d49528e2453a04c7038e3393d7 > {code} > ...though the region is successfully assigned. > Story: > * Dispatch an assign 2018-06-26 12:31:27,390 INFO > org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure: Dispatch > pid=370829, ppid=370391, state=RUNNABLE:REGION_TRANSITION_DISPATCH; > AssignProcedure table=IntegrationTestBigLinkedList_20180612114844, > region=f69ccf7d9178ce166b515e0e2ef019d2; rit=OPENING, > location=vd0410.X.Y.Z,22101,1529611445046 > * It gets stuck 2018-06-26 12:32:29,860 WARN > org.apache.hadoop.hbase.master.assignment.AssignmentManager: STUCK > Region-In-Transition rit=OPENING, location=vd0410.X.Y.Z,22101,1529611445046, > table=IntegrationTestBigLinkedList_20180612114844, > region=f69ccf7d9178ce166b515e0e2ef019d2 (Because the server was killed) > * We stay STUCK for a while. > * The Master notices the server as crashed and starts a SCP. > * SCP kills ongoing assign: 2018-06-26 12:32:54,809 INFO > org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure: pid=371105 > found RIT pid=370829, ppid=370391, state=RUNNABLE:REGION_TRANSITION_DISPATCH; > AssignProcedure table=IntegrationTestBigLinkedList_20180612114844, > region=f69ccf7d9178ce166b515e0e2ef019d2; rit=OPENING, > location=vd0410.X.Y.Z,22101,1529611445046 > * The kill brings on a retry ... 2018-06-26 12:32:54,810 WARN > org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure: Remote > call failed pid=370829, ppid=370391, > state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure > table=IntegrationTestBigLinkedList_20180612114844, > region=f69ccf7d9178ce166b515e0e2ef019d2; rit=OPENING, > location=vd0410.X.Y.Z,22101,1529611445046; exception=ServerCrashProcedure > pid=371105, server=vd0410.X.Y.Z,22101,1529611445046 > * Which eventually succeeds..... Successfully deployed to new server > 2018-06-26 12:32:55,429 INFO > org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Finished pid=370829, > ppid=370391, state=SUCCESS; AssignProcedure > table=IntegrationTestBigLinkedList_20180612114844, > region=f69ccf7d9178ce166b515e0e2ef019d2 in 1mins, 35.379sec > * But then, it looks like the RPC was ongoing and it broke in following way > 2018-06-26 12:33:06,378 WARN > org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure: Remote > call failed pid=370829, ppid=370391, state=SUCCESS; AssignProcedure > table=IntegrationTestBigLinkedList_20180612114844, > region=f69ccf7d9178ce166b515e0e2ef019d2; rit=OPEN, > location=vc0614.halxg.cloudera.com,22101,1529611443424; exception=Call to > vd0410.X.Y.Z/10.10.10.10:22101 failed on local exception: > org.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException: > syscall:read(..) failed: Connection reset by peer (Notice how state for > region is OPEN and 'SUCCESS'). > * Then says 2018-06-26 12:33:06,380 INFO > org.apache.hadoop.hbase.master.assignment.AssignProcedure: Retry=1 of max=10; > pid=370829, ppid=370391, state=SUCCESS; AssignProcedure > table=IntegrationTestBigLinkedList_20180612114844, > region=f69ccf7d9178ce166b515e0e2ef019d2; rit=OPEN, > location=vc0614.X.Y.Z,22101,1529611443424 > * And finally... 2018-06-26 12:34:10,727 WARN > org.apache.hadoop.hbase.master.assignment.AssignmentManager: STUCK > Region-In-Transition rit=OFFLINE, location=null, > table=IntegrationTestBigLinkedList_20180612114844, > region=f69ccf7d9178ce166b515e0e2ef019d2 > Restart of Master got rid of the STUCK complaints. > This is interesting because the stuck rpc and the successful reassign are all > riding on the same pid. -- This message was sent by Atlassian Jira (v8.20.10#820010)