stack created HBASE-20796:
-----------------------------
Summary: STUCK RIT though region successfully assigned
Key: HBASE-20796
URL: https://issues.apache.org/jira/browse/HBASE-20796
Project: HBase
Issue Type: Bug
Components: amv2
Reporter: stack
Fix For: 2.0.0
This is a good one. We keep logging messages like this:
{code}
2018-06-26 12:32:24,859 WARN
org.apache.hadoop.hbase.master.assignment.AssignmentManager: STUCK
Region-In-Transition rit=OPENING, location=vd0410.X.Y.com,22101,1529611445046,
table=IntegrationTestBigLinkedList_20180525080406,
region=e10b35d49528e2453a04c7038e3393d7
{code}
...though the region is successfully assigned.
Story:
* Dispatch an assign 2018-06-26 12:31:27,390 INFO
org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure: Dispatch
pid=370829, ppid=370391, state=RUNNABLE:REGION_TRANSITION_DISPATCH;
AssignProcedure table=IntegrationTestBigLinkedList_20180612114844,
region=f69ccf7d9178ce166b515e0e2ef019d2; rit=OPENING,
location=vd0410.X.Y.Z,22101,1529611445046
* It gets stuck 2018-06-26 12:32:29,860 WARN
org.apache.hadoop.hbase.master.assignment.AssignmentManager: STUCK
Region-In-Transition rit=OPENING, location=vd0410.X.Y.Z,22101,1529611445046,
table=IntegrationTestBigLinkedList_20180612114844,
region=f69ccf7d9178ce166b515e0e2ef019d2 (Because the server was killed)
* We stay STUCK for a while.
* The Master notices the server as crashed and starts a SCP.
* SCP kills ongoing assign: 2018-06-26 12:32:54,809 INFO
org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure: pid=371105 found
RIT pid=370829, ppid=370391, state=RUNNABLE:REGION_TRANSITION_DISPATCH;
AssignProcedure table=IntegrationTestBigLinkedList_20180612114844,
region=f69ccf7d9178ce166b515e0e2ef019d2; rit=OPENING,
location=vd0410.X.Y.Z,22101,1529611445046
* The kill brings on a retry ... 2018-06-26 12:32:54,810 WARN
org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure: Remote
call failed pid=370829, ppid=370391, state=RUNNABLE:REGION_TRANSITION_DISPATCH;
AssignProcedure table=IntegrationTestBigLinkedList_20180612114844,
region=f69ccf7d9178ce166b515e0e2ef019d2; rit=OPENING,
location=vd0410.X.Y.Z,22101,1529611445046; exception=ServerCrashProcedure
pid=371105, server=vd0410.X.Y.Z,22101,1529611445046
* Which eventually succeeds..... Successfully deployed to new server
2018-06-26 12:32:55,429 INFO
org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Finished pid=370829,
ppid=370391, state=SUCCESS; AssignProcedure
table=IntegrationTestBigLinkedList_20180612114844,
region=f69ccf7d9178ce166b515e0e2ef019d2 in 1mins, 35.379sec
* But then, it looks like the RPC was ongoing and it broke in following way
2018-06-26 12:33:06,378 WARN
org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure: Remote
call failed pid=370829, ppid=370391, state=SUCCESS; AssignProcedure
table=IntegrationTestBigLinkedList_20180612114844,
region=f69ccf7d9178ce166b515e0e2ef019d2; rit=OPEN,
location=vc0614.halxg.cloudera.com,22101,1529611443424; exception=Call to
vd0410.X.Y.Z/10.10.10.10:22101 failed on local exception:
org.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException:
syscall:read(..) failed: Connection reset by peer (Notice how state for region
is OPEN and 'SUCCESS').
* Then says 2018-06-26 12:33:06,380 INFO
org.apache.hadoop.hbase.master.assignment.AssignProcedure: Retry=1 of max=10;
pid=370829, ppid=370391, state=SUCCESS; AssignProcedure
table=IntegrationTestBigLinkedList_20180612114844,
region=f69ccf7d9178ce166b515e0e2ef019d2; rit=OPEN,
location=vc0614.X.Y.Z,22101,1529611443424
* And finally... 2018-06-26 12:34:10,727 WARN
org.apache.hadoop.hbase.master.assignment.AssignmentManager: STUCK
Region-In-Transition rit=OFFLINE, location=null,
table=IntegrationTestBigLinkedList_20180612114844,
region=f69ccf7d9178ce166b515e0e2ef019d2
Restart of Master got rid of the STUCK complaints.
This is interesting because the stuck rpc and the successful reassign are all
riding on the same pid.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)