[ 
https://issues.apache.org/jira/browse/HBASE-20552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated HBASE-20552:
---------------------------
       Priority: Critical  (was: Major)
    Description: 
This was observed during cluster testing (source code sync'ed with hbase-2.0, 
built May 2nd):
{code}
2018-05-02 05:44:10,089 ERROR 
[RpcServer.default.FPBQ.Fifo.handler=28,queue=1,port=20000] 
master.MasterRpcServices: Region server 
ctr-e138-1518143905142-279227-01-000002.hwx.site,16020,1525239334474 reported a 
fatal error:
***** ABORTING region server 
ctr-e138-1518143905142-279227-01-000002.hwx.site,16020,1525239334474: 
org.apache.hadoop.hbase.YouAreDeadException: rit=OPEN, location=ctr-e138-       
      1518143905142-279227-01-000007.hwx.site,16020,1525239609353, 
table=test_hbase_ha_load_test_tool_hbase, 
region=94f6ca283dbb4445b2bcdc321b734d28reported OPEN on server=ctr-e138-        
  1518143905142-279227-01-000002.hwx.site,16020,1525239334474 but state has 
otherwise.
  at 
org.apache.hadoop.hbase.master.assignment.AssignmentManager.checkOnlineRegionsReport(AssignmentManager.java:1065)
  at 
org.apache.hadoop.hbase.master.assignment.AssignmentManager.reportOnlineRegions(AssignmentManager.java:987)
  at 
org.apache.hadoop.hbase.master.MasterRpcServices.regionServerReport(MasterRpcServices.java:459)
  at 
org.apache.hadoop.hbase.shaded.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:13118)
  at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:409)
  at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:131)
  at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:324)
  at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:304)
Caused by: org.apache.hadoop.hbase.exceptions.UnexpectedStateException: 
rit=OPEN, 
location=ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525239609353,  
                       table=test_hbase_ha_load_test_tool_hbase, 
region=94f6ca283dbb4445b2bcdc321b734d28reported OPEN on 
server=ctr-e138-1518143905142-279227-01-000002.hwx.site,16020,1525239334474 but 
state  has otherwise.
  at 
org.apache.hadoop.hbase.master.assignment.AssignmentManager.checkOnlineRegionsReport(AssignmentManager.java:1037)
  ... 7 more
 *****
Cause:
org.apache.hadoop.hbase.YouAreDeadException: 
org.apache.hadoop.hbase.YouAreDeadException: rit=OPEN, 
location=ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525239609353,  
     table=test_hbase_ha_load_test_tool_hbase, 
region=94f6ca283dbb4445b2bcdc321b734d28reported OPEN on 
server=ctr-e138-1518143905142-279227-01-000002.hwx.site,16020,1525239334474 but 
state  has otherwise.
  at 
org.apache.hadoop.hbase.master.assignment.AssignmentManager.checkOnlineRegionsReport(AssignmentManager.java:1065)
  at 
org.apache.hadoop.hbase.master.assignment.AssignmentManager.reportOnlineRegions(AssignmentManager.java:987)
  at 
org.apache.hadoop.hbase.master.MasterRpcServices.regionServerReport(MasterRpcServices.java:459)
  at 
org.apache.hadoop.hbase.shaded.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:13118)
  at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:409)
  at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:131)
  at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:324)
  at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:304)
Caused by: org.apache.hadoop.hbase.exceptions.UnexpectedStateException: 
rit=OPEN, 
location=ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525239609353,  
                       table=test_hbase_ha_load_test_tool_hbase, 
region=94f6ca283dbb4445b2bcdc321b734d28reported OPEN on 
server=ctr-e138-1518143905142-279227-01-000002.hwx.site,16020,1525239334474 but 
state  has otherwise.
  at 
org.apache.hadoop.hbase.master.assignment.AssignmentManager.checkOnlineRegionsReport(AssignmentManager.java:1037)
  ... 7 more

  at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
  at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
  at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
{code}
[~elserj] and I (Ted) did some initial analysis.

In the following description, M1 refers to 
master-ctr-e138-1518143905142-279227-01-000005 and M2 refers to 
master-ctr-e138-1518143905142-279227-01-000003.

Let's follow region 94f6ca283dbb4445b2bcdc321b734d28 .

Master 1 was moving the region to 000007:
{code}
2018-05-02 05:38:59,017 INFO  
[master/ctr-e138-1518143905142-279227-01-000005:20000.Chore.1] master.HMaster: 
balance hri=94f6ca283dbb4445b2bcdc321b734d28, 
source=ctr-e138-1518143905142-279227-01-000002.hwx.site,16020,1525239334474, 
destination=ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525238558502
2018-05-02 05:38:59,126 DEBUG 
[master/ctr-e138-1518143905142-279227-01-000005:20000.Chore.1] 
procedure2.ProcedureExecutor: Stored pid=465, 
state=RUNNABLE:MOVE_REGION_UNASSIGN; MoveRegionProcedure 
hri=94f6ca283dbb4445b2bcdc321b734d28, 
source=ctr-e138-1518143905142-279227-01-000002.hwx.site,16020,1525239334474, 
destination=ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525238558502
{code}
Region was closed on 000002:
{code}
2018-05-02 05:38:59,327 INFO  
[RS_CLOSE_REGION-regionserver/ctr-e138-1518143905142-279227-01-000002:16020-1] 
regionserver.HRegion: Closed 
test_hbase_ha_load_test_tool_hbase,1eb851e9,1525238855422.94f6ca283dbb4445b2bcdc321b734d28.
2018-05-02 05:38:59,327 INFO  
[RS_CLOSE_REGION-regionserver/ctr-e138-1518143905142-279227-01-000002:16020-1] 
regionserver.HRegionServer: Adding 94f6ca283dbb4445b2bcdc321b734d28 move to 
ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525238558502 record at 
close sequenceid=13395
2018-05-02 05:38:59,329 DEBUG 
[RS_CLOSE_REGION-regionserver/ctr-e138-1518143905142-279227-01-000002:16020-1] 
handler.CloseRegionHandler: Closed 
test_hbase_ha_load_test_tool_hbase,1eb851e9,1525238855422.94f6ca283dbb4445b2bcdc321b734d28.
{code}

The master saw the destination crash:
{code}
2018-05-02 05:39:44,792 INFO  [main-EventThread] master.RegionServerTracker: 
RegionServer ephemeral node deleted, processing expiration 
[ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525238558502]
2018-05-02 05:39:44,793 INFO  [main-EventThread] master.ServerManager: 
Processing expiration of 
ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525238558502 on 
ctr-e138-1518143905142-279227-01-000005.hwx.site,20000,1525238791399
{code}

Processed it:
{code}
2018-05-02 05:39:45,616 DEBUG [PEWorker-16] procedure.ServerCrashProcedure: 
Done splitting WALs pid=471, state=RUNNABLE:SERVER_CRASH_SPLIT_LOGS; 
ServerCrashProcedure 
server=ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525238558502, 
splitWal=true, meta=true
2018-05-02 05:39:45,620 INFO  [PEWorker-16] procedure2.ProcedureExecutor: 
Initialized subprocedures=[{pid=474, ppid=471, 
state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
table=test_hbase_ha_load_test_tool_hbase, 
region=670f6b815d2acac905130e5440d59304}, {pid=475, ppid=471, 
state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
table=test_hbase_ha_load_test_tool_hbase, 
region=94f6ca283dbb4445b2bcdc321b734d28}, {pid=476, ppid=471, 
state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
table=test_hbase_ha_load_test_tool_hbase, 
region=1595f38ee901be7c67b997fe2fc95951}, {pid=477, ppid=471, 
state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
table=test_hbase_ha_load_test_tool_hbase, 
region=a0620fc83de532a37f6a9bb8f99cc6c4}, {pid=478, ppid=471, 
state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
table=test_hbase_ha_load_test_tool_hbase, 
region=e75a388bc2011feed75bdc1a0e99a9a9}, {pid=479, ppid=471, 
state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
table=test_hbase_ha_load_test_tool_namenode, 
region=c963eb77dbdc6dbab886dbe4eebba5ad}, {pid=480, ppid=471, 
state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
table=test_hbase_ha_load_test_tool_namenode, 
region=f3db9f9879ed03f488dcb89bea834237}, {pid=481, ppid=471, 
state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
table=test_hbase_ha_load_test_tool_namenode, 
region=b5180eee96b616afdf79578309c66a11}, {pid=482, ppid=471, 
state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
table=test_hbase_ha_load_test_tool_namenode, 
region=a6e0d7561c4f19e78f94d37462588281}, {pid=483, ppid=471, 
state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
table=test_hbase_ha_load_test_tool_namenode, 
region=1d954f21d711345a9587d995cecea136}, {pid=484, ppid=471, 
state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
table=test_hbase_ha_load_test_tool_namenode, 
region=c078deb2474e9c19b85b5fdb9efaa47d}, {pid=485, ppid=471, 
state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
table=test_hbase_ha_load_test_tool_namenode, 
region=91f73e76bbe7bc8a61b1b1299d34c6ab}, {pid=486, ppid=471, 
state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
table=ATLAS_ENTITY_AUDIT_EVENTS, region=8dc6fd2022c2fdf8c065fbd16cadaaca}]
{code}

Master 1 tried to assign the region back on the original RS (that we tried to 
move it off of):
{code}
2018-05-02 05:39:46,136 INFO  [PEWorker-1] 
assignment.RegionTransitionProcedure: Dispatch pid=475, ppid=471, 
state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure                  
table=test_hbase_ha_load_test_tool_hbase, 
region=94f6ca283dbb4445b2bcdc321b734d28; rit=OPENING, 
location=ctr-e138-1518143905142-279227-01-000002.hwx.site,16020,1525239334474
{code}

Then Master 1 went down, Master 2 picked up the region.

M2 ran a MoveProcedure where it thought region was on 007:
{code}
2018-05-02 05:43:33,876 DEBUG 
[master/ctr-e138-1518143905142-279227-01-000003:20000] 
procedure2.ProcedureExecutor: Completed pid=465, state=SUCCESS; 
MoveRegionProcedure hri=94f6ca283dbb4445b2bcdc321b734d28, 
source=ctr-e138-1518143905142-279227-01-000002.hwx.site,16020,1525239334474, 
destination=ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525238558502
{code}
and assigned region to 007:
{code}
2018-05-02 05:44:08,464 INFO  [PEWorker-6] procedure2.ProcedureExecutor: 
Initialized subprocedures=[{pid=508, ppid=507, 
state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
table=test_hbase_ha_load_test_tool_hbase, 
region=94f6ca283dbb4445b2bcdc321b734d28}]
2018-05-02 05:44:08,472 INFO  [PEWorker-6] procedure.MasterProcedureScheduler: 
pid=508, ppid=507, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
table=test_hbase_ha_load_test_tool_hbase, 
region=94f6ca283dbb4445b2bcdc321b734d28 checking lock on 
94f6ca283dbb4445b2bcdc321b734d28
2018-05-02 05:44:08,480 INFO  [PEWorker-6] assignment.AssignProcedure: Starting 
pid=508, ppid=507, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
table=test_hbase_ha_load_test_tool_hbase, 
region=94f6ca283dbb4445b2bcdc321b734d28; rit=OFFLINE, 
location=ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525238558502; 
forceNewPlan=false, retain=true
2018-05-02 05:44:08,646 INFO  
[master/ctr-e138-1518143905142-279227-01-000003:20000] 
balancer.BaseLoadBalancer: Reassigned 1 regions. 1 retained the pre-restart 
assignment. 
2018-05-02 05:44:08,659 INFO  [PEWorker-11] assignment.RegionStateStore: 
pid=508 updating hbase:meta row=94f6ca283dbb4445b2bcdc321b734d28, 
regionState=OPENING, 
regionLocation=ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525239609353
2018-05-02 05:44:08,727 INFO  [PEWorker-11] 
assignment.RegionTransitionProcedure: Dispatch pid=508, ppid=507, 
state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure 
table=test_hbase_ha_load_test_tool_hbase, 
region=94f6ca283dbb4445b2bcdc321b734d28; rit=OPENING, 
location=ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525239609353
2018-05-02 05:44:08,768 INFO  
[master/ctr-e138-1518143905142-279227-01-000003:20000] master.HMaster: Master 
has completed initialization 41.430sec
2018-05-02 05:44:09,213 DEBUG 
[RpcServer.default.FPBQ.Fifo.handler=28,queue=1,port=20000] 
assignment.RegionTransitionProcedure: Received report OPENED seqId=13402, 
pid=508, ppid=507, state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure 
table=test_hbase_ha_load_test_tool_hbase, 
region=94f6ca283dbb4445b2bcdc321b734d28; rit=OPENING, 
location=ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525239609353
2018-05-02 05:44:09,213 DEBUG [PEWorker-12] 
assignment.RegionTransitionProcedure: Finishing pid=508, ppid=507, 
state=RUNNABLE:REGION_TRANSITION_FINISH; AssignProcedure 
table=test_hbase_ha_load_test_tool_hbase, 
region=94f6ca283dbb4445b2bcdc321b734d28; rit=OPENING, 
location=ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525239609353
2018-05-02 05:44:09,214 INFO  [PEWorker-12] assignment.RegionStateStore: 
pid=508 updating hbase:meta row=94f6ca283dbb4445b2bcdc321b734d28, 
regionState=OPEN, openSeqNum=13402, 
regionLocation=ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525239609353
2018-05-02 05:44:09,258 INFO  [PEWorker-12] procedure2.ProcedureExecutor: 
Finished subprocedure(s) of pid=507, state=RUNNABLE:SERVER_CRASH_HANDLE_RIT2; 
ServerCrashProcedure 
server=ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525238558502, 
splitWal=true, meta=false; resume parent processing.
2018-05-02 05:44:09,258 INFO  [PEWorker-12] procedure2.ProcedureExecutor: 
Finished pid=508, ppid=507, state=SUCCESS; AssignProcedure 
table=test_hbase_ha_load_test_tool_hbase, 
region=94f6ca283dbb4445b2bcdc321b734d28 in 764msec
2018-05-02 05:44:09,273 INFO  [PEWorker-14] procedure2.ProcedureExecutor: 
Finished pid=507, state=SUCCESS; ServerCrashProcedure 
server=ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525238558502, 
splitWal=true, meta=false in 975msec
{code}

So when 000002 reported region to master 2, we got the exception.

  was:
This was observed during cluster testing (source code sync'ed with hbase-2.0, 
built May 2nd):
{code}
2018-05-02 05:44:10,089 ERROR 
[RpcServer.default.FPBQ.Fifo.handler=28,queue=1,port=20000] 
master.MasterRpcServices: Region server 
ctr-e138-1518143905142-279227-01-000002.hwx.site,16020,1525239334474 reported a 
fatal error:
***** ABORTING region server 
ctr-e138-1518143905142-279227-01-000002.hwx.site,16020,1525239334474: 
org.apache.hadoop.hbase.YouAreDeadException: rit=OPEN, location=ctr-e138-       
      1518143905142-279227-01-000007.hwx.site,16020,1525239609353, 
table=test_hbase_ha_load_test_tool_hbase, 
region=94f6ca283dbb4445b2bcdc321b734d28reported OPEN on server=ctr-e138-        
  1518143905142-279227-01-000002.hwx.site,16020,1525239334474 but state has 
otherwise.
  at 
org.apache.hadoop.hbase.master.assignment.AssignmentManager.checkOnlineRegionsReport(AssignmentManager.java:1065)
  at 
org.apache.hadoop.hbase.master.assignment.AssignmentManager.reportOnlineRegions(AssignmentManager.java:987)
  at 
org.apache.hadoop.hbase.master.MasterRpcServices.regionServerReport(MasterRpcServices.java:459)
  at 
org.apache.hadoop.hbase.shaded.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:13118)
  at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:409)
  at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:131)
  at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:324)
  at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:304)
Caused by: org.apache.hadoop.hbase.exceptions.UnexpectedStateException: 
rit=OPEN, 
location=ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525239609353,  
                       table=test_hbase_ha_load_test_tool_hbase, 
region=94f6ca283dbb4445b2bcdc321b734d28reported OPEN on 
server=ctr-e138-1518143905142-279227-01-000002.hwx.site,16020,1525239334474 but 
state  has otherwise.
  at 
org.apache.hadoop.hbase.master.assignment.AssignmentManager.checkOnlineRegionsReport(AssignmentManager.java:1037)
  ... 7 more
 *****
Cause:
org.apache.hadoop.hbase.YouAreDeadException: 
org.apache.hadoop.hbase.YouAreDeadException: rit=OPEN, 
location=ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525239609353,  
     table=test_hbase_ha_load_test_tool_hbase, 
region=94f6ca283dbb4445b2bcdc321b734d28reported OPEN on 
server=ctr-e138-1518143905142-279227-01-000002.hwx.site,16020,1525239334474 but 
state  has otherwise.
  at 
org.apache.hadoop.hbase.master.assignment.AssignmentManager.checkOnlineRegionsReport(AssignmentManager.java:1065)
  at 
org.apache.hadoop.hbase.master.assignment.AssignmentManager.reportOnlineRegions(AssignmentManager.java:987)
  at 
org.apache.hadoop.hbase.master.MasterRpcServices.regionServerReport(MasterRpcServices.java:459)
  at 
org.apache.hadoop.hbase.shaded.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:13118)
  at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:409)
  at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:131)
  at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:324)
  at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:304)
Caused by: org.apache.hadoop.hbase.exceptions.UnexpectedStateException: 
rit=OPEN, 
location=ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525239609353,  
                       table=test_hbase_ha_load_test_tool_hbase, 
region=94f6ca283dbb4445b2bcdc321b734d28reported OPEN on 
server=ctr-e138-1518143905142-279227-01-000002.hwx.site,16020,1525239334474 but 
state  has otherwise.
  at 
org.apache.hadoop.hbase.master.assignment.AssignmentManager.checkOnlineRegionsReport(AssignmentManager.java:1037)
  ... 7 more

  at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
  at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
  at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
{code}
[~elserj] and I did some initial analysis.

In the following description, M1 refers to 
master-ctr-e138-1518143905142-279227-01-000005 and M2 refers to 
master-ctr-e138-1518143905142-279227-01-000003.

Let's follow region 94f6ca283dbb4445b2bcdc321b734d28 .

Master 1 was moving the region to 000007:
{code}
2018-05-02 05:38:59,017 INFO  
[master/ctr-e138-1518143905142-279227-01-000005:20000.Chore.1] master.HMaster: 
balance hri=94f6ca283dbb4445b2bcdc321b734d28, 
source=ctr-e138-1518143905142-279227-01-000002.hwx.site,16020,1525239334474, 
destination=ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525238558502
2018-05-02 05:38:59,126 DEBUG 
[master/ctr-e138-1518143905142-279227-01-000005:20000.Chore.1] 
procedure2.ProcedureExecutor: Stored pid=465, 
state=RUNNABLE:MOVE_REGION_UNASSIGN; MoveRegionProcedure 
hri=94f6ca283dbb4445b2bcdc321b734d28, 
source=ctr-e138-1518143905142-279227-01-000002.hwx.site,16020,1525239334474, 
destination=ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525238558502
{code}
Region was closed on 000002:
{code}
2018-05-02 05:38:59,327 INFO  
[RS_CLOSE_REGION-regionserver/ctr-e138-1518143905142-279227-01-000002:16020-1] 
regionserver.HRegion: Closed 
test_hbase_ha_load_test_tool_hbase,1eb851e9,1525238855422.94f6ca283dbb4445b2bcdc321b734d28.
2018-05-02 05:38:59,327 INFO  
[RS_CLOSE_REGION-regionserver/ctr-e138-1518143905142-279227-01-000002:16020-1] 
regionserver.HRegionServer: Adding 94f6ca283dbb4445b2bcdc321b734d28 move to 
ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525238558502 record at 
close sequenceid=13395
2018-05-02 05:38:59,329 DEBUG 
[RS_CLOSE_REGION-regionserver/ctr-e138-1518143905142-279227-01-000002:16020-1] 
handler.CloseRegionHandler: Closed 
test_hbase_ha_load_test_tool_hbase,1eb851e9,1525238855422.94f6ca283dbb4445b2bcdc321b734d28.
{code}

The master saw the destination crash:
{code}
2018-05-02 05:39:44,792 INFO  [main-EventThread] master.RegionServerTracker: 
RegionServer ephemeral node deleted, processing expiration 
[ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525238558502]
2018-05-02 05:39:44,793 INFO  [main-EventThread] master.ServerManager: 
Processing expiration of 
ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525238558502 on 
ctr-e138-1518143905142-279227-01-000005.hwx.site,20000,1525238791399
{code}

Processed it:
{code}
2018-05-02 05:39:45,616 DEBUG [PEWorker-16] procedure.ServerCrashProcedure: 
Done splitting WALs pid=471, state=RUNNABLE:SERVER_CRASH_SPLIT_LOGS; 
ServerCrashProcedure 
server=ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525238558502, 
splitWal=true, meta=true
2018-05-02 05:39:45,620 INFO  [PEWorker-16] procedure2.ProcedureExecutor: 
Initialized subprocedures=[{pid=474, ppid=471, 
state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
table=test_hbase_ha_load_test_tool_hbase, 
region=670f6b815d2acac905130e5440d59304}, {pid=475, ppid=471, 
state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
table=test_hbase_ha_load_test_tool_hbase, 
region=94f6ca283dbb4445b2bcdc321b734d28}, {pid=476, ppid=471, 
state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
table=test_hbase_ha_load_test_tool_hbase, 
region=1595f38ee901be7c67b997fe2fc95951}, {pid=477, ppid=471, 
state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
table=test_hbase_ha_load_test_tool_hbase, 
region=a0620fc83de532a37f6a9bb8f99cc6c4}, {pid=478, ppid=471, 
state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
table=test_hbase_ha_load_test_tool_hbase, 
region=e75a388bc2011feed75bdc1a0e99a9a9}, {pid=479, ppid=471, 
state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
table=test_hbase_ha_load_test_tool_namenode, 
region=c963eb77dbdc6dbab886dbe4eebba5ad}, {pid=480, ppid=471, 
state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
table=test_hbase_ha_load_test_tool_namenode, 
region=f3db9f9879ed03f488dcb89bea834237}, {pid=481, ppid=471, 
state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
table=test_hbase_ha_load_test_tool_namenode, 
region=b5180eee96b616afdf79578309c66a11}, {pid=482, ppid=471, 
state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
table=test_hbase_ha_load_test_tool_namenode, 
region=a6e0d7561c4f19e78f94d37462588281}, {pid=483, ppid=471, 
state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
table=test_hbase_ha_load_test_tool_namenode, 
region=1d954f21d711345a9587d995cecea136}, {pid=484, ppid=471, 
state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
table=test_hbase_ha_load_test_tool_namenode, 
region=c078deb2474e9c19b85b5fdb9efaa47d}, {pid=485, ppid=471, 
state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
table=test_hbase_ha_load_test_tool_namenode, 
region=91f73e76bbe7bc8a61b1b1299d34c6ab}, {pid=486, ppid=471, 
state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
table=ATLAS_ENTITY_AUDIT_EVENTS, region=8dc6fd2022c2fdf8c065fbd16cadaaca}]
{code}

Master 1 tried to assign the region back on the original RS (that we tried to 
move it off of):
{code}
2018-05-02 05:39:46,136 INFO  [PEWorker-1] 
assignment.RegionTransitionProcedure: Dispatch pid=475, ppid=471, 
state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure                  
table=test_hbase_ha_load_test_tool_hbase, 
region=94f6ca283dbb4445b2bcdc321b734d28; rit=OPENING, 
location=ctr-e138-1518143905142-279227-01-000002.hwx.site,16020,1525239334474
{code}

Then Master 1 went down, Master 2 picked up the region.

M2 ran a MoveProcedure where it thought region was on 007:
{code}
2018-05-02 05:43:33,876 DEBUG 
[master/ctr-e138-1518143905142-279227-01-000003:20000] 
procedure2.ProcedureExecutor: Completed pid=465, state=SUCCESS; 
MoveRegionProcedure hri=94f6ca283dbb4445b2bcdc321b734d28, 
source=ctr-e138-1518143905142-279227-01-000002.hwx.site,16020,1525239334474, 
destination=ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525238558502
{code}
and assigned region to 007:
{code}
2018-05-02 05:44:08,464 INFO  [PEWorker-6] procedure2.ProcedureExecutor: 
Initialized subprocedures=[{pid=508, ppid=507, 
state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
table=test_hbase_ha_load_test_tool_hbase, 
region=94f6ca283dbb4445b2bcdc321b734d28}]
2018-05-02 05:44:08,472 INFO  [PEWorker-6] procedure.MasterProcedureScheduler: 
pid=508, ppid=507, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
table=test_hbase_ha_load_test_tool_hbase, 
region=94f6ca283dbb4445b2bcdc321b734d28 checking lock on 
94f6ca283dbb4445b2bcdc321b734d28
2018-05-02 05:44:08,480 INFO  [PEWorker-6] assignment.AssignProcedure: Starting 
pid=508, ppid=507, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
table=test_hbase_ha_load_test_tool_hbase, 
region=94f6ca283dbb4445b2bcdc321b734d28; rit=OFFLINE, 
location=ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525238558502; 
forceNewPlan=false, retain=true
2018-05-02 05:44:08,646 INFO  
[master/ctr-e138-1518143905142-279227-01-000003:20000] 
balancer.BaseLoadBalancer: Reassigned 1 regions. 1 retained the pre-restart 
assignment. 
2018-05-02 05:44:08,659 INFO  [PEWorker-11] assignment.RegionStateStore: 
pid=508 updating hbase:meta row=94f6ca283dbb4445b2bcdc321b734d28, 
regionState=OPENING, 
regionLocation=ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525239609353
2018-05-02 05:44:08,727 INFO  [PEWorker-11] 
assignment.RegionTransitionProcedure: Dispatch pid=508, ppid=507, 
state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure 
table=test_hbase_ha_load_test_tool_hbase, 
region=94f6ca283dbb4445b2bcdc321b734d28; rit=OPENING, 
location=ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525239609353
2018-05-02 05:44:08,768 INFO  
[master/ctr-e138-1518143905142-279227-01-000003:20000] master.HMaster: Master 
has completed initialization 41.430sec
2018-05-02 05:44:09,213 DEBUG 
[RpcServer.default.FPBQ.Fifo.handler=28,queue=1,port=20000] 
assignment.RegionTransitionProcedure: Received report OPENED seqId=13402, 
pid=508, ppid=507, state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure 
table=test_hbase_ha_load_test_tool_hbase, 
region=94f6ca283dbb4445b2bcdc321b734d28; rit=OPENING, 
location=ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525239609353
2018-05-02 05:44:09,213 DEBUG [PEWorker-12] 
assignment.RegionTransitionProcedure: Finishing pid=508, ppid=507, 
state=RUNNABLE:REGION_TRANSITION_FINISH; AssignProcedure 
table=test_hbase_ha_load_test_tool_hbase, 
region=94f6ca283dbb4445b2bcdc321b734d28; rit=OPENING, 
location=ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525239609353
2018-05-02 05:44:09,214 INFO  [PEWorker-12] assignment.RegionStateStore: 
pid=508 updating hbase:meta row=94f6ca283dbb4445b2bcdc321b734d28, 
regionState=OPEN, openSeqNum=13402, 
regionLocation=ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525239609353
2018-05-02 05:44:09,258 INFO  [PEWorker-12] procedure2.ProcedureExecutor: 
Finished subprocedure(s) of pid=507, state=RUNNABLE:SERVER_CRASH_HANDLE_RIT2; 
ServerCrashProcedure 
server=ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525238558502, 
splitWal=true, meta=false; resume parent processing.
2018-05-02 05:44:09,258 INFO  [PEWorker-12] procedure2.ProcedureExecutor: 
Finished pid=508, ppid=507, state=SUCCESS; AssignProcedure 
table=test_hbase_ha_load_test_tool_hbase, 
region=94f6ca283dbb4445b2bcdc321b734d28 in 764msec
2018-05-02 05:44:09,273 INFO  [PEWorker-14] procedure2.ProcedureExecutor: 
Finished pid=507, state=SUCCESS; ServerCrashProcedure 
server=ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525238558502, 
splitWal=true, meta=false in 975msec
{code}

So when 000002 reported region to master 2, we got the exception.


> HBase RegionServer was shutdown due to UnexpectedStateException
> ---------------------------------------------------------------
>
>                 Key: HBASE-20552
>                 URL: https://issues.apache.org/jira/browse/HBASE-20552
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 2.0.0
>            Reporter: Romil Choksi
>            Priority: Critical
>
> This was observed during cluster testing (source code sync'ed with hbase-2.0, 
> built May 2nd):
> {code}
> 2018-05-02 05:44:10,089 ERROR 
> [RpcServer.default.FPBQ.Fifo.handler=28,queue=1,port=20000] 
> master.MasterRpcServices: Region server 
> ctr-e138-1518143905142-279227-01-000002.hwx.site,16020,1525239334474 reported 
> a fatal error:
> ***** ABORTING region server 
> ctr-e138-1518143905142-279227-01-000002.hwx.site,16020,1525239334474: 
> org.apache.hadoop.hbase.YouAreDeadException: rit=OPEN, location=ctr-e138-     
>         1518143905142-279227-01-000007.hwx.site,16020,1525239609353, 
> table=test_hbase_ha_load_test_tool_hbase, 
> region=94f6ca283dbb4445b2bcdc321b734d28reported OPEN on server=ctr-e138-      
>     1518143905142-279227-01-000002.hwx.site,16020,1525239334474 but state has 
> otherwise.
>   at 
> org.apache.hadoop.hbase.master.assignment.AssignmentManager.checkOnlineRegionsReport(AssignmentManager.java:1065)
>   at 
> org.apache.hadoop.hbase.master.assignment.AssignmentManager.reportOnlineRegions(AssignmentManager.java:987)
>   at 
> org.apache.hadoop.hbase.master.MasterRpcServices.regionServerReport(MasterRpcServices.java:459)
>   at 
> org.apache.hadoop.hbase.shaded.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:13118)
>   at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:409)
>   at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:131)
>   at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:324)
>   at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:304)
> Caused by: org.apache.hadoop.hbase.exceptions.UnexpectedStateException: 
> rit=OPEN, 
> location=ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525239609353,
>                          table=test_hbase_ha_load_test_tool_hbase, 
> region=94f6ca283dbb4445b2bcdc321b734d28reported OPEN on 
> server=ctr-e138-1518143905142-279227-01-000002.hwx.site,16020,1525239334474 
> but state  has otherwise.
>   at 
> org.apache.hadoop.hbase.master.assignment.AssignmentManager.checkOnlineRegionsReport(AssignmentManager.java:1037)
>   ... 7 more
>  *****
> Cause:
> org.apache.hadoop.hbase.YouAreDeadException: 
> org.apache.hadoop.hbase.YouAreDeadException: rit=OPEN, 
> location=ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525239609353,
>        table=test_hbase_ha_load_test_tool_hbase, 
> region=94f6ca283dbb4445b2bcdc321b734d28reported OPEN on 
> server=ctr-e138-1518143905142-279227-01-000002.hwx.site,16020,1525239334474 
> but state  has otherwise.
>   at 
> org.apache.hadoop.hbase.master.assignment.AssignmentManager.checkOnlineRegionsReport(AssignmentManager.java:1065)
>   at 
> org.apache.hadoop.hbase.master.assignment.AssignmentManager.reportOnlineRegions(AssignmentManager.java:987)
>   at 
> org.apache.hadoop.hbase.master.MasterRpcServices.regionServerReport(MasterRpcServices.java:459)
>   at 
> org.apache.hadoop.hbase.shaded.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:13118)
>   at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:409)
>   at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:131)
>   at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:324)
>   at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:304)
> Caused by: org.apache.hadoop.hbase.exceptions.UnexpectedStateException: 
> rit=OPEN, 
> location=ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525239609353,
>                          table=test_hbase_ha_load_test_tool_hbase, 
> region=94f6ca283dbb4445b2bcdc321b734d28reported OPEN on 
> server=ctr-e138-1518143905142-279227-01-000002.hwx.site,16020,1525239334474 
> but state  has otherwise.
>   at 
> org.apache.hadoop.hbase.master.assignment.AssignmentManager.checkOnlineRegionsReport(AssignmentManager.java:1037)
>   ... 7 more
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> {code}
> [~elserj] and I (Ted) did some initial analysis.
> In the following description, M1 refers to 
> master-ctr-e138-1518143905142-279227-01-000005 and M2 refers to 
> master-ctr-e138-1518143905142-279227-01-000003.
> Let's follow region 94f6ca283dbb4445b2bcdc321b734d28 .
> Master 1 was moving the region to 000007:
> {code}
> 2018-05-02 05:38:59,017 INFO  
> [master/ctr-e138-1518143905142-279227-01-000005:20000.Chore.1] 
> master.HMaster: balance hri=94f6ca283dbb4445b2bcdc321b734d28, 
> source=ctr-e138-1518143905142-279227-01-000002.hwx.site,16020,1525239334474, 
> destination=ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525238558502
> 2018-05-02 05:38:59,126 DEBUG 
> [master/ctr-e138-1518143905142-279227-01-000005:20000.Chore.1] 
> procedure2.ProcedureExecutor: Stored pid=465, 
> state=RUNNABLE:MOVE_REGION_UNASSIGN; MoveRegionProcedure 
> hri=94f6ca283dbb4445b2bcdc321b734d28, 
> source=ctr-e138-1518143905142-279227-01-000002.hwx.site,16020,1525239334474, 
> destination=ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525238558502
> {code}
> Region was closed on 000002:
> {code}
> 2018-05-02 05:38:59,327 INFO  
> [RS_CLOSE_REGION-regionserver/ctr-e138-1518143905142-279227-01-000002:16020-1]
>  regionserver.HRegion: Closed 
> test_hbase_ha_load_test_tool_hbase,1eb851e9,1525238855422.94f6ca283dbb4445b2bcdc321b734d28.
> 2018-05-02 05:38:59,327 INFO  
> [RS_CLOSE_REGION-regionserver/ctr-e138-1518143905142-279227-01-000002:16020-1]
>  regionserver.HRegionServer: Adding 94f6ca283dbb4445b2bcdc321b734d28 move to 
> ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525238558502 record 
> at close sequenceid=13395
> 2018-05-02 05:38:59,329 DEBUG 
> [RS_CLOSE_REGION-regionserver/ctr-e138-1518143905142-279227-01-000002:16020-1]
>  handler.CloseRegionHandler: Closed 
> test_hbase_ha_load_test_tool_hbase,1eb851e9,1525238855422.94f6ca283dbb4445b2bcdc321b734d28.
> {code}
> The master saw the destination crash:
> {code}
> 2018-05-02 05:39:44,792 INFO  [main-EventThread] master.RegionServerTracker: 
> RegionServer ephemeral node deleted, processing expiration 
> [ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525238558502]
> 2018-05-02 05:39:44,793 INFO  [main-EventThread] master.ServerManager: 
> Processing expiration of 
> ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525238558502 on 
> ctr-e138-1518143905142-279227-01-000005.hwx.site,20000,1525238791399
> {code}
> Processed it:
> {code}
> 2018-05-02 05:39:45,616 DEBUG [PEWorker-16] procedure.ServerCrashProcedure: 
> Done splitting WALs pid=471, state=RUNNABLE:SERVER_CRASH_SPLIT_LOGS; 
> ServerCrashProcedure 
> server=ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525238558502, 
> splitWal=true, meta=true
> 2018-05-02 05:39:45,620 INFO  [PEWorker-16] procedure2.ProcedureExecutor: 
> Initialized subprocedures=[{pid=474, ppid=471, 
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
> table=test_hbase_ha_load_test_tool_hbase, 
> region=670f6b815d2acac905130e5440d59304}, {pid=475, ppid=471, 
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
> table=test_hbase_ha_load_test_tool_hbase, 
> region=94f6ca283dbb4445b2bcdc321b734d28}, {pid=476, ppid=471, 
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
> table=test_hbase_ha_load_test_tool_hbase, 
> region=1595f38ee901be7c67b997fe2fc95951}, {pid=477, ppid=471, 
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
> table=test_hbase_ha_load_test_tool_hbase, 
> region=a0620fc83de532a37f6a9bb8f99cc6c4}, {pid=478, ppid=471, 
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
> table=test_hbase_ha_load_test_tool_hbase, 
> region=e75a388bc2011feed75bdc1a0e99a9a9}, {pid=479, ppid=471, 
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
> table=test_hbase_ha_load_test_tool_namenode, 
> region=c963eb77dbdc6dbab886dbe4eebba5ad}, {pid=480, ppid=471, 
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
> table=test_hbase_ha_load_test_tool_namenode, 
> region=f3db9f9879ed03f488dcb89bea834237}, {pid=481, ppid=471, 
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
> table=test_hbase_ha_load_test_tool_namenode, 
> region=b5180eee96b616afdf79578309c66a11}, {pid=482, ppid=471, 
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
> table=test_hbase_ha_load_test_tool_namenode, 
> region=a6e0d7561c4f19e78f94d37462588281}, {pid=483, ppid=471, 
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
> table=test_hbase_ha_load_test_tool_namenode, 
> region=1d954f21d711345a9587d995cecea136}, {pid=484, ppid=471, 
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
> table=test_hbase_ha_load_test_tool_namenode, 
> region=c078deb2474e9c19b85b5fdb9efaa47d}, {pid=485, ppid=471, 
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
> table=test_hbase_ha_load_test_tool_namenode, 
> region=91f73e76bbe7bc8a61b1b1299d34c6ab}, {pid=486, ppid=471, 
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
> table=ATLAS_ENTITY_AUDIT_EVENTS, region=8dc6fd2022c2fdf8c065fbd16cadaaca}]
> {code}
> Master 1 tried to assign the region back on the original RS (that we tried to 
> move it off of):
> {code}
> 2018-05-02 05:39:46,136 INFO  [PEWorker-1] 
> assignment.RegionTransitionProcedure: Dispatch pid=475, ppid=471, 
> state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure                  
> table=test_hbase_ha_load_test_tool_hbase, 
> region=94f6ca283dbb4445b2bcdc321b734d28; rit=OPENING, 
> location=ctr-e138-1518143905142-279227-01-000002.hwx.site,16020,1525239334474
> {code}
> Then Master 1 went down, Master 2 picked up the region.
> M2 ran a MoveProcedure where it thought region was on 007:
> {code}
> 2018-05-02 05:43:33,876 DEBUG 
> [master/ctr-e138-1518143905142-279227-01-000003:20000] 
> procedure2.ProcedureExecutor: Completed pid=465, state=SUCCESS; 
> MoveRegionProcedure hri=94f6ca283dbb4445b2bcdc321b734d28, 
> source=ctr-e138-1518143905142-279227-01-000002.hwx.site,16020,1525239334474, 
> destination=ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525238558502
> {code}
> and assigned region to 007:
> {code}
> 2018-05-02 05:44:08,464 INFO  [PEWorker-6] procedure2.ProcedureExecutor: 
> Initialized subprocedures=[{pid=508, ppid=507, 
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
> table=test_hbase_ha_load_test_tool_hbase, 
> region=94f6ca283dbb4445b2bcdc321b734d28}]
> 2018-05-02 05:44:08,472 INFO  [PEWorker-6] 
> procedure.MasterProcedureScheduler: pid=508, ppid=507, 
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
> table=test_hbase_ha_load_test_tool_hbase, 
> region=94f6ca283dbb4445b2bcdc321b734d28 checking lock on 
> 94f6ca283dbb4445b2bcdc321b734d28
> 2018-05-02 05:44:08,480 INFO  [PEWorker-6] assignment.AssignProcedure: 
> Starting pid=508, ppid=507, state=RUNNABLE:REGION_TRANSITION_QUEUE; 
> AssignProcedure table=test_hbase_ha_load_test_tool_hbase, 
> region=94f6ca283dbb4445b2bcdc321b734d28; rit=OFFLINE, 
> location=ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525238558502;
>  forceNewPlan=false, retain=true
> 2018-05-02 05:44:08,646 INFO  
> [master/ctr-e138-1518143905142-279227-01-000003:20000] 
> balancer.BaseLoadBalancer: Reassigned 1 regions. 1 retained the pre-restart 
> assignment. 
> 2018-05-02 05:44:08,659 INFO  [PEWorker-11] assignment.RegionStateStore: 
> pid=508 updating hbase:meta row=94f6ca283dbb4445b2bcdc321b734d28, 
> regionState=OPENING, 
> regionLocation=ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525239609353
> 2018-05-02 05:44:08,727 INFO  [PEWorker-11] 
> assignment.RegionTransitionProcedure: Dispatch pid=508, ppid=507, 
> state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure 
> table=test_hbase_ha_load_test_tool_hbase, 
> region=94f6ca283dbb4445b2bcdc321b734d28; rit=OPENING, 
> location=ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525239609353
> 2018-05-02 05:44:08,768 INFO  
> [master/ctr-e138-1518143905142-279227-01-000003:20000] master.HMaster: Master 
> has completed initialization 41.430sec
> 2018-05-02 05:44:09,213 DEBUG 
> [RpcServer.default.FPBQ.Fifo.handler=28,queue=1,port=20000] 
> assignment.RegionTransitionProcedure: Received report OPENED seqId=13402, 
> pid=508, ppid=507, state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure 
> table=test_hbase_ha_load_test_tool_hbase, 
> region=94f6ca283dbb4445b2bcdc321b734d28; rit=OPENING, 
> location=ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525239609353
> 2018-05-02 05:44:09,213 DEBUG [PEWorker-12] 
> assignment.RegionTransitionProcedure: Finishing pid=508, ppid=507, 
> state=RUNNABLE:REGION_TRANSITION_FINISH; AssignProcedure 
> table=test_hbase_ha_load_test_tool_hbase, 
> region=94f6ca283dbb4445b2bcdc321b734d28; rit=OPENING, 
> location=ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525239609353
> 2018-05-02 05:44:09,214 INFO  [PEWorker-12] assignment.RegionStateStore: 
> pid=508 updating hbase:meta row=94f6ca283dbb4445b2bcdc321b734d28, 
> regionState=OPEN, openSeqNum=13402, 
> regionLocation=ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525239609353
> 2018-05-02 05:44:09,258 INFO  [PEWorker-12] procedure2.ProcedureExecutor: 
> Finished subprocedure(s) of pid=507, state=RUNNABLE:SERVER_CRASH_HANDLE_RIT2; 
> ServerCrashProcedure 
> server=ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525238558502, 
> splitWal=true, meta=false; resume parent processing.
> 2018-05-02 05:44:09,258 INFO  [PEWorker-12] procedure2.ProcedureExecutor: 
> Finished pid=508, ppid=507, state=SUCCESS; AssignProcedure 
> table=test_hbase_ha_load_test_tool_hbase, 
> region=94f6ca283dbb4445b2bcdc321b734d28 in 764msec
> 2018-05-02 05:44:09,273 INFO  [PEWorker-14] procedure2.ProcedureExecutor: 
> Finished pid=507, state=SUCCESS; ServerCrashProcedure 
> server=ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525238558502, 
> splitWal=true, meta=false in 975msec
> {code}
> So when 000002 reported region to master 2, we got the exception.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to