[ 
https://issues.apache.org/jira/browse/HBASE-25829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17337586#comment-17337586
 ] 

Andrew Kyle Purtell edited comment on HBASE-25829 at 4/30/21, 7:31 PM:
-----------------------------------------------------------------------

This is what happens.
I have a nonstandard balancer configuration. It is moving regions during a 
split. 

hbase.balancer.period=60000
hbase.master.balancer.stochastic.minCostNeedBalance=0.025
hbase.master.balancer.stochastic.regionCountCost=100
hbase.master.balancer.stochastic.localityCost=30
hbase.master.balancer.stochastic.moveCost=10
hbase.master.balancer.stochastic.readRequestCost=10
hbase.master.balancer.stochastic.writeRequestCost=15

I think there is a state desynchronization bug in the master when the balancer 
decides to move regions around when a split is in progress. 

The weird log lines reported here are for the split parent after a split 
completes. 

Split procedure begins:
{noformat}
2021-04-30 19:11:39,874 INFO  [PEWorker-65] 
procedure.MasterProcedureScheduler: Took xlock for pid=79, 
state=RUNNABLE:SPLIT_TABLE_REGION_PREPARE; SplitTableRegionProcedure 
table=IntegrationTestLoadCommonCrawl, parent=22ab65a3e71c3ab7f94c1c47067ad54e, 
daughterA=2b607331c1f37f7efa21ae6c2281c89e, 
daughterB=4c53f22c3fe52af4a882a5f7936aa658
2021-04-30 19:11:39,874 INFO  [PEWorker-65] 
procedure.MasterProcedureScheduler: Took xlock for pid=79, 
state=RUNNABLE:SPLIT_TABLE_REGION_PREPARE; SplitTableRegionProcedure 
table=IntegrationTestLoadCommonCrawl, parent=22ab65a3e71c3ab7f94c1c47067ad54e, 
daughterA=2b607331c1f37f7efa21ae6c2281c89e, 
daughterB=4c53f22c3fe52af4a882a5f7936aa658
2021-04-30 19:11:39,874 INFO  [PEWorker-65] 
procedure.MasterProcedureScheduler: Took xlock for pid=79, 
state=RUNNABLE:SPLIT_TABLE_REGION_PREPARE; SplitTableRegionProcedure 
table=IntegrationTestLoadCommonCrawl, parent=22ab65a3e71c3ab7f94c1c47067ad54e, 
daughterA=2b607331c1f37f7efa21ae6c2281c89e, 
daughterB=4c53f22c3fe52af4a882a5f7936aa658
{noformat}

UNASSIGN parent region (22ab65a3e71c3ab7f94c1c47067ad54e), includes WARN for 
failed transition:
{noformat}
2021-04-30 19:11:39,878 INFO  [PEWorker-65] 
procedure2.ProcedureExecutor: Initialized subprocedures=[{pid=80, ppid=79, 
state=RUNNABLE:REGION_STATE_TRANSITION_CLOSE; TransitRegionStateProcedure 
table=IntegrationTestLoadCommonCrawl, region=22ab65a3e71c3ab7f94c1c47067ad54e, 
UNASSIGN}]
2021-04-30 19:11:39,878 INFO  [PEWorker-66] 
procedure.MasterProcedureScheduler: Took xlock for pid=80, ppid=79, 
state=RUNNABLE:REGION_STATE_TRANSITION_CLOSE; TransitRegionStateProcedure 
table=IntegrationTestLoadCommonCrawl, region=22ab65a3e71c3ab7f94c1c47067ad54e, 
UNASSIGN
2021-04-30 19:11:39,879 INFO  [PEWorker-66] 
assignment.RegionStateStore: pid=80 updating hbase:meta 
row=22ab65a3e71c3ab7f94c1c47067ad54e, regionState=CLOSING, 
regionLocation=ip-172-31-63-65.us-west-2.compute.internal,8120,1619809716877
2021-04-30 19:11:40,032 INFO  [PEWorker-66] 
procedure2.ProcedureExecutor: Initialized subprocedures=[{pid=81, ppid=80, 
state=RUNNABLE; CloseRegionProcedure 22ab65a3e71c3ab7f94c1c47067ad54e, 
server=ip-172-31-63-65.us-west-2.compute.internal,8120,1619809716877}]
2021-04-30 19:11:40,303 WARN  
[RpcServer.priority.RWQ.Codel.write.handler=0,queue=0,port=8100]
assignment.AssignmentManager: Failed transition
org.apache.hadoop.hbase.client.DoNotRetryRegionException: 
22ab65a3e71c3ab7f94c1c47067ad54e is not OPEN; state=CLOSING
        at 
org.apache.hadoop.hbase.master.assignment.RegionStateNode.checkOnline(RegionStateNode.java:291)
        at 
org.apache.hadoop.hbase.master.procedure.AbstractStateMachineTableProcedure.checkOnline(AbstractStateMachineTableProcedure.java:194)
        at 
org.apache.hadoop.hbase.master.assignment.SplitTableRegionProcedure.<init>(SplitTableRegionProcedure.java:112)
        at 
org.apache.hadoop.hbase.master.assignment.AssignmentManager.createSplitProcedure(AssignmentManager.java:939)
        at 
org.apache.hadoop.hbase.master.assignment.AssignmentManager.updateRegionSplitTransition(AssignmentManager.java:1122)
        at 
org.apache.hadoop.hbase.master.assignment.AssignmentManager.reportRegionStateTransition(AssignmentManager.java:984)
        at 
org.apache.hadoop.hbase.master.assignment.AssignmentManager.reportRegionStateTransition(AssignmentManager.java:1019)
        at 
org.apache.hadoop.hbase.master.MasterRpcServices.reportRegionStateTransition(MasterRpcServices.java:1727)
        at 
org.apache.hadoop.hbase.shaded.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:16185)
        at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:395)
        at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133)
        at 
org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338)
        at 
org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:318)
2021-04-30 19:11:44,814 INFO  [PEWorker-68] 
assignment.RegionStateStore: pid=80 updating hbase:meta 
row=22ab65a3e71c3ab7f94c1c47067ad54e, regionState=CLOSED
2021-04-30 19:11:44,818 INFO  [PEWorker-68] 
procedure2.ProcedureExecutor: Finished subprocedure pid=81, resume processing 
ppid=80
2021-04-30 19:11:44,818 INFO  [PEWorker-68] 
procedure2.ProcedureExecutor: Finished pid=81, ppid=80, state=SUCCESS; 
CloseRegionProcedure 22ab65a3e71c3ab7f94c1c47067ad54e, 
server=ip-172-31-63-65.us-west-2.compute.internal,8120,1619809716877 in 4.7850 
sec
2021-04-30 19:11:44,819 INFO  [PEWorker-69] 
procedure2.ProcedureExecutor: Finished subprocedure pid=80, resume processing 
ppid=79
2021-04-30 19:11:44,820 INFO  [PEWorker-69] 
procedure2.ProcedureExecutor: Finished pid=80, ppid=79, state=SUCCESS; 
TransitRegionStateProcedure table=IntegrationTestLoadCommonCrawl, 
region=22ab65a3e71c3ab7f94c1c47067ad54e, UNASSIGN in 4.9400 sec
{noformat}

Followed by store splitting:
{noformat}
2021-04-30 19:11:44,837 INFO  [PEWorker-70] 
assignment.SplitTableRegionProcedure: pid=79 splitting 10 storefiles, 
region=22ab65a3e71c3ab7f94c1c47067ad54e, threads=10
{noformat}

Followed by reassignments:
{noformat}
2021-04-30 19:11:45,089 INFO  [PEWorker-70] 
procedure2.ProcedureExecutor: Initialized subprocedures=[{pid=82, ppid=79, 
state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE; 
TransitRegionStateProcedure table=IntegrationTestLoadCommonCrawl, 
region=2b607331c1f37f7efa21ae6c2281c89e, ASSIGN}, {pid=83, ppid=79, 
state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE; 
TransitRegionStateProcedure table=IntegrationTestLoadCommonCrawl, 
region=4c53f22c3fe52af4a882a5f7936aa658, ASSIGN}]
2021-04-30 19:11:45,090 INFO  [PEWorker-71] 
procedure.MasterProcedureScheduler: Took xlock for pid=83, ppid=79, 
state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE; 
TransitRegionStateProcedure table=IntegrationTestLoadCommonCrawl, 
region=4c53f22c3fe52af4a882a5f7936aa658, ASSIGN
2021-04-30 19:11:45,090 INFO  [PEWorker-72] 
procedure.MasterProcedureScheduler: Took xlock for pid=82, ppid=79, 
state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE; 
TransitRegionStateProcedure table=IntegrationTestLoadCommonCrawl, 
region=2b607331c1f37f7efa21ae6c2281c89e, ASSIGN
2021-04-30 19:11:45,091 INFO  [PEWorker-71] 
assignment.TransitRegionStateProcedure: Starting pid=83, ppid=79, 
state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, locked=true; 
TransitRegionStateProcedure table=IntegrationTestLoadCommonCrawl, 
region=4c53f22c3fe52af4a882a5f7936aa658, ASSIGN; state=SPLITTING_NEW, 
location=ip-172-31-63-65.us-west-2.compute.internal,8120,1619809716877; 
forceNewPlan=false, retain=false
2021-04-30 19:11:45,091 INFO  [PEWorker-72] 
assignment.TransitRegionStateProcedure: Starting pid=82, ppid=79, 
state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, locked=true; 
TransitRegionStateProcedure table=IntegrationTestLoadCommonCrawl, 
region=2b607331c1f37f7efa21ae6c2281c89e, ASSIGN; state=SPLITTING_NEW, 
location=ip-172-31-63-65.us-west-2.compute.internal,8120,1619809716877; 
forceNewPlan=false, retain=false
{noformat}

Balancer says it reassigned two regions??
{noformat}
2021-04-30 19:11:45,241 INFO  [ip-172-31-58-47:8100] 
balancer.BaseLoadBalancer: Reassigned 2 regions. 2 retained the pre-restart 
assignment. 
{noformat}

Now there are region re-opens:
{noformat}
2021-04-30 19:11:45,241 INFO  [PEWorker-73] 
assignment.RegionStateStore: pid=83 updating hbase:meta 
row=4c53f22c3fe52af4a882a5f7936aa658, regionState=OPENING, 
regionLocation=ip-172-31-63-65.us-west-2.compute.internal,8120,1619809716877
2021-04-30 19:11:45,241 INFO  [PEWorker-74] 
assignment.RegionStateStore: pid=82 updating hbase:meta 
row=2b607331c1f37f7efa21ae6c2281c89e, regionState=OPENING, 
regionLocation=ip-172-31-63-65.us-west-2.compute.internal,8120,1619809716877
2021-04-30 19:11:45,244 INFO  [PEWorker-74] 
procedure2.ProcedureExecutor: Initialized subprocedures=[{pid=84, ppid=82, 
state=RUNNABLE; OpenRegionProcedure 2b607331c1f37f7efa21ae6c2281c89e, 
server=ip-172-31-63-65.us-west-2.compute.internal,8120,1619809716877}]
2021-04-30 19:11:45,244 INFO  [PEWorker-73] 
procedure2.ProcedureExecutor: Initialized subprocedures=[{pid=85, ppid=83, 
state=RUNNABLE; OpenRegionProcedure 4c53f22c3fe52af4a882a5f7936aa658, 
server=ip-172-31-63-65.us-west-2.compute.internal,8120,1619809716877}]
2021-04-30 19:11:45,498 INFO  [PEWorker-77] 
assignment.RegionStateStore: pid=83 updating hbase:meta 
row=4c53f22c3fe52af4a882a5f7936aa658, regionState=OPEN, openSeqNum=54041, 
regionLocation=ip-172-31-63-65.us-west-2.compute.internal,8120,1619809716877
2021-04-30 19:11:45,502 INFO  [PEWorker-81] 
assignment.RegionStateStore: pid=82 updating hbase:meta 
row=2b607331c1f37f7efa21ae6c2281c89e, regionState=OPEN, openSeqNum=54041, 
regionLocation=ip-172-31-63-65.us-west-2.compute.internal,8120,1619809716877
2021-04-30 19:11:45,502 INFO  [PEWorker-77] 
procedure2.ProcedureExecutor: Finished subprocedure pid=85, resume processing 
ppid=83
2021-04-30 19:11:45,502 INFO  [PEWorker-77] 
procedure2.ProcedureExecutor: Finished pid=85, ppid=83, state=SUCCESS; 
OpenRegionProcedure 4c53f22c3fe52af4a882a5f7936aa658, 
server=ip-172-31-63-65.us-west-2.compute.internal,8120,1619809716877 in 257 msec
2021-04-30 19:11:45,503 INFO  [PEWorker-82] 
procedure2.ProcedureExecutor: Finished pid=83, ppid=79, state=SUCCESS; 
TransitRegionStateProcedure table=IntegrationTestLoadCommonCrawl, 
region=4c53f22c3fe52af4a882a5f7936aa658, ASSIGN in 413 msec
2021-04-30 19:11:45,505 INFO  [PEWorker-81] 
procedure2.ProcedureExecutor: Finished subprocedure pid=84, resume processing 
ppid=82
2021-04-30 19:11:45,505 INFO  [PEWorker-81] 
procedure2.ProcedureExecutor: Finished pid=84, ppid=82, state=SUCCESS; 
OpenRegionProcedure 2b607331c1f37f7efa21ae6c2281c89e, 
server=ip-172-31-63-65.us-west-2.compute.internal,8120,1619809716877 in 260 msec
2021-04-30 19:11:45,506 INFO  [PEWorker-83] 
procedure2.ProcedureExecutor: Finished subprocedure pid=82, resume processing 
ppid=79
2021-04-30 19:11:45,506 INFO  [PEWorker-83] 
procedure2.ProcedureExecutor: Finished pid=82, ppid=79, state=SUCCESS; 
TransitRegionStateProcedure table=IntegrationTestLoadCommonCrawl, 
region=2b607331c1f37f7efa21ae6c2281c89e, ASSIGN in 416 msec
2021-04-30 19:11:45,507 INFO  [PEWorker-79] 
procedure2.ProcedureExecutor: Finished pid=79, state=SUCCESS; 
SplitTableRegionProcedure table=IntegrationTestLoadCommonCrawl, 
parent=22ab65a3e71c3ab7f94c1c47067ad54e, 
daughterA=2b607331c1f37f7efa21ae6c2281c89e, 
daughterB=4c53f22c3fe52af4a882a5f7936aa658 in 5.6370 sec
{noformat}

Done! Everything is great, right? Nope, now this starts at next balancer 
iteration:
{noformat}
2021-04-30 19:11:50,055 INFO  [master/ip-172-31-58-47:8100.Chore.1] 
assignment.RegionStates: Skipping, no server for state=SPLIT, location=null, 
table=IntegrationTestLoadCommonCrawl, region=22ab65a3e71c3ab7f94c1c47067ad54e
{noformat}


was (Author: apurtell):
This is what happens.
I have a nonstandard balancer configuration. It is moving regions during a 
split. 

hbase.balancer.period=60000
hbase.master.balancer.stochastic.minCostNeedBalance=0.025
hbase.master.balancer.stochastic.regionCountCost=100
hbase.master.balancer.stochastic.localityCost=30
hbase.master.balancer.stochastic.moveCost=10
hbase.master.balancer.stochastic.readRequestCost=10
hbase.master.balancer.stochastic.writeRequestCost=15

I think there is a state desynchronization bug in the master when the balancer 
decides to move regions around when a split is in progress. 

Split procedure begins:
{noformat}
2021-04-30 19:11:39,874 INFO  [PEWorker-65] 
procedure.MasterProcedureScheduler: Took xlock for pid=79, 
state=RUNNABLE:SPLIT_TABLE_REGION_PREPARE; SplitTableRegionProcedure 
table=IntegrationTestLoadCommonCrawl, parent=22ab65a3e71c3ab7f94c1c47067ad54e, 
daughterA=2b607331c1f37f7efa21ae6c2281c89e, 
daughterB=4c53f22c3fe52af4a882a5f7936aa658
2021-04-30 19:11:39,874 INFO  [PEWorker-65] 
procedure.MasterProcedureScheduler: Took xlock for pid=79, 
state=RUNNABLE:SPLIT_TABLE_REGION_PREPARE; SplitTableRegionProcedure 
table=IntegrationTestLoadCommonCrawl, parent=22ab65a3e71c3ab7f94c1c47067ad54e, 
daughterA=2b607331c1f37f7efa21ae6c2281c89e, 
daughterB=4c53f22c3fe52af4a882a5f7936aa658
2021-04-30 19:11:39,874 INFO  [PEWorker-65] 
procedure.MasterProcedureScheduler: Took xlock for pid=79, 
state=RUNNABLE:SPLIT_TABLE_REGION_PREPARE; SplitTableRegionProcedure 
table=IntegrationTestLoadCommonCrawl, parent=22ab65a3e71c3ab7f94c1c47067ad54e, 
daughterA=2b607331c1f37f7efa21ae6c2281c89e, 
daughterB=4c53f22c3fe52af4a882a5f7936aa658
{noformat}

UNASSIGN parent region (22ab65a3e71c3ab7f94c1c47067ad54e), includes WARN for 
failed transition:
{noformat}
2021-04-30 19:11:39,878 INFO  [PEWorker-65] 
procedure2.ProcedureExecutor: Initialized subprocedures=[{pid=80, ppid=79, 
state=RUNNABLE:REGION_STATE_TRANSITION_CLOSE; TransitRegionStateProcedure 
table=IntegrationTestLoadCommonCrawl, region=22ab65a3e71c3ab7f94c1c47067ad54e, 
UNASSIGN}]
2021-04-30 19:11:39,878 INFO  [PEWorker-66] 
procedure.MasterProcedureScheduler: Took xlock for pid=80, ppid=79, 
state=RUNNABLE:REGION_STATE_TRANSITION_CLOSE; TransitRegionStateProcedure 
table=IntegrationTestLoadCommonCrawl, region=22ab65a3e71c3ab7f94c1c47067ad54e, 
UNASSIGN
2021-04-30 19:11:39,879 INFO  [PEWorker-66] 
assignment.RegionStateStore: pid=80 updating hbase:meta 
row=22ab65a3e71c3ab7f94c1c47067ad54e, regionState=CLOSING, 
regionLocation=ip-172-31-63-65.us-west-2.compute.internal,8120,1619809716877
2021-04-30 19:11:40,032 INFO  [PEWorker-66] 
procedure2.ProcedureExecutor: Initialized subprocedures=[{pid=81, ppid=80, 
state=RUNNABLE; CloseRegionProcedure 22ab65a3e71c3ab7f94c1c47067ad54e, 
server=ip-172-31-63-65.us-west-2.compute.internal,8120,1619809716877}]
2021-04-30 19:11:40,303 WARN  
[RpcServer.priority.RWQ.Codel.write.handler=0,queue=0,port=8100]
assignment.AssignmentManager: Failed transition
org.apache.hadoop.hbase.client.DoNotRetryRegionException: 
22ab65a3e71c3ab7f94c1c47067ad54e is not OPEN; state=CLOSING
        at 
org.apache.hadoop.hbase.master.assignment.RegionStateNode.checkOnline(RegionStateNode.java:291)
        at 
org.apache.hadoop.hbase.master.procedure.AbstractStateMachineTableProcedure.checkOnline(AbstractStateMachineTableProcedure.java:194)
        at 
org.apache.hadoop.hbase.master.assignment.SplitTableRegionProcedure.<init>(SplitTableRegionProcedure.java:112)
        at 
org.apache.hadoop.hbase.master.assignment.AssignmentManager.createSplitProcedure(AssignmentManager.java:939)
        at 
org.apache.hadoop.hbase.master.assignment.AssignmentManager.updateRegionSplitTransition(AssignmentManager.java:1122)
        at 
org.apache.hadoop.hbase.master.assignment.AssignmentManager.reportRegionStateTransition(AssignmentManager.java:984)
        at 
org.apache.hadoop.hbase.master.assignment.AssignmentManager.reportRegionStateTransition(AssignmentManager.java:1019)
        at 
org.apache.hadoop.hbase.master.MasterRpcServices.reportRegionStateTransition(MasterRpcServices.java:1727)
        at 
org.apache.hadoop.hbase.shaded.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:16185)
        at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:395)
        at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133)
        at 
org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338)
        at 
org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:318)
2021-04-30 19:11:44,814 INFO  [PEWorker-68] 
assignment.RegionStateStore: pid=80 updating hbase:meta 
row=22ab65a3e71c3ab7f94c1c47067ad54e, regionState=CLOSED
2021-04-30 19:11:44,818 INFO  [PEWorker-68] 
procedure2.ProcedureExecutor: Finished subprocedure pid=81, resume processing 
ppid=80
2021-04-30 19:11:44,818 INFO  [PEWorker-68] 
procedure2.ProcedureExecutor: Finished pid=81, ppid=80, state=SUCCESS; 
CloseRegionProcedure 22ab65a3e71c3ab7f94c1c47067ad54e, 
server=ip-172-31-63-65.us-west-2.compute.internal,8120,1619809716877 in 4.7850 
sec
2021-04-30 19:11:44,819 INFO  [PEWorker-69] 
procedure2.ProcedureExecutor: Finished subprocedure pid=80, resume processing 
ppid=79
2021-04-30 19:11:44,820 INFO  [PEWorker-69] 
procedure2.ProcedureExecutor: Finished pid=80, ppid=79, state=SUCCESS; 
TransitRegionStateProcedure table=IntegrationTestLoadCommonCrawl, 
region=22ab65a3e71c3ab7f94c1c47067ad54e, UNASSIGN in 4.9400 sec
{noformat}

Followed by store splitting:
{noformat}
2021-04-30 19:11:44,837 INFO  [PEWorker-70] 
assignment.SplitTableRegionProcedure: pid=79 splitting 10 storefiles, 
region=22ab65a3e71c3ab7f94c1c47067ad54e, threads=10
{noformat}

Followed by reassignments:
{noformat}
2021-04-30 19:11:45,089 INFO  [PEWorker-70] 
procedure2.ProcedureExecutor: Initialized subprocedures=[{pid=82, ppid=79, 
state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE; 
TransitRegionStateProcedure table=IntegrationTestLoadCommonCrawl, 
region=2b607331c1f37f7efa21ae6c2281c89e, ASSIGN}, {pid=83, ppid=79, 
state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE; 
TransitRegionStateProcedure table=IntegrationTestLoadCommonCrawl, 
region=4c53f22c3fe52af4a882a5f7936aa658, ASSIGN}]
2021-04-30 19:11:45,090 INFO  [PEWorker-71] 
procedure.MasterProcedureScheduler: Took xlock for pid=83, ppid=79, 
state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE; 
TransitRegionStateProcedure table=IntegrationTestLoadCommonCrawl, 
region=4c53f22c3fe52af4a882a5f7936aa658, ASSIGN
2021-04-30 19:11:45,090 INFO  [PEWorker-72] 
procedure.MasterProcedureScheduler: Took xlock for pid=82, ppid=79, 
state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE; 
TransitRegionStateProcedure table=IntegrationTestLoadCommonCrawl, 
region=2b607331c1f37f7efa21ae6c2281c89e, ASSIGN
2021-04-30 19:11:45,091 INFO  [PEWorker-71] 
assignment.TransitRegionStateProcedure: Starting pid=83, ppid=79, 
state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, locked=true; 
TransitRegionStateProcedure table=IntegrationTestLoadCommonCrawl, 
region=4c53f22c3fe52af4a882a5f7936aa658, ASSIGN; state=SPLITTING_NEW, 
location=ip-172-31-63-65.us-west-2.compute.internal,8120,1619809716877; 
forceNewPlan=false, retain=false
2021-04-30 19:11:45,091 INFO  [PEWorker-72] 
assignment.TransitRegionStateProcedure: Starting pid=82, ppid=79, 
state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, locked=true; 
TransitRegionStateProcedure table=IntegrationTestLoadCommonCrawl, 
region=2b607331c1f37f7efa21ae6c2281c89e, ASSIGN; state=SPLITTING_NEW, 
location=ip-172-31-63-65.us-west-2.compute.internal,8120,1619809716877; 
forceNewPlan=false, retain=false
{noformat}

Balancer says it reassigned two regions??
{noformat}
2021-04-30 19:11:45,241 INFO  [ip-172-31-58-47:8100] 
balancer.BaseLoadBalancer: Reassigned 2 regions. 2 retained the pre-restart 
assignment. 
{noformat}

Now there are region re-opens:
{noformat}
2021-04-30 19:11:45,241 INFO  [PEWorker-73] 
assignment.RegionStateStore: pid=83 updating hbase:meta 
row=4c53f22c3fe52af4a882a5f7936aa658, regionState=OPENING, 
regionLocation=ip-172-31-63-65.us-west-2.compute.internal,8120,1619809716877
2021-04-30 19:11:45,241 INFO  [PEWorker-74] 
assignment.RegionStateStore: pid=82 updating hbase:meta 
row=2b607331c1f37f7efa21ae6c2281c89e, regionState=OPENING, 
regionLocation=ip-172-31-63-65.us-west-2.compute.internal,8120,1619809716877
2021-04-30 19:11:45,244 INFO  [PEWorker-74] 
procedure2.ProcedureExecutor: Initialized subprocedures=[{pid=84, ppid=82, 
state=RUNNABLE; OpenRegionProcedure 2b607331c1f37f7efa21ae6c2281c89e, 
server=ip-172-31-63-65.us-west-2.compute.internal,8120,1619809716877}]
2021-04-30 19:11:45,244 INFO  [PEWorker-73] 
procedure2.ProcedureExecutor: Initialized subprocedures=[{pid=85, ppid=83, 
state=RUNNABLE; OpenRegionProcedure 4c53f22c3fe52af4a882a5f7936aa658, 
server=ip-172-31-63-65.us-west-2.compute.internal,8120,1619809716877}]
2021-04-30 19:11:45,498 INFO  [PEWorker-77] 
assignment.RegionStateStore: pid=83 updating hbase:meta 
row=4c53f22c3fe52af4a882a5f7936aa658, regionState=OPEN, openSeqNum=54041, 
regionLocation=ip-172-31-63-65.us-west-2.compute.internal,8120,1619809716877
2021-04-30 19:11:45,502 INFO  [PEWorker-81] 
assignment.RegionStateStore: pid=82 updating hbase:meta 
row=2b607331c1f37f7efa21ae6c2281c89e, regionState=OPEN, openSeqNum=54041, 
regionLocation=ip-172-31-63-65.us-west-2.compute.internal,8120,1619809716877
2021-04-30 19:11:45,502 INFO  [PEWorker-77] 
procedure2.ProcedureExecutor: Finished subprocedure pid=85, resume processing 
ppid=83
2021-04-30 19:11:45,502 INFO  [PEWorker-77] 
procedure2.ProcedureExecutor: Finished pid=85, ppid=83, state=SUCCESS; 
OpenRegionProcedure 4c53f22c3fe52af4a882a5f7936aa658, 
server=ip-172-31-63-65.us-west-2.compute.internal,8120,1619809716877 in 257 msec
2021-04-30 19:11:45,503 INFO  [PEWorker-82] 
procedure2.ProcedureExecutor: Finished pid=83, ppid=79, state=SUCCESS; 
TransitRegionStateProcedure table=IntegrationTestLoadCommonCrawl, 
region=4c53f22c3fe52af4a882a5f7936aa658, ASSIGN in 413 msec
2021-04-30 19:11:45,505 INFO  [PEWorker-81] 
procedure2.ProcedureExecutor: Finished subprocedure pid=84, resume processing 
ppid=82
2021-04-30 19:11:45,505 INFO  [PEWorker-81] 
procedure2.ProcedureExecutor: Finished pid=84, ppid=82, state=SUCCESS; 
OpenRegionProcedure 2b607331c1f37f7efa21ae6c2281c89e, 
server=ip-172-31-63-65.us-west-2.compute.internal,8120,1619809716877 in 260 msec
2021-04-30 19:11:45,506 INFO  [PEWorker-83] 
procedure2.ProcedureExecutor: Finished subprocedure pid=82, resume processing 
ppid=79
2021-04-30 19:11:45,506 INFO  [PEWorker-83] 
procedure2.ProcedureExecutor: Finished pid=82, ppid=79, state=SUCCESS; 
TransitRegionStateProcedure table=IntegrationTestLoadCommonCrawl, 
region=2b607331c1f37f7efa21ae6c2281c89e, ASSIGN in 416 msec
2021-04-30 19:11:45,507 INFO  [PEWorker-79] 
procedure2.ProcedureExecutor: Finished pid=79, state=SUCCESS; 
SplitTableRegionProcedure table=IntegrationTestLoadCommonCrawl, 
parent=22ab65a3e71c3ab7f94c1c47067ad54e, 
daughterA=2b607331c1f37f7efa21ae6c2281c89e, 
daughterB=4c53f22c3fe52af4a882a5f7936aa658 in 5.6370 sec
{noformat}

Done! Everything is great, right? Nope, now this starts at next balancer 
iteration:
{noformat}
2021-04-30 19:11:50,055 INFO  [master/ip-172-31-58-47:8100.Chore.1] 
assignment.RegionStates: Skipping, no server for state=SPLIT, location=null, 
table=IntegrationTestLoadCommonCrawl, region=22ab65a3e71c3ab7f94c1c47067ad54e
{noformat}

> SPLIT state detritus
> --------------------
>
>                 Key: HBASE-25829
>                 URL: https://issues.apache.org/jira/browse/HBASE-25829
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 2.4.3
>            Reporter: Andrew Kyle Purtell
>            Priority: Major
>             Fix For: 3.0.0-alpha-1, 2.5.0, 2.4.3
>
>
> Seen after an integration test (see HBASE-25824) with 'calm' monkey, so this 
> happened in the happy path.
> There were no errors accessing all loaded table data. The integration test 
> writes a log to HDFS of every cell written to HBase and the verify phase uses 
> that log to read each value and confirm it. That seems fine:
> {noformat}
> 2021-04-30 02:16:33,316 INFO  [main] 
> test.IntegrationTestLoadCommonCrawl$Verify: REFERENCED: 154943544
> 2021-04-30 02:16:33,316 INFO  [main] 
> test.IntegrationTestLoadCommonCrawl$Verify: UNREFERENCED: 0
> 2021-04-30 02:16:33,316 INFO  [main] 
> test.IntegrationTestLoadCommonCrawl$Verify: CORRUPT: 0
> {noformat}
> However whenever the balancer runs there are a number of concerning INFO 
> level log messages printed of the form _assignment.RegionStates: Skipping, no 
> server for state=SPLIT, location=null, table=TABLENAME_ 
> For example:
> {noformat}
> 2021-04-30 02:02:09,286 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
> assignment.RegionStates: Skipping, no server for state=SPLIT, location=null, 
> table=IntegrationTestLoadCommonCrawl, region=087fb2f7847c2fc0a0b85eb30a97036e
> 2021-04-30 02:02:09,286 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
> assignment.RegionStates: Skipping, no server for state=SPLIT, location=null, 
> table=IntegrationTestLoadCommonCrawl, region=0952b94a920454afe9c40becbb7bf205
> 2021-04-30 02:02:09,286 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
> assignment.RegionStates: Skipping, no server for state=SPLIT, location=null, 
> table=IntegrationTestLoadCommonCrawl, region=f87a8b993f7eca2524bf2331b7ee3c06
> 2021-04-30 02:02:09,286 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
> assignment.RegionStates: Skipping, no server for state=SPLIT, location=null, 
> table=IntegrationTestLoadCommonCrawl, region=74bb28864a120decdf0f4956741df745
> 2021-04-30 02:02:09,286 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
> assignment.RegionStates: Skipping, no server for state=SPLIT, location=null, 
> table=IntegrationTestLoadCommonCrawl, region=bc918b609ade0ae4d5530f0467354cae
> 2021-04-30 02:02:09,286 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
> assignment.RegionStates: Skipping, no server for state=SPLIT, location=null, 
> table=IntegrationTestLoadCommonCrawl, region=183a199984539f3917a2f8927fe01572
> 2021-04-30 02:02:09,286 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
> assignment.RegionStates: Skipping, no server for state=SPLIT, location=null, 
> table=IntegrationTestLoadCommonCrawl, region=6cc5ce4fb4adc00445b3ec7dd8760ba8
> {noformat}
> The HBCK chore notices them but does nothing:
> "Loaded *80 regions* from in-memory state of AssignmentManager"
> "Loaded *73 regions from 5 regionservers' reports* and found 0 orphan regions"
> "Loaded 3 tables 80 regions from filesystem and found 0 orphan regions"
> Yes, there are exactly 7 region state records of SPLIT state with 
> server=null. 
> {noformat}
> 2021-04-30 02:02:09,300 INFO  [master/ip-172-31-58-47:8100.Chore.1] 
> master.HbckChore: Loaded 80 regions from in-memory state of AssignmentManager
> 2021-04-30 02:02:09,300 INFO  [master/ip-172-31-58-47:8100.Chore.1] 
> master.HbckChore: Loaded 73 regions from 5 regionservers' reports and found 0 
> orphan regions
> 2021-04-30 02:02:09,306 INFO  [master/ip-172-31-58-47:8100.Chore.1] 
> master.HbckChore: Loaded 3 tables 80 regions from filesystem and found 0 
> orphan regions
> {noformat}
> This repeats indefinitely. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to