[
https://issues.apache.org/jira/browse/HBASE-25829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17337586#comment-17337586
]
Andrew Kyle Purtell edited comment on HBASE-25829 at 4/30/21, 7:31 PM:
-----------------------------------------------------------------------
This is what happens.
I have a nonstandard balancer configuration. It is moving regions during a
split.
hbase.balancer.period=60000
hbase.master.balancer.stochastic.minCostNeedBalance=0.025
hbase.master.balancer.stochastic.regionCountCost=100
hbase.master.balancer.stochastic.localityCost=30
hbase.master.balancer.stochastic.moveCost=10
hbase.master.balancer.stochastic.readRequestCost=10
hbase.master.balancer.stochastic.writeRequestCost=15
I think there is a state desynchronization bug in the master when the balancer
decides to move regions around when a split is in progress.
The weird log lines reported here are for the split parent after a split
completes.
Split procedure begins:
{noformat}
2021-04-30 19:11:39,874 INFO [PEWorker-65]
procedure.MasterProcedureScheduler: Took xlock for pid=79,
state=RUNNABLE:SPLIT_TABLE_REGION_PREPARE; SplitTableRegionProcedure
table=IntegrationTestLoadCommonCrawl, parent=22ab65a3e71c3ab7f94c1c47067ad54e,
daughterA=2b607331c1f37f7efa21ae6c2281c89e,
daughterB=4c53f22c3fe52af4a882a5f7936aa658
2021-04-30 19:11:39,874 INFO [PEWorker-65]
procedure.MasterProcedureScheduler: Took xlock for pid=79,
state=RUNNABLE:SPLIT_TABLE_REGION_PREPARE; SplitTableRegionProcedure
table=IntegrationTestLoadCommonCrawl, parent=22ab65a3e71c3ab7f94c1c47067ad54e,
daughterA=2b607331c1f37f7efa21ae6c2281c89e,
daughterB=4c53f22c3fe52af4a882a5f7936aa658
2021-04-30 19:11:39,874 INFO [PEWorker-65]
procedure.MasterProcedureScheduler: Took xlock for pid=79,
state=RUNNABLE:SPLIT_TABLE_REGION_PREPARE; SplitTableRegionProcedure
table=IntegrationTestLoadCommonCrawl, parent=22ab65a3e71c3ab7f94c1c47067ad54e,
daughterA=2b607331c1f37f7efa21ae6c2281c89e,
daughterB=4c53f22c3fe52af4a882a5f7936aa658
{noformat}
UNASSIGN parent region (22ab65a3e71c3ab7f94c1c47067ad54e), includes WARN for
failed transition:
{noformat}
2021-04-30 19:11:39,878 INFO [PEWorker-65]
procedure2.ProcedureExecutor: Initialized subprocedures=[{pid=80, ppid=79,
state=RUNNABLE:REGION_STATE_TRANSITION_CLOSE; TransitRegionStateProcedure
table=IntegrationTestLoadCommonCrawl, region=22ab65a3e71c3ab7f94c1c47067ad54e,
UNASSIGN}]
2021-04-30 19:11:39,878 INFO [PEWorker-66]
procedure.MasterProcedureScheduler: Took xlock for pid=80, ppid=79,
state=RUNNABLE:REGION_STATE_TRANSITION_CLOSE; TransitRegionStateProcedure
table=IntegrationTestLoadCommonCrawl, region=22ab65a3e71c3ab7f94c1c47067ad54e,
UNASSIGN
2021-04-30 19:11:39,879 INFO [PEWorker-66]
assignment.RegionStateStore: pid=80 updating hbase:meta
row=22ab65a3e71c3ab7f94c1c47067ad54e, regionState=CLOSING,
regionLocation=ip-172-31-63-65.us-west-2.compute.internal,8120,1619809716877
2021-04-30 19:11:40,032 INFO [PEWorker-66]
procedure2.ProcedureExecutor: Initialized subprocedures=[{pid=81, ppid=80,
state=RUNNABLE; CloseRegionProcedure 22ab65a3e71c3ab7f94c1c47067ad54e,
server=ip-172-31-63-65.us-west-2.compute.internal,8120,1619809716877}]
2021-04-30 19:11:40,303 WARN
[RpcServer.priority.RWQ.Codel.write.handler=0,queue=0,port=8100]
assignment.AssignmentManager: Failed transition
org.apache.hadoop.hbase.client.DoNotRetryRegionException:
22ab65a3e71c3ab7f94c1c47067ad54e is not OPEN; state=CLOSING
at
org.apache.hadoop.hbase.master.assignment.RegionStateNode.checkOnline(RegionStateNode.java:291)
at
org.apache.hadoop.hbase.master.procedure.AbstractStateMachineTableProcedure.checkOnline(AbstractStateMachineTableProcedure.java:194)
at
org.apache.hadoop.hbase.master.assignment.SplitTableRegionProcedure.<init>(SplitTableRegionProcedure.java:112)
at
org.apache.hadoop.hbase.master.assignment.AssignmentManager.createSplitProcedure(AssignmentManager.java:939)
at
org.apache.hadoop.hbase.master.assignment.AssignmentManager.updateRegionSplitTransition(AssignmentManager.java:1122)
at
org.apache.hadoop.hbase.master.assignment.AssignmentManager.reportRegionStateTransition(AssignmentManager.java:984)
at
org.apache.hadoop.hbase.master.assignment.AssignmentManager.reportRegionStateTransition(AssignmentManager.java:1019)
at
org.apache.hadoop.hbase.master.MasterRpcServices.reportRegionStateTransition(MasterRpcServices.java:1727)
at
org.apache.hadoop.hbase.shaded.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:16185)
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:395)
at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133)
at
org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338)
at
org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:318)
2021-04-30 19:11:44,814 INFO [PEWorker-68]
assignment.RegionStateStore: pid=80 updating hbase:meta
row=22ab65a3e71c3ab7f94c1c47067ad54e, regionState=CLOSED
2021-04-30 19:11:44,818 INFO [PEWorker-68]
procedure2.ProcedureExecutor: Finished subprocedure pid=81, resume processing
ppid=80
2021-04-30 19:11:44,818 INFO [PEWorker-68]
procedure2.ProcedureExecutor: Finished pid=81, ppid=80, state=SUCCESS;
CloseRegionProcedure 22ab65a3e71c3ab7f94c1c47067ad54e,
server=ip-172-31-63-65.us-west-2.compute.internal,8120,1619809716877 in 4.7850
sec
2021-04-30 19:11:44,819 INFO [PEWorker-69]
procedure2.ProcedureExecutor: Finished subprocedure pid=80, resume processing
ppid=79
2021-04-30 19:11:44,820 INFO [PEWorker-69]
procedure2.ProcedureExecutor: Finished pid=80, ppid=79, state=SUCCESS;
TransitRegionStateProcedure table=IntegrationTestLoadCommonCrawl,
region=22ab65a3e71c3ab7f94c1c47067ad54e, UNASSIGN in 4.9400 sec
{noformat}
Followed by store splitting:
{noformat}
2021-04-30 19:11:44,837 INFO [PEWorker-70]
assignment.SplitTableRegionProcedure: pid=79 splitting 10 storefiles,
region=22ab65a3e71c3ab7f94c1c47067ad54e, threads=10
{noformat}
Followed by reassignments:
{noformat}
2021-04-30 19:11:45,089 INFO [PEWorker-70]
procedure2.ProcedureExecutor: Initialized subprocedures=[{pid=82, ppid=79,
state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE;
TransitRegionStateProcedure table=IntegrationTestLoadCommonCrawl,
region=2b607331c1f37f7efa21ae6c2281c89e, ASSIGN}, {pid=83, ppid=79,
state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE;
TransitRegionStateProcedure table=IntegrationTestLoadCommonCrawl,
region=4c53f22c3fe52af4a882a5f7936aa658, ASSIGN}]
2021-04-30 19:11:45,090 INFO [PEWorker-71]
procedure.MasterProcedureScheduler: Took xlock for pid=83, ppid=79,
state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE;
TransitRegionStateProcedure table=IntegrationTestLoadCommonCrawl,
region=4c53f22c3fe52af4a882a5f7936aa658, ASSIGN
2021-04-30 19:11:45,090 INFO [PEWorker-72]
procedure.MasterProcedureScheduler: Took xlock for pid=82, ppid=79,
state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE;
TransitRegionStateProcedure table=IntegrationTestLoadCommonCrawl,
region=2b607331c1f37f7efa21ae6c2281c89e, ASSIGN
2021-04-30 19:11:45,091 INFO [PEWorker-71]
assignment.TransitRegionStateProcedure: Starting pid=83, ppid=79,
state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, locked=true;
TransitRegionStateProcedure table=IntegrationTestLoadCommonCrawl,
region=4c53f22c3fe52af4a882a5f7936aa658, ASSIGN; state=SPLITTING_NEW,
location=ip-172-31-63-65.us-west-2.compute.internal,8120,1619809716877;
forceNewPlan=false, retain=false
2021-04-30 19:11:45,091 INFO [PEWorker-72]
assignment.TransitRegionStateProcedure: Starting pid=82, ppid=79,
state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, locked=true;
TransitRegionStateProcedure table=IntegrationTestLoadCommonCrawl,
region=2b607331c1f37f7efa21ae6c2281c89e, ASSIGN; state=SPLITTING_NEW,
location=ip-172-31-63-65.us-west-2.compute.internal,8120,1619809716877;
forceNewPlan=false, retain=false
{noformat}
Balancer says it reassigned two regions??
{noformat}
2021-04-30 19:11:45,241 INFO [ip-172-31-58-47:8100]
balancer.BaseLoadBalancer: Reassigned 2 regions. 2 retained the pre-restart
assignment.
{noformat}
Now there are region re-opens:
{noformat}
2021-04-30 19:11:45,241 INFO [PEWorker-73]
assignment.RegionStateStore: pid=83 updating hbase:meta
row=4c53f22c3fe52af4a882a5f7936aa658, regionState=OPENING,
regionLocation=ip-172-31-63-65.us-west-2.compute.internal,8120,1619809716877
2021-04-30 19:11:45,241 INFO [PEWorker-74]
assignment.RegionStateStore: pid=82 updating hbase:meta
row=2b607331c1f37f7efa21ae6c2281c89e, regionState=OPENING,
regionLocation=ip-172-31-63-65.us-west-2.compute.internal,8120,1619809716877
2021-04-30 19:11:45,244 INFO [PEWorker-74]
procedure2.ProcedureExecutor: Initialized subprocedures=[{pid=84, ppid=82,
state=RUNNABLE; OpenRegionProcedure 2b607331c1f37f7efa21ae6c2281c89e,
server=ip-172-31-63-65.us-west-2.compute.internal,8120,1619809716877}]
2021-04-30 19:11:45,244 INFO [PEWorker-73]
procedure2.ProcedureExecutor: Initialized subprocedures=[{pid=85, ppid=83,
state=RUNNABLE; OpenRegionProcedure 4c53f22c3fe52af4a882a5f7936aa658,
server=ip-172-31-63-65.us-west-2.compute.internal,8120,1619809716877}]
2021-04-30 19:11:45,498 INFO [PEWorker-77]
assignment.RegionStateStore: pid=83 updating hbase:meta
row=4c53f22c3fe52af4a882a5f7936aa658, regionState=OPEN, openSeqNum=54041,
regionLocation=ip-172-31-63-65.us-west-2.compute.internal,8120,1619809716877
2021-04-30 19:11:45,502 INFO [PEWorker-81]
assignment.RegionStateStore: pid=82 updating hbase:meta
row=2b607331c1f37f7efa21ae6c2281c89e, regionState=OPEN, openSeqNum=54041,
regionLocation=ip-172-31-63-65.us-west-2.compute.internal,8120,1619809716877
2021-04-30 19:11:45,502 INFO [PEWorker-77]
procedure2.ProcedureExecutor: Finished subprocedure pid=85, resume processing
ppid=83
2021-04-30 19:11:45,502 INFO [PEWorker-77]
procedure2.ProcedureExecutor: Finished pid=85, ppid=83, state=SUCCESS;
OpenRegionProcedure 4c53f22c3fe52af4a882a5f7936aa658,
server=ip-172-31-63-65.us-west-2.compute.internal,8120,1619809716877 in 257 msec
2021-04-30 19:11:45,503 INFO [PEWorker-82]
procedure2.ProcedureExecutor: Finished pid=83, ppid=79, state=SUCCESS;
TransitRegionStateProcedure table=IntegrationTestLoadCommonCrawl,
region=4c53f22c3fe52af4a882a5f7936aa658, ASSIGN in 413 msec
2021-04-30 19:11:45,505 INFO [PEWorker-81]
procedure2.ProcedureExecutor: Finished subprocedure pid=84, resume processing
ppid=82
2021-04-30 19:11:45,505 INFO [PEWorker-81]
procedure2.ProcedureExecutor: Finished pid=84, ppid=82, state=SUCCESS;
OpenRegionProcedure 2b607331c1f37f7efa21ae6c2281c89e,
server=ip-172-31-63-65.us-west-2.compute.internal,8120,1619809716877 in 260 msec
2021-04-30 19:11:45,506 INFO [PEWorker-83]
procedure2.ProcedureExecutor: Finished subprocedure pid=82, resume processing
ppid=79
2021-04-30 19:11:45,506 INFO [PEWorker-83]
procedure2.ProcedureExecutor: Finished pid=82, ppid=79, state=SUCCESS;
TransitRegionStateProcedure table=IntegrationTestLoadCommonCrawl,
region=2b607331c1f37f7efa21ae6c2281c89e, ASSIGN in 416 msec
2021-04-30 19:11:45,507 INFO [PEWorker-79]
procedure2.ProcedureExecutor: Finished pid=79, state=SUCCESS;
SplitTableRegionProcedure table=IntegrationTestLoadCommonCrawl,
parent=22ab65a3e71c3ab7f94c1c47067ad54e,
daughterA=2b607331c1f37f7efa21ae6c2281c89e,
daughterB=4c53f22c3fe52af4a882a5f7936aa658 in 5.6370 sec
{noformat}
Done! Everything is great, right? Nope, now this starts at next balancer
iteration:
{noformat}
2021-04-30 19:11:50,055 INFO [master/ip-172-31-58-47:8100.Chore.1]
assignment.RegionStates: Skipping, no server for state=SPLIT, location=null,
table=IntegrationTestLoadCommonCrawl, region=22ab65a3e71c3ab7f94c1c47067ad54e
{noformat}
was (Author: apurtell):
This is what happens.
I have a nonstandard balancer configuration. It is moving regions during a
split.
hbase.balancer.period=60000
hbase.master.balancer.stochastic.minCostNeedBalance=0.025
hbase.master.balancer.stochastic.regionCountCost=100
hbase.master.balancer.stochastic.localityCost=30
hbase.master.balancer.stochastic.moveCost=10
hbase.master.balancer.stochastic.readRequestCost=10
hbase.master.balancer.stochastic.writeRequestCost=15
I think there is a state desynchronization bug in the master when the balancer
decides to move regions around when a split is in progress.
Split procedure begins:
{noformat}
2021-04-30 19:11:39,874 INFO [PEWorker-65]
procedure.MasterProcedureScheduler: Took xlock for pid=79,
state=RUNNABLE:SPLIT_TABLE_REGION_PREPARE; SplitTableRegionProcedure
table=IntegrationTestLoadCommonCrawl, parent=22ab65a3e71c3ab7f94c1c47067ad54e,
daughterA=2b607331c1f37f7efa21ae6c2281c89e,
daughterB=4c53f22c3fe52af4a882a5f7936aa658
2021-04-30 19:11:39,874 INFO [PEWorker-65]
procedure.MasterProcedureScheduler: Took xlock for pid=79,
state=RUNNABLE:SPLIT_TABLE_REGION_PREPARE; SplitTableRegionProcedure
table=IntegrationTestLoadCommonCrawl, parent=22ab65a3e71c3ab7f94c1c47067ad54e,
daughterA=2b607331c1f37f7efa21ae6c2281c89e,
daughterB=4c53f22c3fe52af4a882a5f7936aa658
2021-04-30 19:11:39,874 INFO [PEWorker-65]
procedure.MasterProcedureScheduler: Took xlock for pid=79,
state=RUNNABLE:SPLIT_TABLE_REGION_PREPARE; SplitTableRegionProcedure
table=IntegrationTestLoadCommonCrawl, parent=22ab65a3e71c3ab7f94c1c47067ad54e,
daughterA=2b607331c1f37f7efa21ae6c2281c89e,
daughterB=4c53f22c3fe52af4a882a5f7936aa658
{noformat}
UNASSIGN parent region (22ab65a3e71c3ab7f94c1c47067ad54e), includes WARN for
failed transition:
{noformat}
2021-04-30 19:11:39,878 INFO [PEWorker-65]
procedure2.ProcedureExecutor: Initialized subprocedures=[{pid=80, ppid=79,
state=RUNNABLE:REGION_STATE_TRANSITION_CLOSE; TransitRegionStateProcedure
table=IntegrationTestLoadCommonCrawl, region=22ab65a3e71c3ab7f94c1c47067ad54e,
UNASSIGN}]
2021-04-30 19:11:39,878 INFO [PEWorker-66]
procedure.MasterProcedureScheduler: Took xlock for pid=80, ppid=79,
state=RUNNABLE:REGION_STATE_TRANSITION_CLOSE; TransitRegionStateProcedure
table=IntegrationTestLoadCommonCrawl, region=22ab65a3e71c3ab7f94c1c47067ad54e,
UNASSIGN
2021-04-30 19:11:39,879 INFO [PEWorker-66]
assignment.RegionStateStore: pid=80 updating hbase:meta
row=22ab65a3e71c3ab7f94c1c47067ad54e, regionState=CLOSING,
regionLocation=ip-172-31-63-65.us-west-2.compute.internal,8120,1619809716877
2021-04-30 19:11:40,032 INFO [PEWorker-66]
procedure2.ProcedureExecutor: Initialized subprocedures=[{pid=81, ppid=80,
state=RUNNABLE; CloseRegionProcedure 22ab65a3e71c3ab7f94c1c47067ad54e,
server=ip-172-31-63-65.us-west-2.compute.internal,8120,1619809716877}]
2021-04-30 19:11:40,303 WARN
[RpcServer.priority.RWQ.Codel.write.handler=0,queue=0,port=8100]
assignment.AssignmentManager: Failed transition
org.apache.hadoop.hbase.client.DoNotRetryRegionException:
22ab65a3e71c3ab7f94c1c47067ad54e is not OPEN; state=CLOSING
at
org.apache.hadoop.hbase.master.assignment.RegionStateNode.checkOnline(RegionStateNode.java:291)
at
org.apache.hadoop.hbase.master.procedure.AbstractStateMachineTableProcedure.checkOnline(AbstractStateMachineTableProcedure.java:194)
at
org.apache.hadoop.hbase.master.assignment.SplitTableRegionProcedure.<init>(SplitTableRegionProcedure.java:112)
at
org.apache.hadoop.hbase.master.assignment.AssignmentManager.createSplitProcedure(AssignmentManager.java:939)
at
org.apache.hadoop.hbase.master.assignment.AssignmentManager.updateRegionSplitTransition(AssignmentManager.java:1122)
at
org.apache.hadoop.hbase.master.assignment.AssignmentManager.reportRegionStateTransition(AssignmentManager.java:984)
at
org.apache.hadoop.hbase.master.assignment.AssignmentManager.reportRegionStateTransition(AssignmentManager.java:1019)
at
org.apache.hadoop.hbase.master.MasterRpcServices.reportRegionStateTransition(MasterRpcServices.java:1727)
at
org.apache.hadoop.hbase.shaded.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:16185)
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:395)
at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133)
at
org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338)
at
org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:318)
2021-04-30 19:11:44,814 INFO [PEWorker-68]
assignment.RegionStateStore: pid=80 updating hbase:meta
row=22ab65a3e71c3ab7f94c1c47067ad54e, regionState=CLOSED
2021-04-30 19:11:44,818 INFO [PEWorker-68]
procedure2.ProcedureExecutor: Finished subprocedure pid=81, resume processing
ppid=80
2021-04-30 19:11:44,818 INFO [PEWorker-68]
procedure2.ProcedureExecutor: Finished pid=81, ppid=80, state=SUCCESS;
CloseRegionProcedure 22ab65a3e71c3ab7f94c1c47067ad54e,
server=ip-172-31-63-65.us-west-2.compute.internal,8120,1619809716877 in 4.7850
sec
2021-04-30 19:11:44,819 INFO [PEWorker-69]
procedure2.ProcedureExecutor: Finished subprocedure pid=80, resume processing
ppid=79
2021-04-30 19:11:44,820 INFO [PEWorker-69]
procedure2.ProcedureExecutor: Finished pid=80, ppid=79, state=SUCCESS;
TransitRegionStateProcedure table=IntegrationTestLoadCommonCrawl,
region=22ab65a3e71c3ab7f94c1c47067ad54e, UNASSIGN in 4.9400 sec
{noformat}
Followed by store splitting:
{noformat}
2021-04-30 19:11:44,837 INFO [PEWorker-70]
assignment.SplitTableRegionProcedure: pid=79 splitting 10 storefiles,
region=22ab65a3e71c3ab7f94c1c47067ad54e, threads=10
{noformat}
Followed by reassignments:
{noformat}
2021-04-30 19:11:45,089 INFO [PEWorker-70]
procedure2.ProcedureExecutor: Initialized subprocedures=[{pid=82, ppid=79,
state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE;
TransitRegionStateProcedure table=IntegrationTestLoadCommonCrawl,
region=2b607331c1f37f7efa21ae6c2281c89e, ASSIGN}, {pid=83, ppid=79,
state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE;
TransitRegionStateProcedure table=IntegrationTestLoadCommonCrawl,
region=4c53f22c3fe52af4a882a5f7936aa658, ASSIGN}]
2021-04-30 19:11:45,090 INFO [PEWorker-71]
procedure.MasterProcedureScheduler: Took xlock for pid=83, ppid=79,
state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE;
TransitRegionStateProcedure table=IntegrationTestLoadCommonCrawl,
region=4c53f22c3fe52af4a882a5f7936aa658, ASSIGN
2021-04-30 19:11:45,090 INFO [PEWorker-72]
procedure.MasterProcedureScheduler: Took xlock for pid=82, ppid=79,
state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE;
TransitRegionStateProcedure table=IntegrationTestLoadCommonCrawl,
region=2b607331c1f37f7efa21ae6c2281c89e, ASSIGN
2021-04-30 19:11:45,091 INFO [PEWorker-71]
assignment.TransitRegionStateProcedure: Starting pid=83, ppid=79,
state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, locked=true;
TransitRegionStateProcedure table=IntegrationTestLoadCommonCrawl,
region=4c53f22c3fe52af4a882a5f7936aa658, ASSIGN; state=SPLITTING_NEW,
location=ip-172-31-63-65.us-west-2.compute.internal,8120,1619809716877;
forceNewPlan=false, retain=false
2021-04-30 19:11:45,091 INFO [PEWorker-72]
assignment.TransitRegionStateProcedure: Starting pid=82, ppid=79,
state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, locked=true;
TransitRegionStateProcedure table=IntegrationTestLoadCommonCrawl,
region=2b607331c1f37f7efa21ae6c2281c89e, ASSIGN; state=SPLITTING_NEW,
location=ip-172-31-63-65.us-west-2.compute.internal,8120,1619809716877;
forceNewPlan=false, retain=false
{noformat}
Balancer says it reassigned two regions??
{noformat}
2021-04-30 19:11:45,241 INFO [ip-172-31-58-47:8100]
balancer.BaseLoadBalancer: Reassigned 2 regions. 2 retained the pre-restart
assignment.
{noformat}
Now there are region re-opens:
{noformat}
2021-04-30 19:11:45,241 INFO [PEWorker-73]
assignment.RegionStateStore: pid=83 updating hbase:meta
row=4c53f22c3fe52af4a882a5f7936aa658, regionState=OPENING,
regionLocation=ip-172-31-63-65.us-west-2.compute.internal,8120,1619809716877
2021-04-30 19:11:45,241 INFO [PEWorker-74]
assignment.RegionStateStore: pid=82 updating hbase:meta
row=2b607331c1f37f7efa21ae6c2281c89e, regionState=OPENING,
regionLocation=ip-172-31-63-65.us-west-2.compute.internal,8120,1619809716877
2021-04-30 19:11:45,244 INFO [PEWorker-74]
procedure2.ProcedureExecutor: Initialized subprocedures=[{pid=84, ppid=82,
state=RUNNABLE; OpenRegionProcedure 2b607331c1f37f7efa21ae6c2281c89e,
server=ip-172-31-63-65.us-west-2.compute.internal,8120,1619809716877}]
2021-04-30 19:11:45,244 INFO [PEWorker-73]
procedure2.ProcedureExecutor: Initialized subprocedures=[{pid=85, ppid=83,
state=RUNNABLE; OpenRegionProcedure 4c53f22c3fe52af4a882a5f7936aa658,
server=ip-172-31-63-65.us-west-2.compute.internal,8120,1619809716877}]
2021-04-30 19:11:45,498 INFO [PEWorker-77]
assignment.RegionStateStore: pid=83 updating hbase:meta
row=4c53f22c3fe52af4a882a5f7936aa658, regionState=OPEN, openSeqNum=54041,
regionLocation=ip-172-31-63-65.us-west-2.compute.internal,8120,1619809716877
2021-04-30 19:11:45,502 INFO [PEWorker-81]
assignment.RegionStateStore: pid=82 updating hbase:meta
row=2b607331c1f37f7efa21ae6c2281c89e, regionState=OPEN, openSeqNum=54041,
regionLocation=ip-172-31-63-65.us-west-2.compute.internal,8120,1619809716877
2021-04-30 19:11:45,502 INFO [PEWorker-77]
procedure2.ProcedureExecutor: Finished subprocedure pid=85, resume processing
ppid=83
2021-04-30 19:11:45,502 INFO [PEWorker-77]
procedure2.ProcedureExecutor: Finished pid=85, ppid=83, state=SUCCESS;
OpenRegionProcedure 4c53f22c3fe52af4a882a5f7936aa658,
server=ip-172-31-63-65.us-west-2.compute.internal,8120,1619809716877 in 257 msec
2021-04-30 19:11:45,503 INFO [PEWorker-82]
procedure2.ProcedureExecutor: Finished pid=83, ppid=79, state=SUCCESS;
TransitRegionStateProcedure table=IntegrationTestLoadCommonCrawl,
region=4c53f22c3fe52af4a882a5f7936aa658, ASSIGN in 413 msec
2021-04-30 19:11:45,505 INFO [PEWorker-81]
procedure2.ProcedureExecutor: Finished subprocedure pid=84, resume processing
ppid=82
2021-04-30 19:11:45,505 INFO [PEWorker-81]
procedure2.ProcedureExecutor: Finished pid=84, ppid=82, state=SUCCESS;
OpenRegionProcedure 2b607331c1f37f7efa21ae6c2281c89e,
server=ip-172-31-63-65.us-west-2.compute.internal,8120,1619809716877 in 260 msec
2021-04-30 19:11:45,506 INFO [PEWorker-83]
procedure2.ProcedureExecutor: Finished subprocedure pid=82, resume processing
ppid=79
2021-04-30 19:11:45,506 INFO [PEWorker-83]
procedure2.ProcedureExecutor: Finished pid=82, ppid=79, state=SUCCESS;
TransitRegionStateProcedure table=IntegrationTestLoadCommonCrawl,
region=2b607331c1f37f7efa21ae6c2281c89e, ASSIGN in 416 msec
2021-04-30 19:11:45,507 INFO [PEWorker-79]
procedure2.ProcedureExecutor: Finished pid=79, state=SUCCESS;
SplitTableRegionProcedure table=IntegrationTestLoadCommonCrawl,
parent=22ab65a3e71c3ab7f94c1c47067ad54e,
daughterA=2b607331c1f37f7efa21ae6c2281c89e,
daughterB=4c53f22c3fe52af4a882a5f7936aa658 in 5.6370 sec
{noformat}
Done! Everything is great, right? Nope, now this starts at next balancer
iteration:
{noformat}
2021-04-30 19:11:50,055 INFO [master/ip-172-31-58-47:8100.Chore.1]
assignment.RegionStates: Skipping, no server for state=SPLIT, location=null,
table=IntegrationTestLoadCommonCrawl, region=22ab65a3e71c3ab7f94c1c47067ad54e
{noformat}
> SPLIT state detritus
> --------------------
>
> Key: HBASE-25829
> URL: https://issues.apache.org/jira/browse/HBASE-25829
> Project: HBase
> Issue Type: Bug
> Affects Versions: 2.4.3
> Reporter: Andrew Kyle Purtell
> Priority: Major
> Fix For: 3.0.0-alpha-1, 2.5.0, 2.4.3
>
>
> Seen after an integration test (see HBASE-25824) with 'calm' monkey, so this
> happened in the happy path.
> There were no errors accessing all loaded table data. The integration test
> writes a log to HDFS of every cell written to HBase and the verify phase uses
> that log to read each value and confirm it. That seems fine:
> {noformat}
> 2021-04-30 02:16:33,316 INFO [main]
> test.IntegrationTestLoadCommonCrawl$Verify: REFERENCED: 154943544
> 2021-04-30 02:16:33,316 INFO [main]
> test.IntegrationTestLoadCommonCrawl$Verify: UNREFERENCED: 0
> 2021-04-30 02:16:33,316 INFO [main]
> test.IntegrationTestLoadCommonCrawl$Verify: CORRUPT: 0
> {noformat}
> However whenever the balancer runs there are a number of concerning INFO
> level log messages printed of the form _assignment.RegionStates: Skipping, no
> server for state=SPLIT, location=null, table=TABLENAME_
> For example:
> {noformat}
> 2021-04-30 02:02:09,286 INFO [master/ip-172-31-58-47:8100.Chore.2]
> assignment.RegionStates: Skipping, no server for state=SPLIT, location=null,
> table=IntegrationTestLoadCommonCrawl, region=087fb2f7847c2fc0a0b85eb30a97036e
> 2021-04-30 02:02:09,286 INFO [master/ip-172-31-58-47:8100.Chore.2]
> assignment.RegionStates: Skipping, no server for state=SPLIT, location=null,
> table=IntegrationTestLoadCommonCrawl, region=0952b94a920454afe9c40becbb7bf205
> 2021-04-30 02:02:09,286 INFO [master/ip-172-31-58-47:8100.Chore.2]
> assignment.RegionStates: Skipping, no server for state=SPLIT, location=null,
> table=IntegrationTestLoadCommonCrawl, region=f87a8b993f7eca2524bf2331b7ee3c06
> 2021-04-30 02:02:09,286 INFO [master/ip-172-31-58-47:8100.Chore.2]
> assignment.RegionStates: Skipping, no server for state=SPLIT, location=null,
> table=IntegrationTestLoadCommonCrawl, region=74bb28864a120decdf0f4956741df745
> 2021-04-30 02:02:09,286 INFO [master/ip-172-31-58-47:8100.Chore.2]
> assignment.RegionStates: Skipping, no server for state=SPLIT, location=null,
> table=IntegrationTestLoadCommonCrawl, region=bc918b609ade0ae4d5530f0467354cae
> 2021-04-30 02:02:09,286 INFO [master/ip-172-31-58-47:8100.Chore.2]
> assignment.RegionStates: Skipping, no server for state=SPLIT, location=null,
> table=IntegrationTestLoadCommonCrawl, region=183a199984539f3917a2f8927fe01572
> 2021-04-30 02:02:09,286 INFO [master/ip-172-31-58-47:8100.Chore.2]
> assignment.RegionStates: Skipping, no server for state=SPLIT, location=null,
> table=IntegrationTestLoadCommonCrawl, region=6cc5ce4fb4adc00445b3ec7dd8760ba8
> {noformat}
> The HBCK chore notices them but does nothing:
> "Loaded *80 regions* from in-memory state of AssignmentManager"
> "Loaded *73 regions from 5 regionservers' reports* and found 0 orphan regions"
> "Loaded 3 tables 80 regions from filesystem and found 0 orphan regions"
> Yes, there are exactly 7 region state records of SPLIT state with
> server=null.
> {noformat}
> 2021-04-30 02:02:09,300 INFO [master/ip-172-31-58-47:8100.Chore.1]
> master.HbckChore: Loaded 80 regions from in-memory state of AssignmentManager
> 2021-04-30 02:02:09,300 INFO [master/ip-172-31-58-47:8100.Chore.1]
> master.HbckChore: Loaded 73 regions from 5 regionservers' reports and found 0
> orphan regions
> 2021-04-30 02:02:09,306 INFO [master/ip-172-31-58-47:8100.Chore.1]
> master.HbckChore: Loaded 3 tables 80 regions from filesystem and found 0
> orphan regions
> {noformat}
> This repeats indefinitely.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)