[
https://issues.apache.org/jira/browse/HBASE-25829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17337591#comment-17337591
]
Andrew Kyle Purtell edited comment on HBASE-25829 at 4/30/21, 8:04 PM:
-----------------------------------------------------------------------
This warning repeats frequently and involves the split parents that accumulate
as bad state:
{noformat}
2021-04-30 19:35:01,115 INFO [PEWorker-27]
procedure.MasterProcedureScheduler: Took xlock for pid=313, ppid=312,
state=RUNNABLE:REGION_STATE_TRANSITION_CLOSE; TransitRegionStateProcedure
table=IntegrationTestLoadCommonCrawl, region=c9076fb746a0690900130cd5d38f1815,
UNASSIGN
2021-04-30 19:35:01,116 INFO [PEWorker-27]
assignment.RegionStateStore: pid=313 updating hbase:meta
row=c9076fb746a0690900130cd5d38f1815, regionState=CLOSING,
regionLocation=ip-172-31-49-8.us-west-2.compute.internal,8120,1619809723832
2021-04-30 19:35:01,116 INFO [PEWorker-27]
procedure2.ProcedureExecutor: Initialized subprocedures=[{pid=314, ppid=313,
state=RUNNABLE; CloseRegionProcedure c9076fb746a0690900130cd5d38f1815,
server=ip-172-31-49-8.us-west-2.compute.internal,8120,1619809723832}]
2021-04-30 19:35:05,546 WARN
[RpcServer.priority.RWQ.Codel.write.handler=0,queue=0,port=8100]
assignment.AssignmentManager: Failed transition
org.apache.hadoop.hbase.client.DoNotRetryRegionException:
c9076fb746a0690900130cd5d38f1815 is not OPEN; state=CLOSING
at
org.apache.hadoop.hbase.master.assignment.RegionStateNode.checkOnline(RegionStateNode.java:291)
at
org.apache.hadoop.hbase.master.procedure.AbstractStateMachineTableProcedure.checkOnline(AbstractStateMachineTableProcedure.java:194)
at
org.apache.hadoop.hbase.master.assignment.SplitTableRegionProcedure.<init>(SplitTableRegionProcedure.java:112)
at
org.apache.hadoop.hbase.master.assignment.AssignmentManager.createSplitProcedure(AssignmentManager.java:939)
at
org.apache.hadoop.hbase.master.assignment.AssignmentManager.updateRegionSplitTransition(AssignmentManager.java:1122)
at
org.apache.hadoop.hbase.master.assignment.AssignmentManager.reportRegionStateTransition(AssignmentManager.java:984)
at
org.apache.hadoop.hbase.master.assignment.AssignmentManager.reportRegionStateTransition(AssignmentManager.java:1019)
at
org.apache.hadoop.hbase.master.MasterRpcServices.reportRegionStateTransition(MasterRpcServices.java:1727)
at
org.apache.hadoop.hbase.shaded.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:16185)
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:395)
at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133)
at
org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338)
at
org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:318)
{noformat}
The "Skipping, no server for state=SPLIT" messages accumulate for the affected
split parents.
Not every split is affected.
In the master logs I see balancer and split activity co-occurring most of the
time. One may be interefering with the other. We seem to get into trouble only
when balancer activity and split procedure log lines are interlaced in the
master log.
was (Author: apurtell):
This warning repeats frequently and involves the split parents that accumulate
as bad state:
{noformat}
2021-04-30 19:35:01,115 INFO [PEWorker-27]
procedure.MasterProcedureScheduler: Took xlock for pid=313, ppid=312,
state=RUNNABLE:REGION_STATE_TRANSITION_CLOSE; TransitRegionStateProcedure
table=IntegrationTestLoadCommonCrawl, region=c9076fb746a0690900130cd5d38f1815,
UNASSIGN
2021-04-30 19:35:01,116 INFO [PEWorker-27]
assignment.RegionStateStore: pid=313 updating hbase:meta
row=c9076fb746a0690900130cd5d38f1815, regionState=CLOSING,
regionLocation=ip-172-31-49-8.us-west-2.compute.internal,8120,1619809723832
2021-04-30 19:35:01,116 INFO [PEWorker-27]
procedure2.ProcedureExecutor: Initialized subprocedures=[{pid=314, ppid=313,
state=RUNNABLE; CloseRegionProcedure c9076fb746a0690900130cd5d38f1815,
server=ip-172-31-49-8.us-west-2.compute.internal,8120,1619809723832}]
2021-04-30 19:35:05,546 WARN
[RpcServer.priority.RWQ.Codel.write.handler=0,queue=0,port=8100]
assignment.AssignmentManager: Failed transition
org.apache.hadoop.hbase.client.DoNotRetryRegionException:
c9076fb746a0690900130cd5d38f1815 is not OPEN; state=CLOSING
at
org.apache.hadoop.hbase.master.assignment.RegionStateNode.checkOnline(RegionStateNode.java:291)
at
org.apache.hadoop.hbase.master.procedure.AbstractStateMachineTableProcedure.checkOnline(AbstractStateMachineTableProcedure.java:194)
at
org.apache.hadoop.hbase.master.assignment.SplitTableRegionProcedure.<init>(SplitTableRegionProcedure.java:112)
at
org.apache.hadoop.hbase.master.assignment.AssignmentManager.createSplitProcedure(AssignmentManager.java:939)
at
org.apache.hadoop.hbase.master.assignment.AssignmentManager.updateRegionSplitTransition(AssignmentManager.java:1122)
at
org.apache.hadoop.hbase.master.assignment.AssignmentManager.reportRegionStateTransition(AssignmentManager.java:984)
at
org.apache.hadoop.hbase.master.assignment.AssignmentManager.reportRegionStateTransition(AssignmentManager.java:1019)
at
org.apache.hadoop.hbase.master.MasterRpcServices.reportRegionStateTransition(MasterRpcServices.java:1727)
at
org.apache.hadoop.hbase.shaded.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:16185)
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:395)
at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133)
at
org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338)
at
org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:318)
{noformat}
The "Skipping, no server for state=SPLIT" messages accumulate for the affected
split parents.
Not every split is affected.
In the master logs I see balancer and split activity co-occurring most of the
time. One may be interefering with the other.
> SPLIT state detritus
> --------------------
>
> Key: HBASE-25829
> URL: https://issues.apache.org/jira/browse/HBASE-25829
> Project: HBase
> Issue Type: Bug
> Affects Versions: 2.4.3
> Reporter: Andrew Kyle Purtell
> Priority: Major
> Fix For: 3.0.0-alpha-1, 2.5.0, 2.4.3
>
>
> Seen after an integration test (see HBASE-25824) with 'calm' monkey, so this
> happened in the happy path.
> There were no errors accessing all loaded table data. The integration test
> writes a log to HDFS of every cell written to HBase and the verify phase uses
> that log to read each value and confirm it. That seems fine:
> {noformat}
> 2021-04-30 02:16:33,316 INFO [main]
> test.IntegrationTestLoadCommonCrawl$Verify: REFERENCED: 154943544
> 2021-04-30 02:16:33,316 INFO [main]
> test.IntegrationTestLoadCommonCrawl$Verify: UNREFERENCED: 0
> 2021-04-30 02:16:33,316 INFO [main]
> test.IntegrationTestLoadCommonCrawl$Verify: CORRUPT: 0
> {noformat}
> However whenever the balancer runs there are a number of concerning INFO
> level log messages printed of the form _assignment.RegionStates: Skipping, no
> server for state=SPLIT, location=null, table=TABLENAME_
> For example:
> {noformat}
> 2021-04-30 02:02:09,286 INFO [master/ip-172-31-58-47:8100.Chore.2]
> assignment.RegionStates: Skipping, no server for state=SPLIT, location=null,
> table=IntegrationTestLoadCommonCrawl, region=087fb2f7847c2fc0a0b85eb30a97036e
> 2021-04-30 02:02:09,286 INFO [master/ip-172-31-58-47:8100.Chore.2]
> assignment.RegionStates: Skipping, no server for state=SPLIT, location=null,
> table=IntegrationTestLoadCommonCrawl, region=0952b94a920454afe9c40becbb7bf205
> 2021-04-30 02:02:09,286 INFO [master/ip-172-31-58-47:8100.Chore.2]
> assignment.RegionStates: Skipping, no server for state=SPLIT, location=null,
> table=IntegrationTestLoadCommonCrawl, region=f87a8b993f7eca2524bf2331b7ee3c06
> 2021-04-30 02:02:09,286 INFO [master/ip-172-31-58-47:8100.Chore.2]
> assignment.RegionStates: Skipping, no server for state=SPLIT, location=null,
> table=IntegrationTestLoadCommonCrawl, region=74bb28864a120decdf0f4956741df745
> 2021-04-30 02:02:09,286 INFO [master/ip-172-31-58-47:8100.Chore.2]
> assignment.RegionStates: Skipping, no server for state=SPLIT, location=null,
> table=IntegrationTestLoadCommonCrawl, region=bc918b609ade0ae4d5530f0467354cae
> 2021-04-30 02:02:09,286 INFO [master/ip-172-31-58-47:8100.Chore.2]
> assignment.RegionStates: Skipping, no server for state=SPLIT, location=null,
> table=IntegrationTestLoadCommonCrawl, region=183a199984539f3917a2f8927fe01572
> 2021-04-30 02:02:09,286 INFO [master/ip-172-31-58-47:8100.Chore.2]
> assignment.RegionStates: Skipping, no server for state=SPLIT, location=null,
> table=IntegrationTestLoadCommonCrawl, region=6cc5ce4fb4adc00445b3ec7dd8760ba8
> {noformat}
> The HBCK chore notices them but does nothing:
> "Loaded *80 regions* from in-memory state of AssignmentManager"
> "Loaded *73 regions from 5 regionservers' reports* and found 0 orphan regions"
> "Loaded 3 tables 80 regions from filesystem and found 0 orphan regions"
> Yes, there are exactly 7 region state records of SPLIT state with
> server=null.
> {noformat}
> 2021-04-30 02:02:09,300 INFO [master/ip-172-31-58-47:8100.Chore.1]
> master.HbckChore: Loaded 80 regions from in-memory state of AssignmentManager
> 2021-04-30 02:02:09,300 INFO [master/ip-172-31-58-47:8100.Chore.1]
> master.HbckChore: Loaded 73 regions from 5 regionservers' reports and found 0
> orphan regions
> 2021-04-30 02:02:09,306 INFO [master/ip-172-31-58-47:8100.Chore.1]
> master.HbckChore: Loaded 3 tables 80 regions from filesystem and found 0
> orphan regions
> {noformat}
> This repeats indefinitely.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)