[
https://issues.apache.org/jira/browse/HBASE-25829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17337591#comment-17337591
]
Andrew Kyle Purtell edited comment on HBASE-25829 at 4/30/21, 8:00 PM:
-----------------------------------------------------------------------
This warning repeats frequently and involves the split parents that accumulate
as bad state:
{noformat}
2021-04-30 19:35:01,115 INFO [PEWorker-27]
procedure.MasterProcedureScheduler: Took xlock for pid=313, ppid=312,
state=RUNNABLE:REGION_STATE_TRANSITION_CLOSE; TransitRegionStateProcedure
table=IntegrationTestLoadCommonCrawl, region=c9076fb746a0690900130cd5d38f1815,
UNASSIGN
2021-04-30 19:35:01,116 INFO [PEWorker-27]
assignment.RegionStateStore: pid=313 updating hbase:meta
row=c9076fb746a0690900130cd5d38f1815, regionState=CLOSING,
regionLocation=ip-172-31-49-8.us-west-2.compute.internal,8120,1619809723832
2021-04-30 19:35:01,116 INFO [PEWorker-27]
procedure2.ProcedureExecutor: Initialized subprocedures=[{pid=314, ppid=313,
state=RUNNABLE; CloseRegionProcedure c9076fb746a0690900130cd5d38f1815,
server=ip-172-31-49-8.us-west-2.compute.internal,8120,1619809723832}]
2021-04-30 19:35:05,546 WARN
[RpcServer.priority.RWQ.Codel.write.handler=0,queue=0,port=8100]
assignment.AssignmentManager: Failed transition
org.apache.hadoop.hbase.client.DoNotRetryRegionException:
c9076fb746a0690900130cd5d38f1815 is not OPEN; state=CLOSING
at
org.apache.hadoop.hbase.master.assignment.RegionStateNode.checkOnline(RegionStateNode.java:291)
at
org.apache.hadoop.hbase.master.procedure.AbstractStateMachineTableProcedure.checkOnline(AbstractStateMachineTableProcedure.java:194)
at
org.apache.hadoop.hbase.master.assignment.SplitTableRegionProcedure.<init>(SplitTableRegionProcedure.java:112)
at
org.apache.hadoop.hbase.master.assignment.AssignmentManager.createSplitProcedure(AssignmentManager.java:939)
at
org.apache.hadoop.hbase.master.assignment.AssignmentManager.updateRegionSplitTransition(AssignmentManager.java:1122)
at
org.apache.hadoop.hbase.master.assignment.AssignmentManager.reportRegionStateTransition(AssignmentManager.java:984)
at
org.apache.hadoop.hbase.master.assignment.AssignmentManager.reportRegionStateTransition(AssignmentManager.java:1019)
at
org.apache.hadoop.hbase.master.MasterRpcServices.reportRegionStateTransition(MasterRpcServices.java:1727)
at
org.apache.hadoop.hbase.shaded.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:16185)
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:395)
at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133)
at
org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338)
at
org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:318)
{noformat}
The "Skipping, no server for state=SPLIT" messages accumulate for the affected
split parents.
Not every split is affected.
In the master logs I see balancer and split activity co-occurring most of the
time. One may be interefering with the other.
was (Author: apurtell):
This warning happens at every split:
{noformat}
2021-04-30 19:35:01,115 INFO [PEWorker-27]
procedure.MasterProcedureScheduler: Took xlock for pid=313, ppid=312,
state=RUNNABLE:REGION_STATE_TRANSITION_CLOSE; TransitRegionStateProcedure
table=IntegrationTestLoadCommonCrawl, region=c9076fb746a0690900130cd5d38f1815,
UNASSIGN
2021-04-30 19:35:01,116 INFO [PEWorker-27]
assignment.RegionStateStore: pid=313 updating hbase:meta
row=c9076fb746a0690900130cd5d38f1815, regionState=CLOSING,
regionLocation=ip-172-31-49-8.us-west-2.compute.internal,8120,1619809723832
2021-04-30 19:35:01,116 INFO [PEWorker-27]
procedure2.ProcedureExecutor: Initialized subprocedures=[{pid=314, ppid=313,
state=RUNNABLE; CloseRegionProcedure c9076fb746a0690900130cd5d38f1815,
server=ip-172-31-49-8.us-west-2.compute.internal,8120,1619809723832}]
2021-04-30 19:35:05,546 WARN
[RpcServer.priority.RWQ.Codel.write.handler=0,queue=0,port=8100]
assignment.AssignmentManager: Failed transition
org.apache.hadoop.hbase.client.DoNotRetryRegionException:
c9076fb746a0690900130cd5d38f1815 is not OPEN; state=CLOSING
at
org.apache.hadoop.hbase.master.assignment.RegionStateNode.checkOnline(RegionStateNode.java:291)
at
org.apache.hadoop.hbase.master.procedure.AbstractStateMachineTableProcedure.checkOnline(AbstractStateMachineTableProcedure.java:194)
at
org.apache.hadoop.hbase.master.assignment.SplitTableRegionProcedure.<init>(SplitTableRegionProcedure.java:112)
at
org.apache.hadoop.hbase.master.assignment.AssignmentManager.createSplitProcedure(AssignmentManager.java:939)
at
org.apache.hadoop.hbase.master.assignment.AssignmentManager.updateRegionSplitTransition(AssignmentManager.java:1122)
at
org.apache.hadoop.hbase.master.assignment.AssignmentManager.reportRegionStateTransition(AssignmentManager.java:984)
at
org.apache.hadoop.hbase.master.assignment.AssignmentManager.reportRegionStateTransition(AssignmentManager.java:1019)
at
org.apache.hadoop.hbase.master.MasterRpcServices.reportRegionStateTransition(MasterRpcServices.java:1727)
at
org.apache.hadoop.hbase.shaded.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:16185)
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:395)
at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133)
at
org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338)
at
org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:318)
{noformat}
And the "Skipping, no server for state=SPLIT" messages accumulate for every
split parent. If there are 10 splits in a test run, then every balancer
iteration we will log that line for 10 regions (the split parents).
> SPLIT state detritus
> --------------------
>
> Key: HBASE-25829
> URL: https://issues.apache.org/jira/browse/HBASE-25829
> Project: HBase
> Issue Type: Bug
> Affects Versions: 2.4.3
> Reporter: Andrew Kyle Purtell
> Priority: Major
> Fix For: 3.0.0-alpha-1, 2.5.0, 2.4.3
>
>
> Seen after an integration test (see HBASE-25824) with 'calm' monkey, so this
> happened in the happy path.
> There were no errors accessing all loaded table data. The integration test
> writes a log to HDFS of every cell written to HBase and the verify phase uses
> that log to read each value and confirm it. That seems fine:
> {noformat}
> 2021-04-30 02:16:33,316 INFO [main]
> test.IntegrationTestLoadCommonCrawl$Verify: REFERENCED: 154943544
> 2021-04-30 02:16:33,316 INFO [main]
> test.IntegrationTestLoadCommonCrawl$Verify: UNREFERENCED: 0
> 2021-04-30 02:16:33,316 INFO [main]
> test.IntegrationTestLoadCommonCrawl$Verify: CORRUPT: 0
> {noformat}
> However whenever the balancer runs there are a number of concerning INFO
> level log messages printed of the form _assignment.RegionStates: Skipping, no
> server for state=SPLIT, location=null, table=TABLENAME_
> For example:
> {noformat}
> 2021-04-30 02:02:09,286 INFO [master/ip-172-31-58-47:8100.Chore.2]
> assignment.RegionStates: Skipping, no server for state=SPLIT, location=null,
> table=IntegrationTestLoadCommonCrawl, region=087fb2f7847c2fc0a0b85eb30a97036e
> 2021-04-30 02:02:09,286 INFO [master/ip-172-31-58-47:8100.Chore.2]
> assignment.RegionStates: Skipping, no server for state=SPLIT, location=null,
> table=IntegrationTestLoadCommonCrawl, region=0952b94a920454afe9c40becbb7bf205
> 2021-04-30 02:02:09,286 INFO [master/ip-172-31-58-47:8100.Chore.2]
> assignment.RegionStates: Skipping, no server for state=SPLIT, location=null,
> table=IntegrationTestLoadCommonCrawl, region=f87a8b993f7eca2524bf2331b7ee3c06
> 2021-04-30 02:02:09,286 INFO [master/ip-172-31-58-47:8100.Chore.2]
> assignment.RegionStates: Skipping, no server for state=SPLIT, location=null,
> table=IntegrationTestLoadCommonCrawl, region=74bb28864a120decdf0f4956741df745
> 2021-04-30 02:02:09,286 INFO [master/ip-172-31-58-47:8100.Chore.2]
> assignment.RegionStates: Skipping, no server for state=SPLIT, location=null,
> table=IntegrationTestLoadCommonCrawl, region=bc918b609ade0ae4d5530f0467354cae
> 2021-04-30 02:02:09,286 INFO [master/ip-172-31-58-47:8100.Chore.2]
> assignment.RegionStates: Skipping, no server for state=SPLIT, location=null,
> table=IntegrationTestLoadCommonCrawl, region=183a199984539f3917a2f8927fe01572
> 2021-04-30 02:02:09,286 INFO [master/ip-172-31-58-47:8100.Chore.2]
> assignment.RegionStates: Skipping, no server for state=SPLIT, location=null,
> table=IntegrationTestLoadCommonCrawl, region=6cc5ce4fb4adc00445b3ec7dd8760ba8
> {noformat}
> The HBCK chore notices them but does nothing:
> "Loaded *80 regions* from in-memory state of AssignmentManager"
> "Loaded *73 regions from 5 regionservers' reports* and found 0 orphan regions"
> "Loaded 3 tables 80 regions from filesystem and found 0 orphan regions"
> Yes, there are exactly 7 region state records of SPLIT state with
> server=null.
> {noformat}
> 2021-04-30 02:02:09,300 INFO [master/ip-172-31-58-47:8100.Chore.1]
> master.HbckChore: Loaded 80 regions from in-memory state of AssignmentManager
> 2021-04-30 02:02:09,300 INFO [master/ip-172-31-58-47:8100.Chore.1]
> master.HbckChore: Loaded 73 regions from 5 regionservers' reports and found 0
> orphan regions
> 2021-04-30 02:02:09,306 INFO [master/ip-172-31-58-47:8100.Chore.1]
> master.HbckChore: Loaded 3 tables 80 regions from filesystem and found 0
> orphan regions
> {noformat}
> This repeats indefinitely.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)