[
https://issues.apache.org/jira/browse/HBASE-25829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17337628#comment-17337628
]
Andrew Kyle Purtell edited comment on HBASE-25829 at 4/30/21, 9:25 PM:
-----------------------------------------------------------------------
I'm going to try out a patch and see if it helps. In theory it does two
things: it excludes regions in SPLITTING or SPLITTING_NEW from balancer
attempts to mutate assignments, and it skips a number of other uninteresting
states (that is, states in which the region is not actively assigned). That
reduces balancer workload and also avoids unintended consequences.
In RegionStates#getAssignmentsForBalancer, we have checks for isTableDisabled
and isSplitParent. Each excludes some subset of the regions that should not be
candidates for consideration, but why not get to the point and select only the
states we actually want?
{code}
diff --git a/hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/RegionStates.java b/hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/RegionStates.java
index d3553f11a3..64abd4e3ac 100644
--- a/hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/RegionStates.java
+++ b/hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/RegionStates.java
@@ -553,23 +552,25 @@ public class RegionStates {
    * wants to iterate this exported list. We need to synchronize on regions
    * since all access to this.servers is under a lock on this.regions.
    *
-   * @return A clone of current assignments.
+   * @return A clone of current open or opening assignments.
    */
   public Map<TableName, Map<ServerName, List<RegionInfo>>> getAssignmentsForBalancer(
       TableStateManager tableStateManager, List<ServerName> onlineServers) {
     final Map<TableName, Map<ServerName, List<RegionInfo>>> result = new HashMap<>();
     for (RegionStateNode node : regionsMap.values()) {
-      if (isTableDisabled(tableStateManager, node.getTable())) {
-        continue;
-      }
-      if (node.getRegionInfo().isSplitParent()) {
+      // When balancing, we are only interested in OPEN or OPENING regions and expected
+      // to be online at that server until possibly the next balancer iteration or unless
+      // we decide to move it. Other states are not interesting as the region will either
+      // be closing, or splitting/merging, or will not be deployed.
+      if (!(node.isInState(State.OPEN)||node.isInState(State.OPENING))) {
         continue;
       }
       Map<ServerName, List<RegionInfo>> tableResult =
         result.computeIfAbsent(node.getTable(), t -> new HashMap<>());
       final ServerName serverName = node.getRegionLocation();
+      // A region in ONLINE or OPENING state should have a location.
       if (serverName == null) {
-        LOG.info("Skipping, no server for " + node);
+        LOG.warn("Skipping, no server for " + node);
         continue;
       }
       List<RegionInfo> serverResult =
{code}
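To make the intended effect of the filter concrete, here is a minimal
standalone sketch of the same selection logic. It is an illustration only, not
the HBase code: the class BalancerFilterSketch, the Node type, and the
trimmed-down State enum are hypothetical stand-ins, and tables, servers, and
regions are plain strings. As in the patch, only OPEN or OPENING regions are
kept, and a kept region without a location is skipped.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for illustration; these are not the HBase classes.
class BalancerFilterSketch {
  // A trimmed-down set of region states, assumed for this sketch.
  enum State { OFFLINE, OPENING, OPEN, CLOSING, CLOSED, SPLITTING, SPLIT, SPLITTING_NEW }

  static final class Node {
    final String table;
    final String region;
    final State state;
    final String server; // null when the region has no location

    Node(String table, String region, State state, String server) {
      this.table = table;
      this.region = region;
      this.state = state;
      this.server = server;
    }
  }

  // Builds table -> server -> regions, keeping only OPEN or OPENING regions.
  static Map<String, Map<String, List<String>>> assignmentsForBalancer(List<Node> nodes) {
    Map<String, Map<String, List<String>>> result = new HashMap<>();
    for (Node node : nodes) {
      // Every other state (SPLIT, SPLITTING, CLOSING, ...) is skipped up front.
      if (node.state != State.OPEN && node.state != State.OPENING) {
        continue;
      }
      Map<String, List<String>> tableResult =
          result.computeIfAbsent(node.table, t -> new HashMap<>());
      // An OPEN or OPENING region is expected to have a location; skip if not.
      if (node.server == null) {
        continue;
      }
      tableResult.computeIfAbsent(node.server, s -> new ArrayList<>()).add(node.region);
    }
    return result;
  }

  public static void main(String[] args) {
    List<Node> nodes = List.of(
        new Node("t1", "r1", State.OPEN, "rs1"),
        new Node("t1", "r2", State.SPLIT, null), // like the SPLIT parents in the logs
        new Node("t1", "r3", State.OPENING, "rs2"),
        new Node("t1", "r4", State.SPLITTING, "rs1"));
    System.out.println(assignmentsForBalancer(nodes));
  }
}
```

With this shape, a SPLIT region with location=null never reaches the null
check, so the "Skipping, no server" message would only appear for a region
that is supposed to be deployed.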
> SPLIT state detritus
> --------------------
>
> Key: HBASE-25829
> URL: https://issues.apache.org/jira/browse/HBASE-25829
> Project: HBase
> Issue Type: Bug
> Affects Versions: 2.4.3
> Reporter: Andrew Kyle Purtell
> Priority: Major
> Fix For: 3.0.0-alpha-1, 2.5.0, 2.4.3
>
>
> Seen after an integration test (see HBASE-25824) with 'calm' monkey, so this
> happened in the happy path.
> There were no errors accessing all loaded table data. The integration test
> writes a log to HDFS of every cell written to HBase and the verify phase uses
> that log to read each value and confirm it. That seems fine:
> {noformat}
> 2021-04-30 02:16:33,316 INFO [main]
> test.IntegrationTestLoadCommonCrawl$Verify: REFERENCED: 154943544
> 2021-04-30 02:16:33,316 INFO [main]
> test.IntegrationTestLoadCommonCrawl$Verify: UNREFERENCED: 0
> 2021-04-30 02:16:33,316 INFO [main]
> test.IntegrationTestLoadCommonCrawl$Verify: CORRUPT: 0
> {noformat}
> However, whenever the balancer runs, it prints a number of concerning INFO
> level log messages of the form _assignment.RegionStates: Skipping, no
> server for state=SPLIT, location=null, table=TABLENAME_
> For example:
> {noformat}
> 2021-04-30 02:02:09,286 INFO [master/ip-172-31-58-47:8100.Chore.2]
> assignment.RegionStates: Skipping, no server for state=SPLIT, location=null,
> table=IntegrationTestLoadCommonCrawl, region=087fb2f7847c2fc0a0b85eb30a97036e
> 2021-04-30 02:02:09,286 INFO [master/ip-172-31-58-47:8100.Chore.2]
> assignment.RegionStates: Skipping, no server for state=SPLIT, location=null,
> table=IntegrationTestLoadCommonCrawl, region=0952b94a920454afe9c40becbb7bf205
> 2021-04-30 02:02:09,286 INFO [master/ip-172-31-58-47:8100.Chore.2]
> assignment.RegionStates: Skipping, no server for state=SPLIT, location=null,
> table=IntegrationTestLoadCommonCrawl, region=f87a8b993f7eca2524bf2331b7ee3c06
> 2021-04-30 02:02:09,286 INFO [master/ip-172-31-58-47:8100.Chore.2]
> assignment.RegionStates: Skipping, no server for state=SPLIT, location=null,
> table=IntegrationTestLoadCommonCrawl, region=74bb28864a120decdf0f4956741df745
> 2021-04-30 02:02:09,286 INFO [master/ip-172-31-58-47:8100.Chore.2]
> assignment.RegionStates: Skipping, no server for state=SPLIT, location=null,
> table=IntegrationTestLoadCommonCrawl, region=bc918b609ade0ae4d5530f0467354cae
> 2021-04-30 02:02:09,286 INFO [master/ip-172-31-58-47:8100.Chore.2]
> assignment.RegionStates: Skipping, no server for state=SPLIT, location=null,
> table=IntegrationTestLoadCommonCrawl, region=183a199984539f3917a2f8927fe01572
> 2021-04-30 02:02:09,286 INFO [master/ip-172-31-58-47:8100.Chore.2]
> assignment.RegionStates: Skipping, no server for state=SPLIT, location=null,
> table=IntegrationTestLoadCommonCrawl, region=6cc5ce4fb4adc00445b3ec7dd8760ba8
> {noformat}
> The HBCK chore notices them but does nothing:
> "Loaded *80 regions* from in-memory state of AssignmentManager"
> "Loaded *73 regions from 5 regionservers' reports* and found 0 orphan regions"
> "Loaded 3 tables 80 regions from filesystem and found 0 orphan regions"
> Yes, 80 - 73 = 7: there are exactly 7 region state records in SPLIT state
> with no server (location=null).
> {noformat}
> 2021-04-30 02:02:09,300 INFO [master/ip-172-31-58-47:8100.Chore.1]
> master.HbckChore: Loaded 80 regions from in-memory state of AssignmentManager
> 2021-04-30 02:02:09,300 INFO [master/ip-172-31-58-47:8100.Chore.1]
> master.HbckChore: Loaded 73 regions from 5 regionservers' reports and found 0
> orphan regions
> 2021-04-30 02:02:09,306 INFO [master/ip-172-31-58-47:8100.Chore.1]
> master.HbckChore: Loaded 3 tables 80 regions from filesystem and found 0
> orphan regions
> {noformat}
> This repeats indefinitely.