[ 
https://issues.apache.org/jira/browse/HBASE-25829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17337690#comment-17337690
 ] 

Andrew Kyle Purtell edited comment on HBASE-25829 at 4/30/21, 11:41 PM:
------------------------------------------------------------------------

This addresses the issue of multiple split request transaction submissions.

{code:java}
diff --git a/hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/AssignmentManager.java b/hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/AssignmentManager.java
index 107330d90b..48cc26086f 100644
--- a/hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/AssignmentManager.java
+++ b/hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/AssignmentManager.java
@@ -1119,7 +1119,13 @@ public class AssignmentManager {
       LOG.debug("Split request from " + serverName +
           ", parent=" + parent + " splitKey=" + Bytes.toStringBinary(splitKey));
     }
-    master.getMasterProcedureExecutor().submitProcedure(createSplitProcedure(parent, splitKey));
+    if (regionStates.getRegionState(parent).isOpened() &&
+          !regionStates.getRegionState(parent).isSplitting()) {
+      master.getMasterProcedureExecutor().submitProcedure(createSplitProcedure(parent, splitKey));
+    } else {
+      LOG.warn("Ignoring split request from " + serverName +
+        ", parent=" + parent + " because parent is already splitting or not online");
+    }
 
     // If the RS is < 2.0 throw an exception to abort the operation, we are handling the split
     if (master.getServerManager().getVersionNumber(serverName) < 0x0200000) {
{code}

Is this the complete fix, though? 

With this patch in place, the master log now shows only:

{noformat}
2021-04-30 23:22:14,971 WARN  [RpcServer.priority.RWQ.Codel.write.handler=1,queue=0,port=8100] assignment.AssignmentManager: Ignoring split request from ip-172-31-63-83.us-west-2.compute.internal,8120,1619824775800, parent={ENCODED => df7aa0e0af5a2b757ad86f2cf051fcbb, NAME => 'IntegrationTestLoadCommonCrawl,,1619824793285.df7aa0e0af5a2b757ad86f2cf051fcbb.', STARTKEY => '', ENDKEY => ''} because parent is already splitting or not online
{noformat}

but the first submission of the split request is already in progress and 
completes just fine. 
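
To make the intended behavior concrete, here is a minimal, self-contained sketch of the guard: the first split request for an open parent is accepted and every later request for the same parent is ignored with a warning. This is not HBase code. The class {{SplitGuardSketch}}, its {{State}} enum, and the string region and server names are all hypothetical stand-ins. Unlike the patch, the sketch also performs the OPEN to SPLITTING transition itself, in one atomic step, to stand in for what the submitted split procedure would do.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for the master's region state tracking; not HBase code. */
final class SplitGuardSketch {
  enum State { OPEN, SPLITTING, SPLIT, CLOSED }

  private final Map<String, State> regionStates = new ConcurrentHashMap<>();
  private final AtomicInteger submitted = new AtomicInteger();

  void setState(String region, State state) {
    regionStates.put(region, state);
  }

  int submittedCount() {
    return submitted.get();
  }

  /** Returns true if a split was "submitted", false if the request was ignored. */
  boolean reportSplitRequest(String serverName, String parent) {
    // Assumption: check and transition are folded into one atomic replace.
    // It succeeds only when the parent is currently mapped to OPEN.
    if (regionStates.replace(parent, State.OPEN, State.SPLITTING)) {
      submitted.incrementAndGet();
      return true;
    }
    System.out.println("Ignoring split request from " + serverName
        + ", parent=" + parent + " because parent is already splitting or not online");
    return false;
  }

  public static void main(String[] args) {
    SplitGuardSketch master = new SplitGuardSketch();
    master.setState("parentRegion", State.OPEN);
    // The same request reported twice: only the first is submitted.
    System.out.println(master.reportSplitRequest("rs1", "parentRegion"));
    System.out.println(master.reportSplitRequest("rs1", "parentRegion"));
  }
}
```

The sketch only shows how duplicates get dropped on the master side. It says nothing about why the regionserver reports the request more than once.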


was (Author: apurtell):
This addresses the issue of multiple split request transaction submissions.

{code:java}
diff --git a/hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/AssignmentManager.java b/hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/AssignmentManager.java
index 107330d90b..48cc26086f 100644
--- a/hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/AssignmentManager.java
+++ b/hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/AssignmentManager.java
@@ -1119,7 +1119,13 @@ public class AssignmentManager {
       LOG.debug("Split request from " + serverName +
           ", parent=" + parent + " splitKey=" + Bytes.toStringBinary(splitKey));
     }
-    master.getMasterProcedureExecutor().submitProcedure(createSplitProcedure(parent, splitKey));
+    if (regionStates.getRegionState(parent).isOpened() &&
+          !regionStates.getRegionState(parent).isSplitting()) {
+      master.getMasterProcedureExecutor().submitProcedure(createSplitProcedure(parent, splitKey));
+    } else {
+      LOG.warn("Ignoring split request from " + serverName +
+        ", parent=" + parent + " because parent is already splitting or not online");
+    }
 
     // If the RS is < 2.0 throw an exception to abort the operation, we are handling the split
     if (master.getServerManager().getVersionNumber(serverName) < 0x0200000) {
{code}

Is this the complete fix, though? Should the RS be submitting this report more 
than once? 

With this patch in place, the master log now shows only:

{noformat}
2021-04-30 23:22:14,971 WARN  [RpcServer.priority.RWQ.Codel.write.handler=1,queue=0,port=8100] assignment.AssignmentManager: Ignoring split request from ip-172-31-63-83.us-west-2.compute.internal,8120,1619824775800, parent={ENCODED => df7aa0e0af5a2b757ad86f2cf051fcbb, NAME => 'IntegrationTestLoadCommonCrawl,,1619824793285.df7aa0e0af5a2b757ad86f2cf051fcbb.', STARTKEY => '', ENDKEY => ''} because parent is already splitting or not online
{noformat}

but the first submission of the split request is already in progress and 
completes just fine. 

> SPLIT state detritus
> --------------------
>
>                 Key: HBASE-25829
>                 URL: https://issues.apache.org/jira/browse/HBASE-25829
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 2.4.3
>            Reporter: Andrew Kyle Purtell
>            Priority: Major
>             Fix For: 3.0.0-alpha-1, 2.5.0, 2.4.3
>
>
> Seen after an integration test (see HBASE-25824) with 'calm' monkey, so this 
> happened in the happy path.
> There were no errors accessing all loaded table data. The integration test 
> writes a log to HDFS of every cell written to HBase and the verify phase uses 
> that log to read each value and confirm it. That seems fine:
> {noformat}
> 2021-04-30 02:16:33,316 INFO  [main] test.IntegrationTestLoadCommonCrawl$Verify: REFERENCED: 154943544
> 2021-04-30 02:16:33,316 INFO  [main] test.IntegrationTestLoadCommonCrawl$Verify: UNREFERENCED: 0
> 2021-04-30 02:16:33,316 INFO  [main] test.IntegrationTestLoadCommonCrawl$Verify: CORRUPT: 0
> {noformat}
> However, whenever the balancer runs, a number of concerning INFO-level log 
> messages are printed, of the form _assignment.RegionStates: Skipping, no 
> server for state=SPLIT, location=null, table=TABLENAME_.
> For example:
> {noformat}
> 2021-04-30 02:02:09,286 INFO  [master/ip-172-31-58-47:8100.Chore.2] assignment.RegionStates: Skipping, no server for state=SPLIT, location=null, table=IntegrationTestLoadCommonCrawl, region=087fb2f7847c2fc0a0b85eb30a97036e
> 2021-04-30 02:02:09,286 INFO  [master/ip-172-31-58-47:8100.Chore.2] assignment.RegionStates: Skipping, no server for state=SPLIT, location=null, table=IntegrationTestLoadCommonCrawl, region=0952b94a920454afe9c40becbb7bf205
> 2021-04-30 02:02:09,286 INFO  [master/ip-172-31-58-47:8100.Chore.2] assignment.RegionStates: Skipping, no server for state=SPLIT, location=null, table=IntegrationTestLoadCommonCrawl, region=f87a8b993f7eca2524bf2331b7ee3c06
> 2021-04-30 02:02:09,286 INFO  [master/ip-172-31-58-47:8100.Chore.2] assignment.RegionStates: Skipping, no server for state=SPLIT, location=null, table=IntegrationTestLoadCommonCrawl, region=74bb28864a120decdf0f4956741df745
> 2021-04-30 02:02:09,286 INFO  [master/ip-172-31-58-47:8100.Chore.2] assignment.RegionStates: Skipping, no server for state=SPLIT, location=null, table=IntegrationTestLoadCommonCrawl, region=bc918b609ade0ae4d5530f0467354cae
> 2021-04-30 02:02:09,286 INFO  [master/ip-172-31-58-47:8100.Chore.2] assignment.RegionStates: Skipping, no server for state=SPLIT, location=null, table=IntegrationTestLoadCommonCrawl, region=183a199984539f3917a2f8927fe01572
> 2021-04-30 02:02:09,286 INFO  [master/ip-172-31-58-47:8100.Chore.2] assignment.RegionStates: Skipping, no server for state=SPLIT, location=null, table=IntegrationTestLoadCommonCrawl, region=6cc5ce4fb4adc00445b3ec7dd8760ba8
> {noformat}
> The HBCK chore notices them but does nothing:
> "Loaded *80 regions* from in-memory state of AssignmentManager"
> "Loaded *73 regions from 5 regionservers' reports* and found 0 orphan regions"
> "Loaded 3 tables 80 regions from filesystem and found 0 orphan regions"
> Yes, there are exactly 7 region state records of SPLIT state with 
> server=null. 
> {noformat}
> 2021-04-30 02:02:09,300 INFO  [master/ip-172-31-58-47:8100.Chore.1] master.HbckChore: Loaded 80 regions from in-memory state of AssignmentManager
> 2021-04-30 02:02:09,300 INFO  [master/ip-172-31-58-47:8100.Chore.1] master.HbckChore: Loaded 73 regions from 5 regionservers' reports and found 0 orphan regions
> 2021-04-30 02:02:09,306 INFO  [master/ip-172-31-58-47:8100.Chore.1] master.HbckChore: Loaded 3 tables 80 regions from filesystem and found 0 orphan regions
> {noformat}
> This repeats indefinitely. 
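
The counts in the quoted description reconcile as 80 regions in the AssignmentManager's memory minus 73 reported by regionservers, which leaves the 7 SPLIT records with no server. A minimal sketch of that filter follows, assuming a hypothetical {{RegionRecord}} holder with plain string fields; none of these names come from HBase.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of finding SPLIT-state records with no server; not HBase code. */
final class SplitDetritusSketch {
  /** Hypothetical in-memory region state record. */
  static final class RegionRecord {
    final String encodedName;
    final String state;   // e.g. "OPEN" or "SPLIT"
    final String server;  // null when no location is recorded

    RegionRecord(String encodedName, String state, String server) {
      this.encodedName = encodedName;
      this.state = state;
      this.server = server;
    }
  }

  /** Returns the encoded names of records in SPLIT state whose server is null. */
  static List<String> splitWithoutServer(List<RegionRecord> records) {
    List<String> result = new ArrayList<>();
    for (RegionRecord r : records) {
      if ("SPLIT".equals(r.state) && r.server == null) {
        result.add(r.encodedName);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    List<RegionRecord> records = new ArrayList<>();
    records.add(new RegionRecord("regionA", "OPEN", "rs1"));
    records.add(new RegionRecord("regionB", "SPLIT", null));
    System.out.println(splitWithoutServer(records));
  }
}
```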



--
This message was sent by Atlassian Jira
(v8.3.4#803005)