[ 
https://issues.apache.org/jira/browse/HBASE-25829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17338650#comment-17338650
 ] 

Andrew Kyle Purtell edited comment on HBASE-25829 at 5/3/21, 10:56 PM:
-----------------------------------------------------------------------

Subtasks look good. Back to the main issue.
{noformat}
2021-05-03 20:30:29,964 INFO  [master/ip-172-31-58-47:8100.Chore.2] master.HbckChore: Loaded 184 regions from in-memory state of AssignmentManager
2021-05-03 20:30:29,964 INFO  [master/ip-172-31-58-47:8100.Chore.2] master.HbckChore: Loaded 133 regions from 5 regionservers' reports and found 0 orphan regions
2021-05-03 20:30:29,975 INFO  [master/ip-172-31-58-47:8100.Chore.2] master.HbckChore: Loaded 3 tables 184 regions from filesystem and found 0 orphan regions
{noformat}
The 51 extra regions (184 in memory minus 133 reported by regionservers) are 
SPLIT parents, with server = null.
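
The same subtraction can be scripted against the HbckChore output. This is a minimal sketch that assumes only the message wording shown in the log excerpt above; the {{HbckCountDiff}} class and its methods are hypothetical helpers, not part of HBase.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of HBase: extracts region counts from
// HbckChore log messages and reports how many in-memory regions were not
// reported by any regionserver.
class HbckCountDiff {
  private static final Pattern IN_MEMORY =
      Pattern.compile("Loaded (\\d+) regions from in-memory state");
  private static final Pattern REPORTED =
      Pattern.compile("Loaded (\\d+) regions from \\d+ regionservers' reports");

  // Returns the first captured count, or -1 if the pattern does not match.
  static int firstCount(Pattern pattern, String log) {
    Matcher m = pattern.matcher(log);
    return m.find() ? Integer.parseInt(m.group(1)) : -1;
  }

  // In-memory region count minus the count reported by regionservers.
  static int unreportedRegions(String log) {
    int inMemory = firstCount(IN_MEMORY, log);
    int reported = firstCount(REPORTED, log);
    if (inMemory < 0 || reported < 0) {
      throw new IllegalArgumentException("expected HbckChore messages not found");
    }
    return inMemory - reported;
  }

  public static void main(String[] args) {
    String log =
        "master.HbckChore: Loaded 184 regions from in-memory state of AssignmentManager\n"
        + "master.HbckChore: Loaded 133 regions from 5 regionservers' reports and found 0 orphan regions\n";
    System.out.println(unreportedRegions(log)); // prints 51
  }
}
```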

I notice in AssignmentManager#markRegionAsMerged we remove the merge parents 
from {{regionStates}} right there, but in AssignmentManager#markRegionAsSplit 
we do not. We have code in various places that accounts for a post-split 
parent hanging out in {{regionStates}} in SPLIT state. CatalogJanitor is 
supposed to clean these up, but does not!
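
To make the asymmetry concrete, here is a toy model of the bookkeeping. It is not the actual AssignmentManager code: the {{ToyRegionStates}} class, its string-valued map, and the {{removeParentOnSplit}} flag are all hypothetical, and only the method names {{markRegionAsMerged}} and {{markRegionAsSplit}} are borrowed from the discussion (their real signatures differ).

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: a map from encoded region name to a state string stands in
// for the master's in-memory regionStates.
class ToyRegionStates {
  final Map<String, String> regionStates = new HashMap<>();
  // Hypothetical switch: when true, a split drops its parent the way a merge does.
  final boolean removeParentOnSplit;

  ToyRegionStates(boolean removeParentOnSplit) {
    this.removeParentOnSplit = removeParentOnSplit;
  }

  void open(String region) {
    regionStates.put(region, "OPEN");
  }

  // Merge: the parents are dropped from the map immediately.
  void markRegionAsMerged(String child, String... parents) {
    for (String parent : parents) {
      regionStates.remove(parent);
    }
    regionStates.put(child, "OPEN");
  }

  // Split: unless the flag is set, the parent lingers in SPLIT state.
  void markRegionAsSplit(String parent, String daughterA, String daughterB) {
    if (removeParentOnSplit) {
      regionStates.remove(parent);
    } else {
      regionStates.put(parent, "SPLIT");
    }
    regionStates.put(daughterA, "OPEN");
    regionStates.put(daughterB, "OPEN");
  }

  // Number of records currently in the given state.
  long countInState(String state) {
    return regionStates.values().stream().filter(state::equals).count();
  }
}
```

In this model, every split performed with the flag off leaves one extra SPLIT record behind, which is the kind of residue the HbckChore counts show; nothing here models CatalogJanitor's later cleanup.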

If I patch AssignmentManager#markRegionAsSplit to remove the parent from 
{{regionStates}} for splits the same way AssignmentManager#markRegionAsMerged 
does for merges, then things begin to look better:
{noformat}
2021-05-03 22:08:29,036 INFO  [master/ip-172-31-58-47:8100.Chore.1] master.HbckChore: Loaded 23 regions from in-memory state of AssignmentManager
2021-05-03 22:08:29,036 INFO  [master/ip-172-31-58-47:8100.Chore.1] master.HbckChore: Loaded 23 regions from 5 regionservers' reports and found 0 orphan regions
2021-05-03 22:08:29,043 INFO  [master/ip-172-31-58-47:8100.Chore.1] master.HbckChore: Loaded 3 tables 32 regions from filesystem and found 9 orphan regions
{noformat}
No more junk in {{regionStates}} but those 9 split parents are found as orphan 
regions. There is a simple change to HbckChore that should accompany the other 
changes I have under test: The conditional for determining if a region is 
orphan should become {{if (hri == null && 
!splitParentRegions.contains(encodedRegionName) && 
!mergedParentRegions.contains(encodedRegionName))}}. That change is not in 
the build that produced the log above, so you can ignore the non-zero orphan 
count. With the complete change it would be reported as 0.
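
A minimal sketch of how that conditional affects the count, assuming the filesystem scan yields a list of encoded region names and that {{splitParentRegions}} and {{mergedParentRegions}} are sets of encoded names. Only the condition itself is taken from the comment; the {{OrphanCheckSketch}} class, the method signature, and the plain map used for the in-memory lookup are illustrative assumptions, not HbckChore's real structure.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for the orphan check; not HbckChore's actual code.
class OrphanCheckSketch {
  // Counts filesystem regions that have no in-memory record and are not
  // known split or merge parents.
  static int countOrphans(List<String> regionsOnFilesystem,
                          Map<String, Object> inMemoryRegions,
                          Set<String> splitParentRegions,
                          Set<String> mergedParentRegions) {
    int orphans = 0;
    for (String encodedRegionName : regionsOnFilesystem) {
      // hri stands in for the in-memory region record (null when absent).
      Object hri = inMemoryRegions.get(encodedRegionName);
      if (hri == null
          && !splitParentRegions.contains(encodedRegionName)
          && !mergedParentRegions.contains(encodedRegionName)) {
        orphans++;
      }
    }
    return orphans;
  }
}
```

With empty parent sets, a split parent that is on the filesystem but absent from memory counts as an orphan; once its encoded name is in {{splitParentRegions}}, the same input is no longer counted.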

Anyway, there is more to debug. Seems CatalogJanitor or the region GC 
procedures it submits are not completing their work. I have added some debug 
logging to CatalogJanitor to investigate further.



> SPLIT state detritus
> --------------------
>
>                 Key: HBASE-25829
>                 URL: https://issues.apache.org/jira/browse/HBASE-25829
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 2.4.3
>            Reporter: Andrew Kyle Purtell
>            Assignee: Andrew Kyle Purtell
>            Priority: Major
>             Fix For: 3.0.0-alpha-1, 2.5.0, 2.4.3
>
>
> Seen after an integration test (see HBASE-25824) with 'calm' monkey, so this 
> happened in the happy path.
> There were no errors accessing all loaded table data. The integration test 
> writes a log to HDFS of every cell written to HBase and the verify phase uses 
> that log to read each value and confirm it. That seems fine:
> {noformat}
> 2021-04-30 02:16:33,316 INFO  [main] 
> test.IntegrationTestLoadCommonCrawl$Verify: REFERENCED: 154943544
> 2021-04-30 02:16:33,316 INFO  [main] 
> test.IntegrationTestLoadCommonCrawl$Verify: UNREFERENCED: 0
> 2021-04-30 02:16:33,316 INFO  [main] 
> test.IntegrationTestLoadCommonCrawl$Verify: CORRUPT: 0
> {noformat}
> However, whenever the balancer runs, a number of concerning INFO-level log 
> messages are printed of the form _assignment.RegionStates: Skipping, no 
> server for state=SPLIT, location=null, table=TABLENAME_. 
> For example:
> {noformat}
> 2021-04-30 02:02:09,286 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
> assignment.RegionStates: Skipping, no server for state=SPLIT, location=null, 
> table=IntegrationTestLoadCommonCrawl, region=087fb2f7847c2fc0a0b85eb30a97036e
> 2021-04-30 02:02:09,286 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
> assignment.RegionStates: Skipping, no server for state=SPLIT, location=null, 
> table=IntegrationTestLoadCommonCrawl, region=0952b94a920454afe9c40becbb7bf205
> 2021-04-30 02:02:09,286 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
> assignment.RegionStates: Skipping, no server for state=SPLIT, location=null, 
> table=IntegrationTestLoadCommonCrawl, region=f87a8b993f7eca2524bf2331b7ee3c06
> 2021-04-30 02:02:09,286 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
> assignment.RegionStates: Skipping, no server for state=SPLIT, location=null, 
> table=IntegrationTestLoadCommonCrawl, region=74bb28864a120decdf0f4956741df745
> 2021-04-30 02:02:09,286 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
> assignment.RegionStates: Skipping, no server for state=SPLIT, location=null, 
> table=IntegrationTestLoadCommonCrawl, region=bc918b609ade0ae4d5530f0467354cae
> 2021-04-30 02:02:09,286 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
> assignment.RegionStates: Skipping, no server for state=SPLIT, location=null, 
> table=IntegrationTestLoadCommonCrawl, region=183a199984539f3917a2f8927fe01572
> 2021-04-30 02:02:09,286 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
> assignment.RegionStates: Skipping, no server for state=SPLIT, location=null, 
> table=IntegrationTestLoadCommonCrawl, region=6cc5ce4fb4adc00445b3ec7dd8760ba8
> {noformat}
> The HBCK chore notices them but does nothing:
> "Loaded *80 regions* from in-memory state of AssignmentManager"
> "Loaded *73 regions from 5 regionservers' reports* and found 0 orphan regions"
> "Loaded 3 tables 80 regions from filesystem and found 0 orphan regions"
> Yes, there are exactly 7 region state records of SPLIT state with 
> server=null. 
> {noformat}
> 2021-04-30 02:02:09,300 INFO  [master/ip-172-31-58-47:8100.Chore.1] 
> master.HbckChore: Loaded 80 regions from in-memory state of AssignmentManager
> 2021-04-30 02:02:09,300 INFO  [master/ip-172-31-58-47:8100.Chore.1] 
> master.HbckChore: Loaded 73 regions from 5 regionservers' reports and found 0 
> orphan regions
> 2021-04-30 02:02:09,306 INFO  [master/ip-172-31-58-47:8100.Chore.1] 
> master.HbckChore: Loaded 3 tables 80 regions from filesystem and found 0 
> orphan regions
> {noformat}
> This repeats indefinitely. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
