[ 
https://issues.apache.org/jira/browse/HBASE-25829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17338650#comment-17338650
 ] 

Andrew Kyle Purtell edited comment on HBASE-25829 at 5/4/21, 5:08 PM:
----------------------------------------------------------------------

Subtasks look good. Back to the main issue.
{noformat}
2021-05-03 20:30:29,964 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
master.HbckChore: Loaded 184 regions from in-memory state of AssignmentManager
2021-05-03 20:30:29,964 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
master.HbckChore: Loaded 133 regions from 5 regionservers' reports and found 0 
orphan regions
2021-05-03 20:30:29,975 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
master.HbckChore: Loaded 3 tables 184 regions from filesystem and found 0 
orphan regions
{noformat}
The 51 extra regions (184 in memory minus 133 reported by regionservers) are SPLIT parents, with server = null.
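
A minimal sketch of that arithmetic, assuming the HbckChore summary lines keep the wording shown above. The helper name {{unhosted_region_count}} and the regular expressions are illustrative assumptions, not part of HBase:

```python
import re

# Sample text copied from the HbckChore output quoted above (email wrapping kept).
LOG = """\
2021-05-03 20:30:29,964 INFO  [master/ip-172-31-58-47:8100.Chore.2]
master.HbckChore: Loaded 184 regions from in-memory state of AssignmentManager
2021-05-03 20:30:29,964 INFO  [master/ip-172-31-58-47:8100.Chore.2]
master.HbckChore: Loaded 133 regions from 5 regionservers' reports and found 0
orphan regions
"""

# \s+ between words tolerates the line wrapping introduced by the mail client.
IN_MEMORY = re.compile(r"Loaded\s+(\d+)\s+regions\s+from\s+in-memory\s+state")
REPORTED = re.compile(
    r"Loaded\s+(\d+)\s+regions\s+from\s+\d+\s+regionservers'\s+reports"
)


def unhosted_region_count(log_text: str) -> int:
    """Return the in-memory region count minus the regionserver-reported count."""
    in_memory = IN_MEMORY.search(log_text)
    reported = REPORTED.search(log_text)
    if in_memory is None or reported is None:
        raise ValueError("expected HbckChore summary lines not found")
    return int(in_memory.group(1)) - int(reported.group(1))


print(unhosted_region_count(LOG))  # 184 - 133
```

The same subtraction applied to the earlier run in the issue description (80 in memory, 73 reported) gives the 7 records noted there.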

I notice in AssignmentManager#markRegionAsMerged we remove the merge parents 
from {{regionStates}} right there, but in AssignmentManager#markRegionAsSplit 
we do not. We have code in various places that accounts for a post-split parent 
to be hanging out in {{regionStates}} in SPLIT state. CatalogJanitor is 
supposed to clean it.
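
To make the asymmetry concrete, here is a toy model in Python. It is not HBase code: {{ToyRegionStates}} and its methods are hypothetical stand-ins for the in-memory map, and the janitor step removes every SPLIT entry unconditionally, ignoring whatever conditions the real CatalogJanitor checks before cleaning a parent.

```python
from enum import Enum


class State(Enum):
    OPEN = "OPEN"
    SPLIT = "SPLIT"


class ToyRegionStates:
    """Hypothetical stand-in for the master's in-memory region map; not HBase code."""

    def __init__(self):
        self.regions = {}  # region name -> (State, hosting server or None)

    def open(self, name, server):
        self.regions[name] = (State.OPEN, server)

    def mark_merged(self, child, parents, server):
        # Merge path: the parents are dropped from the map right away.
        for parent in parents:
            del self.regions[parent]
        self.open(child, server)

    def mark_split(self, parent, daughters, server):
        # Split path: the parent stays in the map, in SPLIT state with no server.
        self.regions[parent] = (State.SPLIT, None)
        for daughter in daughters:
            self.open(daughter, server)

    def split_parents(self):
        return [n for n, (state, _) in self.regions.items() if state is State.SPLIT]

    def janitor_clean(self):
        # Assumed cleanup pass: removes all SPLIT entries (a simplification).
        stale = self.split_parents()
        for name in stale:
            del self.regions[name]
        return stale


rs = ToyRegionStates()
rs.open("a", "rs1")
rs.open("b", "rs1")
rs.mark_merged("ab", ["a", "b"], "rs1")   # parents a and b are gone immediately
rs.mark_split("ab", ["x", "y"], "rs2")    # parent ab lingers in SPLIT state
print(len(rs.regions), rs.split_parents())
```

In this model the map holds three entries after the split while only two regions are hosted, which is the kind of gap the HbckChore counts show until a cleanup pass runs.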

[Edit: Removed some distraction.]

Remains to be seen if CatalogJanitor will clean the regions. Will update.


was (Author: apurtell):
Subtasks look good. Back to the main issue.
{noformat}
2021-05-03 20:30:29,964 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
master.HbckChore: Loaded 184 regions from in-memory state of AssignmentManager
2021-05-03 20:30:29,964 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
master.HbckChore: Loaded 133 regions from 5 regionservers' reports and found 0 
orphan regions
2021-05-03 20:30:29,975 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
master.HbckChore: Loaded 3 tables 184 regions from filesystem and found 0 
orphan regions
{noformat}
The 51 extra regions are SPLIT parents, with server = null.

I notice in AssignmentManager#markRegionAsMerged we remove the merge parents 
from {{regionStates}} right there, but in AssignmentManager#markRegionAsSplit 
we do not. We have code in various places that accounts for a post-split parent 
to be hanging out in {{regionStates}} in SPLIT state. CatalogJanitor is 
supposed to clean it.

[Edit: Removed some distraction.]

Remains to be seen if CatalogJanitor will clean the regions. Probably. In the 
latest tests, isRIT usually evaluates to true (it's an ingest test after all), 
and with HBASE-25840 applied, all of the cleanup is deferred. Will update and 
see if region GC proceeds as expected.

> SPLIT state detritus
> --------------------
>
>                 Key: HBASE-25829
>                 URL: https://issues.apache.org/jira/browse/HBASE-25829
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 2.4.3
>            Reporter: Andrew Kyle Purtell
>            Assignee: Andrew Kyle Purtell
>            Priority: Major
>             Fix For: 3.0.0-alpha-1, 2.5.0, 2.4.3
>
>
> Seen after an integration test (see HBASE-25824) with 'calm' monkey, so this 
> happened in the happy path.
> There were no errors accessing all loaded table data. The integration test 
> writes a log to HDFS of every cell written to HBase and the verify phase uses 
> that log to read each value and confirm it. That seems fine:
> {noformat}
> 2021-04-30 02:16:33,316 INFO  [main] 
> test.IntegrationTestLoadCommonCrawl$Verify: REFERENCED: 154943544
> 2021-04-30 02:16:33,316 INFO  [main] 
> test.IntegrationTestLoadCommonCrawl$Verify: UNREFERENCED: 0
> 2021-04-30 02:16:33,316 INFO  [main] 
> test.IntegrationTestLoadCommonCrawl$Verify: CORRUPT: 0
> {noformat}
> However, whenever the balancer runs, a number of concerning INFO-level log 
> messages are printed, of the form _assignment.RegionStates: Skipping, no 
> server for state=SPLIT, location=null, table=TABLENAME_ 
> For example:
> {noformat}
> 2021-04-30 02:02:09,286 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
> assignment.RegionStates: Skipping, no server for state=SPLIT, location=null, 
> table=IntegrationTestLoadCommonCrawl, region=087fb2f7847c2fc0a0b85eb30a97036e
> 2021-04-30 02:02:09,286 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
> assignment.RegionStates: Skipping, no server for state=SPLIT, location=null, 
> table=IntegrationTestLoadCommonCrawl, region=0952b94a920454afe9c40becbb7bf205
> 2021-04-30 02:02:09,286 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
> assignment.RegionStates: Skipping, no server for state=SPLIT, location=null, 
> table=IntegrationTestLoadCommonCrawl, region=f87a8b993f7eca2524bf2331b7ee3c06
> 2021-04-30 02:02:09,286 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
> assignment.RegionStates: Skipping, no server for state=SPLIT, location=null, 
> table=IntegrationTestLoadCommonCrawl, region=74bb28864a120decdf0f4956741df745
> 2021-04-30 02:02:09,286 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
> assignment.RegionStates: Skipping, no server for state=SPLIT, location=null, 
> table=IntegrationTestLoadCommonCrawl, region=bc918b609ade0ae4d5530f0467354cae
> 2021-04-30 02:02:09,286 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
> assignment.RegionStates: Skipping, no server for state=SPLIT, location=null, 
> table=IntegrationTestLoadCommonCrawl, region=183a199984539f3917a2f8927fe01572
> 2021-04-30 02:02:09,286 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
> assignment.RegionStates: Skipping, no server for state=SPLIT, location=null, 
> table=IntegrationTestLoadCommonCrawl, region=6cc5ce4fb4adc00445b3ec7dd8760ba8
> {noformat}
> The HBCK chore notices them but does nothing:
> "Loaded *80 regions* from in-memory state of AssignmentManager"
> "Loaded *73 regions from 5 regionservers' reports* and found 0 orphan regions"
> "Loaded 3 tables 80 regions from filesystem and found 0 orphan regions"
> Yes, there are exactly 7 region state records of SPLIT state with 
> server=null. 
> {noformat}
> 2021-04-30 02:02:09,300 INFO  [master/ip-172-31-58-47:8100.Chore.1] 
> master.HbckChore: Loaded 80 regions from in-memory state of AssignmentManager
> 2021-04-30 02:02:09,300 INFO  [master/ip-172-31-58-47:8100.Chore.1] 
> master.HbckChore: Loaded 73 regions from 5 regionservers' reports and found 0 
> orphan regions
> 2021-04-30 02:02:09,306 INFO  [master/ip-172-31-58-47:8100.Chore.1] 
> master.HbckChore: Loaded 3 tables 80 regions from filesystem and found 0 
> orphan regions
> {noformat}
> This repeats indefinitely. 
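
A rough sketch for pulling the affected regions out of a saved master log, assuming the "Skipping, no server" message keeps the shape quoted in the description. The function name {{skipped_regions}} and the regular expression are illustrative assumptions, and the sample reuses two of the quoted entries:

```python
import re

# Two entries copied from the quoted description, including the "> " prefix
# and the line wrapping added by the mail client.
QUOTED_LOG = """\
> 2021-04-30 02:02:09,286 INFO  [master/ip-172-31-58-47:8100.Chore.2]
> assignment.RegionStates: Skipping, no server for state=SPLIT, location=null,
> table=IntegrationTestLoadCommonCrawl, region=087fb2f7847c2fc0a0b85eb30a97036e
> 2021-04-30 02:02:09,286 INFO  [master/ip-172-31-58-47:8100.Chore.2]
> assignment.RegionStates: Skipping, no server for state=SPLIT, location=null,
> table=IntegrationTestLoadCommonCrawl, region=0952b94a920454afe9c40becbb7bf205
"""

SKIP = re.compile(
    r"Skipping,\s+no\s+server\s+for\s+state=(\w+),\s+location=(\w+),\s+"
    r"table=(\S+?),\s+region=([0-9a-f]+)"
)


def skipped_regions(log_text):
    """Return (state, location, table, region) tuples for each skip message."""
    # Drop a leading "> " quote marker so wrapped lines join cleanly.
    unquoted = re.sub(r"^> ?", "", log_text, flags=re.MULTILINE)
    return [match.groups() for match in SKIP.finditer(unquoted)]


for state, location, table, region in skipped_regions(QUOTED_LOG):
    print(state, location, table, region)
```

Counting the distinct region names returned for one balancer run is a quick way to cross-check against the difference in the HbckChore totals.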


