[ 
https://issues.apache.org/jira/browse/HBASE-21843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16760627#comment-16760627
 ] 

Wellington Chevreuil commented on HBASE-21843:
----------------------------------------------

Hey [~stack], [~Apache9],

bq. The SCP doesn't get rerun? The step w/ meta assign doesn't get rerun 
because it hasn't completed yet?

Actually, the SCP is marked as successful just before the crash. So if the SCP 
succeeded, I guess the meta update should have been completed, but on the cluster 
restart, meta was pointing again to the older startcode. I'm not sure how, 
maybe WAL replay for meta didn't go well here? 

bq. Is this the same with HBASE-21844? The HDFS is gone, and when it comes 
back, we have ‘lost’ several procedures.

I guess these are a bit different. Here, meta does come online, but some 
regions are still marked as 'open' on an RS with an old startcode, for which 
there's no pending SCP and no WAL dir anymore, so these regions are never 
assigned again.

> AM misses region assignment in catastrophic scenarios where RS assigned to 
> the region in Meta does not have a WAL dir.
> ----------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-21843
>                 URL: https://issues.apache.org/jira/browse/HBASE-21843
>             Project: HBase
>          Issue Type: Bug
>          Components: amv2
>    Affects Versions: 3.0.0, 2.1.0, 2.2.0
>            Reporter: Wellington Chevreuil
>            Assignee: Wellington Chevreuil
>            Priority: Major
>         Attachments: HBASE-21843.master.001.patch
>
>
> A bit unusual, but I managed to hit this twice lately, in both distributed and 
> local standalone mode, on VMs. Somehow, after some VM pause/resume, I got into 
> a situation where regions in meta were assigned to a given RS startcode that 
> had no corresponding WAL dir.
> That caused those regions to never get assigned, because the given RS 
> startcode is not found anywhere by RegionServerTracker/ServerManager, so no 
> SCP is created for this RS startcode, leaving the region "open" on a dead 
> server forever, in META.
> This could be sorted out by adding an extra check on loadMeta: if the RS 
> assigned to the region in meta is not online and doesn't have a WAL dir, then 
> mark this region as offline. 
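
To make the proposed check concrete, here is a rough sketch in plain Java. It is not the attached patch and uses no HBase classes: the class name, the method name, and the use of plain strings, maps and sets are all assumptions for illustration. Servers are written in the "host,port,startcode" form, and the sets of online servers and of servers that still have a WAL dir are simply passed in.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, for illustration only; not an HBase class.
final class StaleMetaAssignmentCheck {

    /**
     * Returns the regions whose location in meta points at a server that is
     * neither online nor has a WAL dir. No SCP would be scheduled for such a
     * server, so these are the regions to mark as offline.
     */
    static List<String> findOrphanedRegions(Map<String, String> metaLocations,
                                            Set<String> onlineServers,
                                            Set<String> serversWithWalDir) {
        List<String> orphaned = new ArrayList<>();
        for (Map.Entry<String, String> entry : metaLocations.entrySet()) {
            String server = entry.getValue();
            boolean online = onlineServers.contains(server);
            boolean hasWalDir = serversWithWalDir.contains(server);
            if (!online && !hasWalDir) {
                orphaned.add(entry.getKey());
            }
        }
        Collections.sort(orphaned); // deterministic order for callers
        return orphaned;
    }

    public static void main(String[] args) {
        // Made-up region and server names.
        Map<String, String> meta = new HashMap<>();
        meta.put("region-a", "rs1,16020,100"); // server is online
        meta.put("region-b", "rs2,16020,50");  // dead, but its WAL dir exists
        meta.put("region-c", "rs1,16020,90");  // old startcode, no WAL dir

        Set<String> online = new HashSet<>();
        online.add("rs1,16020,100");

        Set<String> withWalDir = new HashSet<>();
        withWalDir.add("rs1,16020,100");
        withWalDir.add("rs2,16020,50");

        System.out.println(findOrphanedRegions(meta, online, withWalDir));
    }
}
```

In this sketch only region-c is reported: region-b also sits on a dead server, but that server still has a WAL dir, so it is left out of the result and left for an SCP to handle.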



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
