[ 
https://issues.apache.org/jira/browse/HBASE-21843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16760133#comment-16760133
 ] 

Wellington Chevreuil commented on HBASE-21843:
----------------------------------------------

I looked at the server logs to try to understand it. Here's my conclusion on 
what led to that state:
1) SCP processes the log split before assigning the regions that were on the 
crashed server;
2) While doing the log split, it first renamed the WAL dir to add the 
"-splitting" suffix, then it didn't find any files in that WAL dir and removed 
the dir. At this point, there was no WAL dir for RS1-T1 anymore.
3) SCP continues to SERVER_CRASH_ASSIGN. It all goes well, but just before 
meta is updated with the new RS assignment, HDFS enters safemode, the meta 
update fails, and the whole HBase cluster crashes. Now meta still has the 
original RS1-T1 assigned, but there's no WAL dir for it anymore.
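
To make step 2 concrete, here is a minimal sketch of the rename-then-remove behaviour. It is an illustration only: a local temp directory stands in for the WAL dir on HDFS, and the class, the method names, and the server name are hypothetical, not HBase code.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// Illustration only: a local directory stands in for the WAL dir on HDFS.
// All names here are hypothetical and do not come from HBase.
public class WalSplitSketch {

    /**
     * Renames walDir to "<name>-splitting" and deletes it if it holds no files.
     * Returns true when no directory is left for the server afterwards.
     */
    static boolean splitLogs(Path walDir) {
        try {
            Path splitting = walDir.resolveSibling(walDir.getFileName() + "-splitting");
            Files.move(walDir, splitting);
            boolean empty;
            try (Stream<Path> files = Files.list(splitting)) {
                empty = !files.findAny().isPresent();
            }
            if (empty) {
                // Nothing to split, so the directory itself is removed.
                Files.delete(splitting);
            }
            return empty;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Creates an empty WAL dir for a made-up server name and runs the split step on it. */
    static boolean demo() {
        try {
            Path root = Files.createTempDirectory("wals");
            Path walDir = Files.createDirectory(root.resolve("rs1,16020,1000"));
            boolean removed = splitLogs(walDir);
            boolean anyLeft;
            try (Stream<Path> remaining = Files.list(root)) {
                anyLeft = remaining.findAny().isPresent();
            }
            Files.delete(root); // clean up the temp root
            return removed && !anyLeft;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("WAL dir gone after split: " + demo());
        // prints "WAL dir gone after split: true"
    }
}
```

Once the directory is gone, nothing on the filesystem records that the server ever existed, which is why the failed meta update in step 3 leaves the region pointing at a server nobody can find.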

> AM misses region assignment in catastrophic scenarios where RS assigned to 
> the region in Meta does not have a WAL dir.
> ----------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-21843
>                 URL: https://issues.apache.org/jira/browse/HBASE-21843
>             Project: HBase
>          Issue Type: Bug
>          Components: amv2
>    Affects Versions: 3.0.0, 2.1.0, 2.2.0
>            Reporter: Wellington Chevreuil
>            Assignee: Wellington Chevreuil
>            Priority: Major
>         Attachments: HBASE-21843.master.001.patch
>
>
> A bit unusual, but I managed to hit this twice lately, in both distributed 
> and local standalone mode, on VMs. Somehow, after some VM pause/resume, I got 
> into a situation where regions in meta were assigned to a given RS startcode 
> that had no corresponding WAL dir.
> That caused those regions to never get assigned, because the given RS 
> startcode is not found anywhere by RegionServerTracker/ServerManager, so no 
> SCP is created for this RS startcode, leaving the region "open" on a dead 
> server forever, in META.
> This could be sorted out by adding an extra check in loadMeta: if the RS 
> assigned to the region in meta is not online and doesn't have a WAL dir, then 
> mark this region as offline. 
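
The check proposed in the description could look roughly like the sketch below. It is not the attached patch and does not use HBase classes: the types, method names, and server names are hypothetical stand-ins, with plain sets modelling the online servers and the existing WAL dirs. It also assumes that a dead server which still has a WAL dir is left alone, on the expectation that an SCP will be created for it.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in types; none of these names come from HBase itself.
public class OrphanedRegionCheck {

    /** Region states used by this sketch only. */
    enum State { OPEN, OFFLINE }

    /**
     * True when the server recorded in meta is neither online nor backed by
     * a WAL dir, so nothing would ever schedule an SCP for it.
     */
    static boolean isOrphaned(String server, Set<String> onlineServers, Set<String> walDirs) {
        return !onlineServers.contains(server) && !walDirs.contains(server);
    }

    /** Walks the region-to-server mapping and marks orphaned regions OFFLINE. */
    static Map<String, State> loadMeta(Map<String, String> regionToServer,
                                       Set<String> onlineServers,
                                       Set<String> walDirs) {
        Map<String, State> states = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : regionToServer.entrySet()) {
            boolean orphaned = isOrphaned(entry.getValue(), onlineServers, walDirs);
            states.put(entry.getKey(), orphaned ? State.OFFLINE : State.OPEN);
        }
        return states;
    }

    public static void main(String[] args) {
        Map<String, String> meta = new LinkedHashMap<>();
        meta.put("region-a", "rs1,16020,1000"); // dead server, WAL dir already removed
        meta.put("region-b", "rs2,16020,2000"); // online server
        meta.put("region-c", "rs3,16020,3000"); // dead server, WAL dir still present

        Set<String> online = new HashSet<>(Arrays.asList("rs2,16020,2000"));
        Set<String> walDirs = new HashSet<>(Arrays.asList("rs2,16020,2000", "rs3,16020,3000"));

        System.out.println(loadMeta(meta, online, walDirs));
        // prints {region-a=OFFLINE, region-b=OPEN, region-c=OPEN}
    }
}
```

Only region-a is marked offline here, since it is the one case where neither the online-server list nor the WAL dirs would lead anything to recover it.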



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
