[ 
https://issues.apache.org/jira/browse/HBASE-21843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16761350#comment-16761350
 ] 

Duo Zhang commented on HBASE-21843:
-----------------------------------

Wait a minute

{noformat}
2019-02-01 13:59:52,304 INFO org.apache.hadoop.hbase.wal.WALSplitter: Processed 
37 edits across 1 regions; edits skipped=3; log 
file=hdfs://wellington-1.example.com:8020/hbase/WALs/wellington-4.example.com,22101,1548981048686-splitting/wellington-4.example.com%2C22101%2C1548981048686.wellington-4.example.com%2C22101%2C1548981048686.meta.regiongroup-0.1548981183234,
 length=24420, corrupted=false, progress failed=false
{noformat}

The file name looks really strange. I believe it is the meta WAL file?

And when splitting WALs, we use this filter to pick out the meta WALs first 
so we can bring meta online first.
{code}
  public static boolean isMetaFile(String p) {
    if (p != null && p.endsWith(META_WAL_PROVIDER_ID)) {
      return true;
    }
    return false;
  }
{code}

Obviously the file name above does not end with META_WAL_PROVIDER_ID, so this 
method does not recognize it as a meta WAL and it is left out of the meta WAL 
split. So we just make meta online without actually writing recovered.edits 
for it. This is why the newest updates to meta are gone.
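
To make the mismatch concrete, here is a minimal standalone sketch of the same suffix check applied to the file name from the log above. The class name MetaWalFilterDemo and the second, well-formed file name are hypothetical and only for illustration, and the ".meta" value for META_WAL_PROVIDER_ID is my assumption about the constant, not something shown in this thread.

```java
// Standalone sketch, not HBase code: reproduces the suffix check quoted above.
final class MetaWalFilterDemo {
  // Assumption: the provider id suffix used for meta WALs.
  static final String META_WAL_PROVIDER_ID = ".meta";

  // Same logic as the quoted isMetaFile: true only if the name ENDS with the suffix.
  static boolean isMetaFile(String p) {
    return p != null && p.endsWith(META_WAL_PROVIDER_ID);
  }

  public static void main(String[] args) {
    // File name taken from the log line in this comment (last path component only).
    String fromLog = "wellington-4.example.com%2C22101%2C1548981048686."
        + "wellington-4.example.com%2C22101%2C1548981048686"
        + ".meta.regiongroup-0.1548981183234";
    // Hypothetical name that does end with the suffix, for comparison.
    String endsWithMeta = "wellington-4.example.com%2C22101%2C1548981048686"
        + ".meta.1548981183234.meta";

    System.out.println("from log:       " + isMetaFile(fromLog));
    System.out.println("ends with meta: " + isMetaFile(endsWithMeta));
  }
}
```

The name from the log contains ".meta" in the middle but ends with the timestamp, so a pure endsWith check cannot match it.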




> AM misses region assignment in catastrophic scenarios where RS assigned to 
> the region in Meta does not have a WAL dir.
> ----------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-21843
>                 URL: https://issues.apache.org/jira/browse/HBASE-21843
>             Project: HBase
>          Issue Type: Bug
>          Components: amv2
>    Affects Versions: 3.0.0, 2.1.0, 2.2.0
>            Reporter: Wellington Chevreuil
>            Assignee: Wellington Chevreuil
>            Priority: Major
>         Attachments: HBASE-21843.master.001.patch
>
>
> A bit unusual, but I managed to hit this twice lately, in both distributed and 
> local standalone mode, on VMs. Somehow, after some VM pause/resume, I got into 
> a situation where regions in meta were assigned to a given RS startcode that 
> had no corresponding WAL dir.
> That caused those regions to never get assigned, because the given RS 
> startcode is not found anywhere by RegionServerTracker/ServerManager, so no 
> SCP is created for this RS startcode, leaving the region "open" on a dead 
> server forever, in META.
> This could be sorted out by adding an extra check on loadMeta: if the RS 
> assigned to the region in meta is not online and doesn't have a WAL dir, then 
> mark this region as offline. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
