The issue aims to solve a problem with redeploying HBase clusters in the cloud.

I cannot find the issue, but IIRC the AWS folks said they took the
following steps while redeploying a customer's HBase cluster:

1. Disable writes to the cluster and flush all data to disk (which is
actually S3); see the sketch after this list
2. Recreate the cluster with a set of new machines, plus a new ZooKeeper
ensemble and a new HDFS (for writing WALs)
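
For step 1, a minimal sketch of the flush part using the public HBase
client Admin API could look like this (blocking writes first is assumed
to happen outside HBase, e.g. at the application layer, and is not
shown):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class FlushAllTables {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Admin admin = conn.getAdmin()) {
      // Flush every table's memstores so all data is persisted to the
      // store files (on S3 in this setup) before the old cluster is
      // torn down.
      for (TableName table : admin.listTableNames()) {
        admin.flush(table);
      }
    }
  }
}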

Then the new cluster just hung there and no regions were online.

This is because at HMaster startup we rely on scanning the WAL directory
on HDFS to get the list of previously live region servers. We compare
that list with the list stored on ZooKeeper to find the dead region
servers and schedule SCPs (ServerCrashProcedures) for them, and it is
these SCPs that bring the regions back online.
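
In rough pseudocode, the startup logic is as follows (a hypothetical
sketch, not the actual master code; all method names are invented):

import java.util.Set;

import org.apache.hadoop.hbase.ServerName;

public abstract class DeadServerDetectionSketch {

  // Stand-ins for the real master internals.
  protected abstract Set<ServerName> serversFromWalDirectory();
  protected abstract Set<ServerName> liveServersOnZooKeeper();
  protected abstract void scheduleServerCrashProcedure(ServerName server);

  public void detectDeadServers() {
    // Region servers that were alive before the restart, recovered by
    // listing the per-server WAL directories on HDFS.
    Set<ServerName> previousLive = serversFromWalDirectory();
    // Region servers that are alive right now, per ZooKeeper.
    Set<ServerName> currentLive = liveServersOnZooKeeper();

    for (ServerName server : previousLive) {
      if (!currentLive.contains(server)) {
        // This server died across the restart: its SCP replays its WAL
        // and re-assigns its regions, bringing them back online.
        scheduleServerCrashProcedure(server);
      }
    }
    // If the WAL directory has been wiped, previousLive is empty, so
    // no SCP is scheduled and no region ever comes online.
  }
}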

The problem with the redeploy described above is that the WAL directory
has also been wiped, so we cannot recover the previously live region
servers, and therefore no SCPs are scheduled.

This is a bit annoying, as we have already flushed all the data out, so
it should be safe to delete all the WAL data.

The idea in HBASE-26245 is to also store a copy of the live region
server list in the master local region. When restarting, we can then
load the previously live region servers from the master local region as
well, instead of relying only on the WAL directory. This solves the
problem with the redeploy described above.
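
As a hypothetical sketch of that idea (names are invented; the PR is
the authoritative version and will differ in details):

import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.hbase.ServerName;

public abstract class LiveServerTrackingSketch {

  // Stand-ins for reading/writing the master local region and for the
  // existing WAL directory scan.
  protected abstract void persistLiveServers(Set<ServerName> servers);
  protected abstract Set<ServerName> loadLiveServersFromMasterLocalRegion();
  protected abstract Set<ServerName> serversFromWalDirectory();

  // Whenever the set of live region servers changes, write the current
  // list into the master local region as well.
  public void onLiveServersChanged(Set<ServerName> currentLive) {
    persistLiveServers(currentLive);
  }

  // On restart, take the union of both sources, so a wiped WAL
  // directory alone can no longer hide previously live servers.
  public Set<ServerName> previousLiveServers() {
    Set<ServerName> servers = new HashSet<>(loadLiveServersFromMasterLocalRegion());
    servers.addAll(serversFromWalDirectory());
    return servers;
  }
}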

The PR is also ready.

https://github.com/apache/hbase/pull/4136

Suggestions and reviews are always welcome.

Thanks.
