[
https://issues.apache.org/jira/browse/HBASE-26193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17398443#comment-17398443
]
Duo Zhang commented on HBASE-26193:
-----------------------------------
This could solve part of the problem but not all of it. IIRC, there is another
problem: the WAL directory of a region server must still be present even if all
of its data has been flushed and the directory is empty. When the master
starts, it scans the WAL directory to get the region servers and then compares
that list with the region server list on zookeeper to find the dead servers. So
if the WAL file system is also cleaned, the cluster will be in trouble too, as
the master cannot find the dead servers and therefore will not schedule SCPs
(ServerCrashProcedures).
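
The comparison described above amounts to a set difference. A minimal sketch,
assuming each WAL directory is named after its region server and that both
lists are already available as plain strings; the class and method names are
hypothetical and this is not the actual master startup code:

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only; not HBase's real implementation.
final class DeadServerSketch {

    /**
     * Returns the servers that have a WAL directory but are not registered
     * as live on zookeeper. These are the servers that would need an SCP.
     */
    static Set<String> findDeadServers(Set<String> walDirServers, Set<String> liveOnZk) {
        Set<String> dead = new TreeSet<>(walDirServers);
        dead.removeAll(liveOnZk);
        return dead;
    }

    public static void main(String[] args) {
        // Server names here are made-up examples.
        Set<String> live = Set.of("rs1,16020,1");

        // rs2 left a WAL directory behind but is gone from zookeeper.
        System.out.println(findDeadServers(Set.of("rs1,16020,1", "rs2,16020,1"), live));

        // If the WAL file system was cleaned, nothing is detected as dead.
        System.out.println(findDeadServers(Set.of(), live));
    }
}
```

The second call shows the failure mode: with no WAL directories left, the
difference is empty, so no dead server is found and no SCP is scheduled.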
The reason we use the WAL filesystem to get the region server list is that we
need an SCP to bring a region online, and this includes the meta region. So we
cannot rely on scanning the meta region to find the region server list;
otherwise there would be a cyclic dependency. This is also the reason we
removed AssignMetaProcedure: it does not always work. For now, with only one
meta region, it may be possible to do some hacks to find the region server
hosting it, but if we later want to support meta split, things will become
much more difficult.
But anyway, it would be good if we did not need to rely on WAL filesystem
structures when starting up. If anyone has ideas on how to improve this, I'm
happy to help review them, provided the approach actually works.
Thanks.
> Do not store meta region location on zookeeper
> ----------------------------------------------
>
> Key: HBASE-26193
> URL: https://issues.apache.org/jira/browse/HBASE-26193
> Project: HBase
> Issue Type: Improvement
> Components: meta, Zookeeper
> Reporter: Duo Zhang
> Assignee: Duo Zhang
> Priority: Major
>
> Storing the meta region location on zookeeper breaks one of our design rules:
> https://hbase.apache.org/book.html#design.invariants.zk.data
> We used to think hbase should recover automatically when all the data on
> zk (except the replication data) is cleared, but obviously, if you clear the
> meta region location, the cluster will be in trouble, and operation tools
> are needed to recover it.
> So here, along with the ConnectionRegistry improvements, we should also
> consider moving the meta region location off zookeeper.