taklwu edited a comment on pull request #2113:
URL: https://github.com/apache/hbase/pull/2113#issuecomment-662590771
Thanks Josh, and honestly I didn't know the logic till now. Here are my
findings for both situations you're concerned about:
#### first case
1. hbase:meta has assigned regions to a set of RegionServers, rs1
2. All hosts of rs1 are shut down and destroyed (i.e. meta still contains
references to them)
3. A new set of RegionServers, rs2, is created, whose hostnames are
completely different from rs1's
4. All MasterProcWALs from the cluster with rs1 are lost.
#### second case
1. I have a healthy cluster (1 master, many RS)
2. I stop the master
3. I kill one RS
3a. I do not restart that RS
4. I restart the master
There are three key parts in the normal system that handle the `region
server has been deleted` case: the MasterProcWALs/MasterRegion, where SCPs for
`DEAD` servers are tracked; the region server names still present in the WAL
directory, which mark `possibly live` servers; and the ZK znodes for online
servers.
If MasterProcWALs/MasterRegion both exist after a cluster restart, then when
`RegionServerTracker` starts, it figures out all online servers; for any
`possibly live` server that has no znode (with the same hostname after
restart?), it marks that server dead and schedules an SCP for it, in addition
to continuing the SCPs for the already-dead servers. That is the normal case.
```
2020-07-22 09:55:24,729 INFO [master/localhost:0:becomeActiveMaster]
master.RegionServerTracker(123): Starting RegionServerTracker; 0 have existing
ServerCrashProcedures, 3 possibly 'live' servers, and 0 'splitting'.
2020-07-22 09:55:24,730 DEBUG [master/localhost:0:becomeActiveMaster]
zookeeper.RecoverableZooKeeper(183): Node
/hbase/draining/localhost,55572,1595436917066 already deleted, retry=false
2020-07-22 09:55:24,730 INFO [master/localhost:0:becomeActiveMaster]
master.ServerManager(585): Processing expiration of
localhost,55572,1595436917066 on localhost,55667,1595436924374
2020-07-22 09:55:24,755 DEBUG [master/localhost:0:becomeActiveMaster]
procedure2.ProcedureExecutor(1050): Stored pid=12,
state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure
localhost,55572,1595436917066, splitWal=true, meta=true
```
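The decision the log above shows can be sketched in a toy form. This is a simplification I wrote for illustration, not the real `RegionServerTracker` code, and the server names and set contents are made up to mirror the log: a server that looks `possibly live` from its WAL directory but has no online znode (and no existing SCP) gets expired and handed an SCP.

```java
import java.util.*;

// Toy sketch (NOT real HBase code): at master startup, expire every
// 'possibly live' server that has no online znode and is not already
// covered by a restored ServerCrashProcedure.
public class TrackerSketch {
    static Set<String> serversToExpire(Set<String> possiblyLive,
                                       Set<String> onlineZnodes,
                                       Set<String> serversWithScp) {
        Set<String> dead = new HashSet<>(possiblyLive);
        dead.removeAll(onlineZnodes);   // no znode => not actually live
        dead.removeAll(serversWithScp); // already tracked by an SCP
        return dead;
    }

    public static void main(String[] args) {
        // Mirrors the log: 3 possibly 'live' servers, 0 existing SCPs,
        // one server crashed while the master was down (names invented).
        Set<String> possiblyLive = new HashSet<>(Arrays.asList(
            "localhost,55572,1", "localhost,55600,1", "localhost,55601,1"));
        Set<String> znodes = new HashSet<>(Arrays.asList(
            "localhost,55600,1", "localhost,55601,1"));
        System.out.println(serversToExpire(possiblyLive, znodes,
            Collections.emptySet())); // -> [localhost,55572,1]
    }
}
```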
Then, in the case where the MasterProcWALs (or the MasterRegion on
branch-2.3+) are deleted but the ZK nodes are kept, even though there are no
procedures to restore from MasterProcWALs, as long as we still have the WAL
directory for the previous host we can schedule an SCP for it. But if both the
MasterProcWALs and the WALs are deleted, neither the first nor the second case
will operate normally.
The case we were originally trying to solve falls exactly into that
situation: after the cluster restarts, the WALs, MasterProcWALs/MasterRegion,
and Zookeeper data are all gone and only HFiles remain, so those servers are in
an unknown state and their regions cannot be reassigned.
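To make that failure mode concrete, here is another toy model, with entirely made-up region and server names, none of it taken from HBase internals: meta still maps a region to an rs1 host, but with the WALs, MasterProcWALs/MasterRegion, and ZK data all deleted there is no record of that host anywhere else, so no SCP is ever scheduled and the region stays stuck.

```java
import java.util.*;

// Toy illustration (assumptions only, not HBase code): a server is "known"
// if it appears in at least one recovery source; a region whose server is
// unknown has nothing to trigger an SCP and cannot be reassigned.
public class StuckRegionsSketch {
    static boolean isKnown(String server, Set<String> znodes,
                           Set<String> walDirServers, Set<String> scpServers) {
        return znodes.contains(server)
            || walDirServers.contains(server)
            || scpServers.contains(server);
    }

    public static void main(String[] args) {
        Map<String, String> meta = new LinkedHashMap<>();
        meta.put("region-a", "rs1-host-1,16020,100"); // only meta remembers it
        meta.put("region-b", "rs2-host-1,16020,200"); // live in ZK

        Set<String> walDirServers = Collections.emptySet(); // WALs deleted
        Set<String> scpServers = Collections.emptySet();    // procs deleted
        Set<String> znodes = Collections.singleton("rs2-host-1,16020,200");

        for (Map.Entry<String, String> e : meta.entrySet()) {
            boolean known = isKnown(e.getValue(), znodes,
                walDirServers, scpServers);
            System.out.println(e.getKey() + " on " + e.getValue()
                + (known ? " (ok)" : " (unknown server, stuck)"));
        }
    }
}
```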
----
About the unit test failures: now I'm hitting a strange issue. My tests
work fine if I delete the WALs, MasterProcWALs, and the ZK baseZNode on
branch-2.2. However, with the same setup on branch-2.3+ and master, master
initialization hangs if the ZK baseZNode is deleted, with or without my
changes. (What has changed in branch-2.3? I found MasterRegion, but I'm not
sure why that would be related to ZK data; is it a bug?)
Interestingly, my fix works if I keep the baseZnode, so I'm trying to figure
out the right way to clean up Zookeeper so that it matches one of the cloud use
cases, where the WALs on HDFS and the ZK data are also deleted when the HBase
cluster is terminated.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]