Sergey Shelukhin created HBASE-21742:
----------------------------------------
Summary: master can create bad procedures during abort, making
entire cluster unusable
Key: HBASE-21742
URL: https://issues.apache.org/jira/browse/HBASE-21742
Project: HBase
Issue Type: Bug
Reporter: Sergey Shelukhin
A small HDFS hiccup causes the master and the RS carrying hbase:meta to fail together. The
master goes first:
{noformat}
2019-01-18 08:09:46,790 INFO [KeepAlivePEWorker-311]
zookeeper.MetaTableLocator: Setting hbase:meta (replicaId=0) location in
ZooKeeper as meta-rs,17020,1547824792484
...
2019-01-18 10:01:16,904 ERROR [PEWorker-11] master.HMaster: ***** ABORTING
master master,17000,1547604554447: FAILED [blah] *****
...
2019-01-18 10:01:17,087 INFO [master/master:17000]
assignment.AssignmentManager: Stopping assignment manager
{noformat}
A bunch of activity keeps happening, including procedure retries, which is also
suspect but not the point here:
{noformat}
2019-01-18 10:01:21,598 INFO [PEWorker-3] procedure2.TimeoutExecutorThread:
ADDED pid=104031, state=WAITING_TIMEOUT:REGION_STATE_TRANSITION_CLOSE, ...
{noformat}
Then the meta RS decides it's time to go:
{noformat}
2019-01-18 10:01:25,319 INFO [RegionServerTracker-0]
master.RegionServerTracker: RegionServer ephemeral node deleted, processing
expiration [meta-rs,17020,1547824792484]
...
2019-01-18 10:01:25,463 INFO [RegionServerTracker-0]
assignment.AssignmentManager: Added meta-rs,17020,1547824792484 to dead servers
which carryingMeta=false, submitted ServerCrashProcedure pid=104313
{noformat}
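To make the race concrete, here is a minimal, hypothetical Java sketch (class and method names
are illustrative only, not the actual HBase code): the abort stops the AssignmentManager first,
so when the RegionServerTracker later processes the meta RS expiration, the carrying-meta check
no longer sees meta's location and the SCP gets created and persisted with carryingMeta=false.
{code:java}
// Simplified, hypothetical sketch of the race; names do not match the real HBase code.
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class CrashHandlingSketch {

  // Stand-in for the AssignmentManager's view of which server carries hbase:meta.
  static class AssignmentView {
    private volatile boolean stopped = false;
    private final Set<String> metaCarriers = ConcurrentHashMap.newKeySet();

    void setMetaLocation(String server) { metaCarriers.add(server); }

    // Once the manager is stopped during abort, its state is cleared, so this
    // returns false even for the server that really holds hbase:meta.
    boolean isCarryingMeta(String server) {
      return !stopped && metaCarriers.contains(server);
    }

    void stop() { stopped = true; metaCarriers.clear(); }
  }

  // Stand-in for the tracker reacting to the RS ephemeral node expiring.
  static void processExpiration(AssignmentView am, String server) {
    boolean carryingMeta = am.isCarryingMeta(server);
    // The SCP gets persisted with this flag; if it is false, nothing will ever
    // reassign hbase:meta after the master restarts.
    System.out.println("submitting SCP for " + server + " carryingMeta=" + carryingMeta);
  }

  public static void main(String[] args) {
    AssignmentView am = new AssignmentView();
    am.setMetaLocation("meta-rs,17020,1547824792484");

    am.stop();                                            // master abort stops the AM first
    processExpiration(am, "meta-rs,17020,1547824792484"); // then the RS expiration is processed
    // Prints carryingMeta=false: a bad SCP is created and persisted.
  }
}
{code}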
This SCP gets persisted, so when the next master starts, it waits forever for
hbase:meta to be onlined, while there is no SCP with carryingMeta=true that would
actually online it.
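A toy sketch of the resulting startup hang (again illustrative, not the real HMaster code):
replaying the persisted procedures yields no SCP that will reassign hbase:meta, so the wait
for meta never completes.
{code:java}
// Illustrative only: why startup blocks when the only persisted SCP has carryingMeta=false.
import java.util.List;

public class StartupHangSketch {
  record Scp(String server, boolean carryingMeta) {}

  static boolean metaOnline = false;

  static void replay(List<Scp> persisted) {
    for (Scp scp : persisted) {
      if (scp.carryingMeta()) {
        metaOnline = true; // only an SCP with carryingMeta=true reassigns meta
      }
    }
  }

  public static void main(String[] args) throws InterruptedException {
    // The only persisted SCP is the bad one created during the abort.
    replay(List.of(new Scp("meta-rs,17020,1547824792484", false)));

    while (!metaOnline) {        // startup never gets past this point
      System.out.println("Waiting for hbase:meta to be online...");
      Thread.sleep(1000);
    }
  }
}
{code}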
The only way around this is to delete the procv2 WAL. The master has all the
information here, as it often does in the bugs I've found recently, but some
split-brain procedures cause it to get stuck one way or another.
I will file a separate bug about that.
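For reference, a minimal sketch of that workaround via the plain Hadoop FileSystem API. It
assumes the default layout where the procedure store WALs live under
<hbase.rootdir>/MasterProcWALs with hbase.rootdir=/hbase, that the master is stopped first,
and that it is acceptable to throw away every pending procedure; in practice the same thing
is usually done with hdfs dfs -rm -r.
{code:java}
// Sketch of the "delete the procv2 WAL" workaround; run only with the master down,
// and only if discarding all pending procedures is acceptable.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WipeProcWals {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();          // picks up core-site.xml/hdfs-site.xml
    Path procWals = new Path("/hbase/MasterProcWALs"); // assumes hbase.rootdir=/hbase
    try (FileSystem fs = FileSystem.get(conf)) {
      if (fs.exists(procWals)) {
        fs.delete(procWals, true);                     // recursively remove all procedure WALs
        System.out.println("Deleted " + procWals + "; restart the master afterwards.");
      }
    }
  }
}
{code}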
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)