Sergey Shelukhin created HBASE-21742:
----------------------------------------

             Summary: master can create bad procedures during abort, making 
entire cluster unusable
                 Key: HBASE-21742
                 URL: https://issues.apache.org/jira/browse/HBASE-21742
             Project: HBase
          Issue Type: Bug
            Reporter: Sergey Shelukhin


Some small HDFS hiccup causes master and meta RS to fail together. Master goes 
first:
{noformat}
2019-01-18 08:09:46,790 INFO  [KeepAlivePEWorker-311] 
zookeeper.MetaTableLocator: Setting hbase:meta (replicaId=0) location in 
ZooKeeper as meta-rs,17020,1547824792484
...
2019-01-18 10:01:16,904 ERROR [PEWorker-11] master.HMaster: ***** ABORTING 
master master,17000,1547604554447: FAILED [blah] *****
...
2019-01-18 10:01:17,087 INFO  [master/master:17000] 
assignment.AssignmentManager: Stopping assignment manager
{noformat}
Bunch of stuff keeps happening, including procedure retries, which is also 
suspect, but not the point here:
{noformat}
2019-01-18 10:01:21,598 INFO  [PEWorker-3] procedure2.TimeoutExecutorThread: 
ADDED pid=104031, state=WAITING_TIMEOUT:REGION_STATE_TRANSITION_CLOSE, ... 
{noformat}
{noformat}

Then the meta RS decides it's time to go:
{noformat}
2019-01-18 10:01:25,319 INFO  [RegionServerTracker-0] 
master.RegionServerTracker: RegionServer ephemeral node deleted, processing 
expiration [meta-rs,17020,1547824792484]
...
2019-01-18 10:01:25,463 INFO  [RegionServerTracker-0] 
assignment.AssignmentManager: Added meta-rs,17020,1547824792484 to dead servers 
which carryingMeta=false, submitted ServerCrashProcedure pid=104313
{noformat}
This SCP gets persisted, so when the next master starts, it waits forever for 
meta to be onlined, while there's no SCP with meta=true to online it.

The only way around this is to delete the procv2 WAL - master has all the 
information here, as it often does in bugs I've found recently, but some split 
brain procedures cause it to get stuck one way or another.

I will file a separate bug about that.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to