Ashu Pachauri created HBASE-14802:
-------------------------------------
Summary: Replaying server crash recovery procedure after a
failover causes incorrect handling of deadservers
Key: HBASE-14802
URL: https://issues.apache.org/jira/browse/HBASE-14802
Project: HBase
Issue Type: Bug
Components: master
Affects Versions: 2.0.0, 1.2.0, 1.2.1
Reporter: Ashu Pachauri
Assignee: Ashu Pachauri
Dead servers are processed as follows: a ServerCrashProcedure is launched
for a server after it is added to the dead servers list.
Every time a server is added to the dead servers list, a counter
"numProcessing" is incremented; it is decremented when a crash recovery
procedure finishes. Since adding a dead server and recovering it are two
separate events, the counter can get out of sync with the actual recovery state.
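The bookkeeping described above can be sketched as a minimal Java model. This is not HBase's DeadServer code. The class DeadServerTracker, its methods, and the plain string server names are hypothetical. The sketch also assumes that "dead servers in progress" means "counter is non-zero".

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical model of the dead-server bookkeeping; not HBase's actual class.
class DeadServerTracker {
    // The "dead servers list".
    private final Set<String> deadServers = new HashSet<>();
    // Incremented on add(), decremented when a crash recovery procedure finishes.
    private int numProcessing = 0;

    // Event 1: a server is declared dead (a ServerCrashProcedure is then launched).
    synchronized void add(String serverName) {
        deadServers.add(serverName);
        numProcessing++;
    }

    // Event 2: the crash recovery procedure for serverName completes.
    synchronized void finish(String serverName) {
        numProcessing--;
    }

    // Assumed check: any non-zero count means recovery is still in progress.
    synchronized boolean areDeadServersInProgress() {
        return numProcessing != 0;
    }

    synchronized int getNumProcessing() {
        return numProcessing;
    }
}

class NormalLifecycleDemo {
    public static void main(String[] args) {
        DeadServerTracker tracker = new DeadServerTracker();
        tracker.add("rs1");
        System.out.println(tracker.areDeadServersInProgress()); // true
        tracker.finish("rs1");
        System.out.println(tracker.areDeadServersInProgress()); // false
    }
}
```

In the normal case the two events are paired, so the counter returns to zero. Nothing in the model ties a finish() call to an earlier add() on the same tracker, and that is where the inconsistency can creep in.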
If a master failover occurs in the middle of crash recovery, the
numProcessing counter starts again from zero on the new master, but the
ServerCrashProcedure is replayed by the new master. When the replayed procedure
finishes, it decrements the counter, driving it negative, so the master
believes that dead servers are still in the process of recovery.
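The failover sequence can be modeled in a short Java sketch. It assumes, as the description implies, that numProcessing lives only in the active master's memory. It also assumes the ServerCrashProcedure survives the failover and is replayed without the server being re-added to the dead list. MasterState and FailoverReplayDemo are hypothetical names used only for illustration.

```java
// Hypothetical model: the counter is in-memory only, while the crash
// procedure is persisted and replayed by whichever master takes over.
class MasterState {
    int numProcessing = 0; // not persisted across failover

    void onServerDead() { numProcessing++; }
    void onCrashProcedureFinished() { numProcessing--; }
    boolean areDeadServersInProgress() { return numProcessing != 0; }
}

class FailoverReplayDemo {
    static MasterState simulateFailover() {
        MasterState oldMaster = new MasterState();
        oldMaster.onServerDead(); // old master: numProcessing = 1

        // Failover before the procedure finishes: the old master's counter
        // is lost and the new master starts from 0.
        MasterState newMaster = new MasterState();

        // The replayed procedure completes and decrements: 0 -> -1.
        newMaster.onCrashProcedureFinished();
        return newMaster;
    }

    public static void main(String[] args) {
        MasterState m = simulateFailover();
        System.out.println(m.numProcessing);              // -1
        System.out.println(m.areDeadServersInProgress()); // true
    }
}
```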
This also affects the balancer: after such a failover, the balancer stops
running.
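The balancer impact can be modeled the same way. The sketch assumes the balancer skips a run whenever the dead-server counter is non-zero. GatedBalancer and its methods are hypothetical and do not reflect HBase's actual HMaster or balancer API.

```java
// Hypothetical balancer gate; names are illustrative, not HBase's API.
class GatedBalancer {
    private int numProcessing;
    private int balanceRuns = 0;

    GatedBalancer(int initialNumProcessing) {
        this.numProcessing = initialNumProcessing;
    }

    void serverDied() { numProcessing++; }
    void crashRecoveryFinished() { numProcessing--; }

    // Skips the run while the counter reports dead servers in progress.
    boolean balance() {
        if (numProcessing != 0) {
            return false;
        }
        balanceRuns++;
        return true;
    }

    int getBalanceRuns() { return balanceRuns; }
}

class StuckBalancerDemo {
    public static void main(String[] args) {
        // Counter value left behind by the replayed procedure.
        GatedBalancer b = new GatedBalancer(-1);
        System.out.println(b.balance()); // false

        // A later crash/recovery cycle moves the counter -1 -> 0 -> -1,
        // so the balancer is blocked again once that recovery completes.
        b.serverDied();
        b.crashRecoveryFinished();
        System.out.println(b.balance());        // false
        System.out.println(b.getBalanceRuns()); // 0
    }
}
```

Because the counter is offset by -1, each paired add/finish cycle returns it to -1 rather than 0. In this model the balancer therefore stays blocked until the master's in-memory counter is reset.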