[
https://issues.apache.org/jira/browse/HBASE-14802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
stack reopened HBASE-14802:
---------------------------
Reverted from master branch. Since this went in, the related TestDeadServer test
has been failing.
See
https://builds.apache.org/job/HBase-Trunk_matrix/lastCompletedBuild/jdk=latest1.8,label=Hadoop/testReport/org.apache.hadoop.hbase.master/TestDeadServer/testCrashProcedureReplay/history/
Failed twice in master jdk8 builds (passed once).
Failed once on jdk7
https://builds.apache.org/job/HBase-Trunk_matrix/464/jdk=latest1.7,label=Hadoop/testReport/junit/org.apache.hadoop.hbase.master/TestDeadServer/testCrashProcedureReplay/history/
It seems to be just timing out during startup, which is odd.
I have not reverted from branch-1 or branch-1.2 because there has been a successful
build in the latter. In the former, the failure seemed to be something else.
> Replaying server crash recovery procedure after a failover causes incorrect
> handling of deadservers
> ---------------------------------------------------------------------------------------------------
>
> Key: HBASE-14802
> URL: https://issues.apache.org/jira/browse/HBASE-14802
> Project: HBase
> Issue Type: Bug
> Components: master
> Affects Versions: 2.0.0, 1.2.0, 1.2.1
> Reporter: Ashu Pachauri
> Assignee: Ashu Pachauri
> Fix For: 2.0.0, 1.2.0, 1.3.0
>
> Attachments: HBASE-14802-1.patch, HBASE-14802-2.patch,
> HBASE-14802-3.patch, HBASE-14802.patch
>
>
> The way dead servers are processed is that a ServerCrashProcedure is launched
> for a server after it is added to the dead servers list.
> Every time a server is added to the dead list, a counter "numProcessing" is
> incremented and it is decremented when a crash recovery procedure finishes.
> Since adding a dead server and recovering it are two separate events, this can
> cause inconsistencies.
> If a master failover occurs in the middle of the crash recovery, the
> numProcessing counter resets but the ServerCrashProcedure is replayed by the
> new master. This causes the counter to go negative and makes the master think
> that dead servers are still in the process of recovery.
> As a consequence, the balancer ceases to run after such a failover.
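
To make the quoted description concrete, here is a minimal sketch of how the counter can go negative. It is a toy model, not the HBase code: the class name, the method names, and the "non-zero means still recovering" check are assumptions made for illustration only.

```java
// Toy model of the counter behaviour described in the issue.
// NOT the HBase implementation: all names here are hypothetical.
class DeadServerCounterSketch {
    // Held in memory only, so a newly active master starts again from zero.
    private int numProcessing = 0;

    /** Called when a server is added to the dead servers list. */
    void add(String serverName) {
        numProcessing++;
    }

    /** Called when a crash recovery procedure for a server finishes. */
    void finish(String serverName) {
        numProcessing--;
    }

    /** Assumed check: any non-zero count is read as "recovery still running". */
    boolean areDeadServersInProgress() {
        return numProcessing != 0;
    }

    int getNumProcessing() {
        return numProcessing;
    }

    public static void main(String[] args) {
        // Old master: a server dies and its crash recovery starts.
        DeadServerCounterSketch oldMaster = new DeadServerCounterSketch();
        oldMaster.add("rs1");

        // Failover in the middle of recovery: the new master has a fresh counter.
        DeadServerCounterSketch newMaster = new DeadServerCounterSketch();

        // The replayed procedure finishes on the new master with no matching add().
        newMaster.finish("rs1");

        System.out.println(newMaster.getNumProcessing());         // prints -1
        System.out.println(newMaster.areDeadServersInProgress()); // prints true
    }
}
```

In this sketch the new master never returns to zero, so anything gated on the check (the balancer, per the description) would stay blocked.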