[jira] [Commented] (HBASE-14802) Replaying server crash recovery procedure after a failover causes incorrect handling of deadservers

Ashu Pachauri (JIRA) Sun, 15 Nov 2015 11:02:57 -0800

    [ 
https://issues.apache.org/jira/browse/HBASE-14802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15005992#comment-15005992
 ]


Ashu Pachauri commented on HBASE-14802:
---------------------------------------

If the timeout were caused by the change itself, I don't see why mini cluster 
startup would time out only in this test. The only explanation I can think of 
is that the test sets a 15-second timeout and the mini cluster fails to come 
up within that window. I will move the mini cluster startup into the setup 
stage rather than doing it inside the test itself.

> Replaying server crash recovery procedure after a failover causes incorrect 
> handling of deadservers
> ---------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-14802
>                 URL: https://issues.apache.org/jira/browse/HBASE-14802
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 2.0.0, 1.2.0, 1.2.1
>            Reporter: Ashu Pachauri
>            Assignee: Ashu Pachauri
>             Fix For: 2.0.0, 1.2.0, 1.3.0
>
>         Attachments: HBASE-14802-1.patch, HBASE-14802-2.patch, 
> HBASE-14802-3.patch, HBASE-14802.patch
>
>
> Dead servers are processed by launching a ServerCrashProcedure for a server 
> after it is added to the dead servers list. 
> Every time a server is added to the dead list, a counter "numProcessing" is 
> incremented; it is decremented when a crash recovery procedure finishes. 
> Since adding a dead server and recovering it are two separate events, the 
> counter can become inconsistent.
> If a master failover occurs in the middle of the crash recovery, the 
> numProcessing counter resets but the ServerCrashProcedure is replayed by the 
> new master. This causes the counter to go negative and makes the master think 
> that dead servers are still in the process of recovery. 
> As a consequence, the balancer ceases to run after such a failover.
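
The counter drift described above can be reproduced with a small standalone 
sketch. This is not the HBase code: the Tracker class and its method names are 
hypothetical stand-ins, and the sketch assumes the "in progress" check is a 
simple non-zero test on numProcessing, which the description implies but does 
not state.

```java
import java.util.HashSet;
import java.util.Set;

public class DeadServerCounterSketch {

    /** Hypothetical stand-in for a master's in-memory dead server bookkeeping. */
    static class Tracker {
        private final Set<String> deadServers = new HashSet<>();
        private int numProcessing = 0;

        /** Called when a server is added to the dead list. */
        void add(String serverName) {
            deadServers.add(serverName);
            numProcessing++;
        }

        /** Called when a crash recovery procedure for a server finishes. */
        void finish(String serverName) {
            numProcessing--;
        }

        /** Assumed check: any non-zero count means recovery is still running. */
        boolean areDeadServersInProgress() {
            return numProcessing != 0;
        }

        int getNumProcessing() {
            return numProcessing;
        }
    }

    public static void main(String[] args) {
        // Old master: the server dies and recovery starts, so the count is 1.
        Tracker oldMaster = new Tracker();
        oldMaster.add("rs1.example.org,16020,1");

        // Failover: the new master starts with fresh in-memory state (count 0),
        // but the persisted crash procedure is replayed and finishes there.
        Tracker newMaster = new Tracker();
        newMaster.finish("rs1.example.org,16020,1");

        System.out.println("numProcessing = " + newMaster.getNumProcessing());
        System.out.println("in progress   = " + newMaster.areDeadServersInProgress());
    }
}
```

In this sketch the new master ends with numProcessing at -1, so the non-zero 
check keeps reporting dead servers in progress even though no recovery is 
running. Anything gated on that check, such as the balancer, would stay 
blocked.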



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
