[jira] [Commented] (HBASE-14059) We should add a RS to the dead servers list if admin calls fail more than a threshold

Duo Zhang (JIRA) Sun, 19 Jul 2015 20:48:07 -0700

    [ https://issues.apache.org/jira/browse/HBASE-14059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14633048#comment-14633048 ]


Duo Zhang commented on HBASE-14059:
-----------------------------------

{quote}
In the issue I ran into it was a bad region causing the RS to be blocked for a 
long time. 
{quote}

Could you give more details? What does 'bad' mean here? You also said the 
region went back to normal once you killed the RS, so I think there may be 
another bug?

In general I agree with you: we should take an RS offline if admin calls to it 
always fail. But this should be used to fix a 'bad' RS, not a 'bad' region. If 
there is a 'bad' region that cannot be fixed by reassignment, then as 
[~chenheng] said, every RS it gets reassigned to will be marked dead in turn, 
and the 'bad' region will kill all the regionservers in your cluster...
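
To make the distinction concrete, here is a minimal sketch of one way the
master could count failed admin calls per RS while refusing to blame the RS
when every failure involves the same region. Everything in it is an
assumption for illustration: AdminCallFailureTracker, maxFailures and
minDistinctRegions are hypothetical names, not existing HBase classes or
configuration keys, and the "at least N distinct regions" rule is just one
possible heuristic, not something proposed in the patch.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper, not part of HBase: tracks failed admin calls per RS. */
class AdminCallFailureTracker {
  private final int maxFailures;        // assumed threshold of failed calls
  private final int minDistinctRegions; // assumed guard against a single 'bad' region
  private final Map<String, Integer> failures = new HashMap<>();
  private final Map<String, Set<String>> failedRegions = new HashMap<>();

  AdminCallFailureTracker(int maxFailures, int minDistinctRegions) {
    this.maxFailures = maxFailures;
    this.minDistinctRegions = minDistinctRegions;
  }

  /**
   * Records one failed admin call (e.g. open/close region) against a server.
   * Returns true when the server has failed more than maxFailures calls
   * since its last success AND the failures span at least
   * minDistinctRegions different regions.
   */
  synchronized boolean recordFailure(String server, String region) {
    int count = failures.merge(server, 1, Integer::sum);
    Set<String> regions = failedRegions.computeIfAbsent(server, k -> new HashSet<>());
    regions.add(region);
    return count > maxFailures && regions.size() >= minDistinctRegions;
  }

  /** A successful admin call clears the server's failure history. */
  synchronized void recordSuccess(String server) {
    failures.remove(server);
    failedRegions.remove(server);
  }
}
```

With maxFailures = 2 and minDistinctRegions = 2, an RS that keeps failing
calls for one region is never reported, while an RS that fails calls for
two different regions is reported once it passes the threshold.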

Thanks.

> We should add a RS to the dead servers list if admin calls fail more than a 
> threshold
> -------------------------------------------------------------------------------------
>
>                 Key: HBASE-14059
>                 URL: https://issues.apache.org/jira/browse/HBASE-14059
>             Project: HBase
>          Issue Type: Bug
>          Components: master, regionserver, rpc
>    Affects Versions: 0.98.13
>            Reporter: Esteban Gutierrez
>            Assignee: Esteban Gutierrez
>            Priority: Critical
>
> I ran into this problem twice this week: calls from the HBase master to a RS 
> can time out because the RS call queue has been maxed out. However, since 
> the RS is not dead (its ephemeral znode is still present), the master keeps 
> attempting admin tasks such as opening or closing a region, and those 
> operations eventually fail after we run out of retries or the assignment 
> manager attempts to re-assign to other RSs. As side effects, I've seen 
> master operations fully blocked, or regions stuck in transition (RITs), 
> because we cannot close the region and open it in a new location while the 
> RS is not dead. 
> A potential solution is to add the RS to the list of dead RSs after a 
> certain number of calls from the master to the RS fail.
> I've only noticed the problem in 0.98.x, but it should be present in all 
> versions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
