[ https://issues.apache.org/jira/browse/HBASE-18549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16596800#comment-16596800 ]
Andrew Purtell commented on HBASE-18549:
----------------------------------------
+1
What do you think about a patch for branch-1 too? [~xucang]
> Unclaimed replication queues can go undetected
> ----------------------------------------------
>
> Key: HBASE-18549
> URL: https://issues.apache.org/jira/browse/HBASE-18549
> Project: HBase
> Issue Type: Bug
> Components: Replication
> Reporter: Ashu Pachauri
> Assignee: Xu Cang
> Priority: Critical
> Fix For: 1.5.0, 1.3.3, 1.4.8
>
> Attachments: HBASE-18549-.master.001.patch,
> HBASE-18549-.master.002.patch, HBASE-18549-.master.003.patch
>
>
> We have repeatedly come across situations where a ZooKeeper issue causes
> NodeFailoverWorker to silently fail to pick up the replication queue of a dead
> region server. One example is when the znode size for a particular queue
> exceeds the jute.maxBuffer value.
> Other situations may lead to the same failure and go undetected.
> We need a metric for the number of unclaimed replication queues. This will
> help mitigate the problem by allowing alerts on the metric and by making it
> easier to identify the underlying issues.
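
A minimal sketch of how such a metric could be computed, not the code from the attached patches. It assumes a queue counts as unclaimed when the region server that owns it is not in the set of live servers. The class name, method name, and inputs are all hypothetical and do not correspond to HBase APIs.

```java
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of HBase: counts replication queues whose
// owning region server is no longer alive.
public class UnclaimedQueueCounter {

    // queueOwners: one entry per replication queue, naming the region server
    //              that currently owns it (assumed input format).
    // liveServers: names of region servers currently considered alive.
    public static int countUnclaimed(List<String> queueOwners, Set<String> liveServers) {
        int unclaimed = 0;
        for (String owner : queueOwners) {
            // A queue still owned by a dead server has not been picked up
            // by any failover worker.
            if (!liveServers.contains(owner)) {
                unclaimed++;
            }
        }
        return unclaimed;
    }
}
```

A value that stays above zero after a failover should have completed is the condition one would alert on.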