[ https://issues.apache.org/jira/browse/HBASE-18549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16589751#comment-16589751 ]
Xu Cang commented on HBASE-18549:
---------------------------------
My takeaway from this JIRA's description is that it asks for a metric covering
failures to recover a replication queue.
The cause of such a failure could be an oversized ZK node or some other unknown issue.
For the ZK issue, the only solution is to raise the limit via "jute.maxbuffer"
on both the client and the server side. So I am not changing any HBase code to try
to handle it. The patch simply increments the count for this new metric and logs an
error to alert operators.
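As a rough illustration of the approach described above (count the failure and log it, without attempting to repair the ZK condition), here is a minimal sketch. The class, interface, and method names are hypothetical stand-ins chosen for this example; they are not the actual HBase classes or the code in the patch.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical stand-in for a global replication metrics source (not HBase's class).
class GlobalReplicationMetrics {
    private final AtomicLong completedRecoverQueues = new AtomicLong();
    private final AtomicLong failedRecoverQueues = new AtomicLong();

    void incrCompletedRecoverQueue() { completedRecoverQueues.incrementAndGet(); }
    void incrFailedRecoverQueue() { failedRecoverQueues.incrementAndGet(); }
    long getCompletedRecoverQueues() { return completedRecoverQueues.get(); }
    long getFailedRecoverQueues() { return failedRecoverQueues.get(); }
}

// Hypothetical callback representing the step that claims a dead server's queues.
interface QueueClaimer {
    void claim(String deadRegionServer) throws Exception;
}

// Hypothetical failover worker: it does not try to fix the failure,
// it only records it in the metric and logs it for operators.
class FailoverWorkerSketch {
    private static final Logger LOG =
        Logger.getLogger(FailoverWorkerSketch.class.getName());

    private final GlobalReplicationMetrics metrics;
    private final QueueClaimer claimer;

    FailoverWorkerSketch(GlobalReplicationMetrics metrics, QueueClaimer claimer) {
        this.metrics = metrics;
        this.claimer = claimer;
    }

    void recover(String deadRegionServer) {
        try {
            claimer.claim(deadRegionServer);
            metrics.incrCompletedRecoverQueue();
        } catch (Exception e) {
            // The failure may be an oversized znode or something unknown;
            // either way it is counted and surfaced in the log.
            metrics.incrFailedRecoverQueue();
            LOG.log(Level.SEVERE,
                "Failed to recover replication queues of " + deadRegionServer, e);
        }
    }
}
```

Keeping the counters in one shared object, rather than in a per-peer source, means the failure count survives even if an individual source goes away.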
Metrics look like this after this patch:
"name" : "Hadoop:service=HBase,name=RegionServer,sub=Replication",
"modelerType" : "RegionServer,sub=Replication",
"tag.Context" : "regionserver",
"tag.Hostname" : "xcang-ltm.internal.salesforce.com",
"source.shippedHFiles" : 0,
"Source.ageOfLastShippedOp_num_ops" : 0,
...
"source.completedRecoverQueues" : 0,
...
*"source.failedRecoverQueues" : 0, (this metric lives in the global source, so
it is resilient to any particular source going down)*
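For operators who want to alert on the new counter, the following is a minimal sketch of a check against a JSON dump like the one above. It assumes the JSON text has already been fetched by some other means; the class name and the alerting threshold (any value above zero) are assumptions made for illustration only.

```java
import java.util.OptionalLong;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls "source.failedRecoverQueues" out of a metrics JSON dump.
class FailedRecoverQueuesCheck {
    private static final Pattern FAILED =
        Pattern.compile("\"source\\.failedRecoverQueues\"\\s*:\\s*(\\d+)");

    // Returns the metric value, or empty if the metric is not present in the dump.
    static OptionalLong parse(String metricsJson) {
        Matcher m = FAILED.matcher(metricsJson);
        return m.find()
            ? OptionalLong.of(Long.parseLong(m.group(1)))
            : OptionalLong.empty();
    }

    // Assumed policy: alert as soon as at least one queue recovery has failed.
    static boolean shouldAlert(String metricsJson) {
        OptionalLong value = parse(metricsJson);
        return value.isPresent() && value.getAsLong() > 0;
    }
}
```

A regular expression is used here only to keep the sketch dependency-free; a real monitoring setup would more likely read the value through a JSON parser or an existing metrics collector.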
> Unclaimed replication queues can go undetected
> ----------------------------------------------
>
> Key: HBASE-18549
> URL: https://issues.apache.org/jira/browse/HBASE-18549
> Project: HBase
> Issue Type: Bug
> Components: Replication
> Reporter: Ashu Pachauri
> Assignee: Xu Cang
> Priority: Critical
> Fix For: 1.5.0, 1.3.3, 1.4.8
>
> Attachments: HBASE-18549-.master.wip.patch
>
>
> We have come across this situation multiple times, where a ZooKeeper issue
> can cause NodeFailoverWorker to silently fail to pick up the replication queue
> of a dead region server. One example is when the znode size for a particular
> queue exceeds the jute.maxbuffer value.
> There can be other situations that lead to this and go undetected.
> We need a metric for the number of unclaimed replication queues. This
> will help mitigate the problem by alerting on the metric and
> identifying underlying issues.