Ashu Pachauri created HBASE-18549:
-------------------------------------

             Summary: Unclaimed replication queues can go undetected
                 Key: HBASE-18549
                 URL: https://issues.apache.org/jira/browse/HBASE-18549
             Project: HBase
          Issue Type: Bug
          Components: Replication
            Reporter: Ashu Pachauri
            Priority: Critical
             Fix For: 1.3.2

We have repeatedly seen ZooKeeper issues cause NodeFailoverWorker to silently fail to pick up the replication queues of a dead region server. One example is when the znode size for a particular queue exceeds the jute.maxBuffer value. Other situations may lead to the same failure and likewise go undetected.

We need a metric for the number of unclaimed replication queues. Alerting on this metric would help mitigate the problem and identify the underlying issues.
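
To make the proposed metric concrete, here is a minimal sketch of how such a count could be derived. It is not the actual patch. It assumes that a queue counts as unclaimed when it is still listed under a region server that is no longer live. The class name UnclaimedQueues, the method countUnclaimed, and the server names are hypothetical, and the two inputs are assumed to have been read elsewhere (for example from the replication queue znodes and the live region server znodes).

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of HBase: computes the value an
// "unclaimed replication queues" metric might report.
public class UnclaimedQueues {

    /**
     * Counts queues whose owning region server is not in the live set.
     *
     * @param queuesByOwner queue ids grouped by the server that owns them
     *                      (assumed to be read from ZooKeeper by the caller)
     * @param liveServers   names of the currently live region servers
     */
    public static int countUnclaimed(Map<String, List<String>> queuesByOwner,
                                     Set<String> liveServers) {
        int unclaimed = 0;
        for (Map.Entry<String, List<String>> entry : queuesByOwner.entrySet()) {
            // A queue still owned by a dead server has not been claimed.
            if (!liveServers.contains(entry.getKey())) {
                unclaimed += entry.getValue().size();
            }
        }
        return unclaimed;
    }

    public static void main(String[] args) {
        // Made-up server names and peer ids, for illustration only.
        Map<String, List<String>> queuesByOwner = new HashMap<>();
        queuesByOwner.put("rs1", Arrays.asList("1"));
        queuesByOwner.put("rs2", Arrays.asList("1", "2"));
        Set<String> liveServers = new HashSet<>(Arrays.asList("rs1", "rs3"));

        System.out.println("unclaimed queues: "
            + countUnclaimed(queuesByOwner, liveServers));
    }
}
```

A value that stays above zero for longer than a normal failover takes would be the condition to alert on. The threshold and the check interval are left to the operator.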