[ 
https://issues.apache.org/jira/browse/HBASE-18549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16590470#comment-16590470
 ] 

Andrew Purtell commented on HBASE-18549:
----------------------------------------

bq. my takeaway from this JIRA's description is to providing metrics regarding 
failed to recover replication queue.

Yes, I believe that as well.

The patch mostly lgtm. The word "region" is misspelled ("reagion") in the new 
error text. That message could provide more information for debugging. You have:
{code}
@@ -891,7 +894,10 @@ public class ReplicationSourceManager implements ReplicationListener {
           queueStorage.removeReplicatorIfQueueIsEmpty(deadRS);
         }
       } catch (ReplicationException e) {
-        server.abort("Failed to claim queue from dead regionserver", e);
+        LOG.error("ReplicationException: cannot claim dead reagion's replication queue," +
+            " possible solution: check if znode size exceeds jute.maxBuffer value. " +
+            " If so, increase it for both client and server side.");
+        server.abort("Failed to claim queue from dead regionserver.", e);
         return;
       }
       // Copying over the failed queue is completed.
{code}

Add the region name, the znode path, and the stacktrace to the log line. 
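
Something along these lines is what I have in mind. This is only a rough sketch, not the actual patch: the class and method names are hypothetical, the example znode path is illustrative, and it uses java.util.logging so that it compiles on its own (the real code would use the existing LOG in ReplicationSourceManager). The diff does not show which variable holds the znode path in that catch block, so it is passed in as a plain string here.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical, standalone sketch; not part of HBase.
final class ClaimQueueLogSketch {
  private static final Logger LOG =
      Logger.getLogger(ClaimQueueLogSketch.class.getName());

  // Builds a message that names the dead regionserver and the znode involved.
  public static String claimFailureMessage(String deadRS, String znodePath) {
    return "Failed to claim replication queue of dead regionserver " + deadRS
        + " (znode " + znodePath + "). Possible cause: znode size exceeds"
        + " jute.maxBuffer. If so, increase it on both client and server side.";
  }

  // Passing the exception as the last argument lets the logger record
  // the stack trace together with the message.
  public static void logClaimFailure(String deadRS, String znodePath, Exception e) {
    LOG.log(Level.SEVERE, claimFailureMessage(deadRS, znodePath), e);
  }
}
```

The point is that whoever reads the log can see which server's queue was left unclaimed and where to look in ZooKeeper, without having to correlate the abort message with other log lines.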

> Unclaimed replication queues can go undetected
> ----------------------------------------------
>
>                 Key: HBASE-18549
>                 URL: https://issues.apache.org/jira/browse/HBASE-18549
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>            Reporter: Ashu Pachauri
>            Assignee: Xu Cang
>            Priority: Critical
>             Fix For: 1.5.0, 1.3.3, 1.4.8
>
>         Attachments: HBASE-18549-.master.001.patch
>
>
> We have come across this situation multiple times where a ZooKeeper issue 
> can cause NodeFailoverWorker to silently fail to pick up the replication 
> queue for a dead region server. One example is when the znode size for a 
> particular queue exceeds the jute.maxBuffer value.
> There can be other situations that may lead to this and just go undetected. 
> We need to have a metric for number of unclaimed replication queues. This 
> will help in mitigating the problem through alerting on the metric and 
> identifying underlying issues.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
