[ https://issues.apache.org/jira/browse/HBASE-6165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13432029#comment-13432029 ]
Himanshu Vashishtha commented on HBASE-6165:
--------------------------------------------

[~eclark]: I used "custom" because the current naming scheme is not appropriate, in my opinion (I started with medium/semi QOS, but then changed it to Custom). "Priority" is kind of a misnomer, as there is no priority as such; it's just a different set of handlers serving the requests. Though we call them priorityHandlers, etc., they are just like regular handlers, but for meta operations. I think we should change their name to metaOpsHandlers (or metaHandlers). Yes, I just used a threshold between 0 and 10.

bq. Since this starts 0 "custom" priority handlers by default it will add another undocumented step when enabling replication. We should either make the number of handlers start by default > 0, or have the number depend on if replication is enabled.

I am OK with a default > 0; I don't think it should be tied to replication, as these handlers can be used for other methods too (such as Security, etc.).

@Lars:
bq. The naming is weird. These are not "Custom"QOS, but "Medium"QOS methods, right?

I hope the rationale is clear now.

bq. By default now (if hbase.regionserver.custom.priority.handler.count is not set), replicateWALEntry would use non-priority handlers... Which is not right, I think. It should revert back to the current behavior in that case (which is to do use the priorityQOS).

Does a default > 0 sound good?

bq. What I still do not understand... Does this problem always happen? Does it happen because replicateWALEntry takes too long to finish? Does this only happen when the slave is already degraded for other reasons? Should we also work on replicateWALEntry failing faster in case of problems (shorter/fewer retries, etc)?

It can occur when the slave cluster is slow, and whenever it happens, it makes the entire cluster unresponsive. I have a patch that adds fail-fast behavior in the sink, and I have been testing it too. It looks good so far.
I tried creating a new JIRA but got an IOE while creating it (see INFRA-5131). I will attach the patch once it's created.

> Replication can overrun .META scans on cluster re-start
> -------------------------------------------------------
>
>                 Key: HBASE-6165
>                 URL: https://issues.apache.org/jira/browse/HBASE-6165
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Elliott Clark
>         Attachments: HBase-6165-v1.patch
>
>
> When restarting a large set of regions on a reasonably small cluster the
> replication from another cluster tied up every xceiver meaning nothing could
> be onlined.
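
To make the handler-pool discussion in the comment concrete, here is a minimal Java sketch of how a request's QOS value might be mapped to one of three handler pools, including the fallback raised in the thread (with zero "custom" handlers configured, medium-QOS calls such as replicateWALEntry go back to the existing priority/meta handlers instead of the regular ones). This is not the code from HBase-6165-v1.patch: the class name, the enum, the method `selectPool`, and the constant values are all hypothetical, and the value 5 is just one assumed choice "between 0 and 10".

```java
// Hypothetical sketch; none of these names are taken from the HBASE-6165 patch.
final class HandlerPoolSketch {

    /** The three sets of handlers discussed in the comment. */
    enum Pool { REGULAR, CUSTOM, META }

    // Assumed QOS values: the comment only says the new level lies between 0 and 10.
    static final int NORMAL_QOS = 0;
    static final int CUSTOM_QOS = 5;      // assumption: any value strictly between 0 and 10
    static final int QOS_THRESHOLD = 10;  // assumption: at or above this, meta handlers serve the call

    /**
     * Picks the handler pool for a call.
     *
     * @param qos                the QOS value assigned to the call
     * @param customHandlerCount the configured number of "custom" handlers
     */
    static Pool selectPool(int qos, int customHandlerCount) {
        if (qos >= QOS_THRESHOLD) {
            return Pool.META;
        }
        if (qos > NORMAL_QOS) {
            // With no custom handlers configured, fall back to the existing
            // priority (meta) handlers rather than the regular ones.
            return customHandlerCount > 0 ? Pool.CUSTOM : Pool.META;
        }
        return Pool.REGULAR;
    }

    public static void main(String[] args) {
        System.out.println(selectPool(CUSTOM_QOS, 3)); // custom handlers configured
        System.out.println(selectPool(CUSTOM_QOS, 0)); // none configured: fallback
        System.out.println(selectPool(NORMAL_QOS, 0)); // ordinary request
    }
}
```

Keeping the fallback inside the selection function means an unset handler count would not silently move replication traffic onto the regular handlers; whether the actual patch takes this route or simply ships a default > 0 is left open in the thread.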