[jira] [Commented] (HBASE-6165) Replication can overrun .META scans on cluster re-start

Himanshu Vashishtha (JIRA) Thu, 09 Aug 2012 11:10:00 -0700

    [ 
https://issues.apache.org/jira/browse/HBASE-6165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13432029#comment-13432029
 ]


Himanshu Vashishtha commented on HBASE-6165:
--------------------------------------------

[~eclark]: I used "custom" because, in my opinion, the current naming scheme is 
not appropriate (I started with medium/semi QOS, but then changed it to Custom). 
Calling it "priority" is a misnomer, as there is no priority as such; it's just 
a different set of handlers serving the requests.
Though we call them priorityHandlers, etc., they are just like regular handlers 
but for meta operations. I think we should rename them to metaOpsHandlers 
(or metaHandlers). And yes, I just used a threshold between 0 and 10.

bq. Since this starts 0 "custom" priority handlers by default it will add 
another undocumented step when enabling replication. We should either make the 
number of handlers start by default > 0, or have the number depend on if 
replication is enabled.
I am OK with a default > 0; I don't think it should be tied to replication, as 
these handlers can be used for other methods too (such as Security, etc.).

@Lars: 
bq. The naming is weird. These are not "Custom"QOS, but "Medium"QOS methods, 
right?
I hope the rationale above makes sense now.

bq. By default now (if hbase.regionserver.custom.priority.handler.count is not 
set), replicateWALEntry would use non-priority handlers... Which is not right, 
I think. It should revert back to the current behavior in that case (which is 
to use the priorityQOS).
Does a default > 0 sound good?
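
To make the fallback being discussed concrete, here is a minimal sketch of the 
routing decision. It is not the code in the attached patch: the class name, the 
Pool enum, and the method set are hypothetical, and only replicateWALEntry and 
the idea behind hbase.regionserver.custom.priority.handler.count come from this 
thread. It assumes that a count of 0 means "no custom handlers were started".

```java
import java.util.Collections;
import java.util.Set;

/** Hypothetical sketch of picking a handler pool for an incoming RPC. */
public class HandlerPoolSelector {

    /** Illustrative pool names; not identifiers from the patch. */
    public enum Pool { REGULAR, CUSTOM, PRIORITY }

    // Assumed set of methods that should stay off the regular handlers.
    private static final Set<String> CUSTOM_QOS_METHODS =
            Collections.singleton("replicateWALEntry");

    // Stands in for the value of hbase.regionserver.custom.priority.handler.count.
    private final int customHandlerCount;

    public HandlerPoolSelector(int customHandlerCount) {
        this.customHandlerCount = customHandlerCount;
    }

    public Pool select(String methodName, boolean isMetaOperation) {
        if (isMetaOperation) {
            return Pool.PRIORITY;              // meta operations keep their own handlers
        }
        if (CUSTOM_QOS_METHODS.contains(methodName)) {
            // With no custom handlers configured, fall back to the priority
            // handlers (the current behavior) rather than the regular ones.
            return customHandlerCount > 0 ? Pool.CUSTOM : Pool.PRIORITY;
        }
        return Pool.REGULAR;
    }
}
```

With a default > 0 the fallback branch would rarely be taken, but it keeps an 
unset or zero count from silently pushing replication onto the regular handlers.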


bq. What I still do not understand... Does this problem always happen? Does it 
happen because replicateWALEntry takes too long to finish? Does this only 
happen when the slave is already degraded for other reasons? Should we also 
work on replicateWALEntry failing faster in case of problems (shorter/fewer 
retries, etc)?

It can occur when the slave cluster is slow, and whenever it happens it makes 
the entire cluster unresponsive. I have a patch that adds fail-fast behavior in 
the sink, and I have been testing it; it looks good so far. I tried to create a 
new JIRA for it but hit an IOE while creating it (see INFRA-5131). I will attach 
the patch once the JIRA is created.
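
As a rough illustration of what "fail fast" could mean here (shorter/fewer 
retries), the sketch below bounds an operation by both an attempt count and a 
time budget. It is not the pending patch: the FailFastRetry class, its method, 
and its parameters are hypothetical, and it assumes that only IOExceptions are 
worth retrying.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

/** Hypothetical helper: retry an operation within a fixed attempt and time budget. */
public class FailFastRetry {

    public static <T> T callWithBudget(Callable<T> op, int maxAttempts, long budgetMillis)
            throws IOException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long start = System.nanoTime();
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            long elapsedMillis = (System.nanoTime() - start) / 1_000_000L;
            // Always make the first attempt; stop retrying once the budget is spent.
            if (attempt > 1 && elapsedMillis >= budgetMillis) {
                break;
            }
            try {
                return op.call();
            } catch (IOException e) {
                last = e;                      // retryable: remember it and try again
            } catch (Exception e) {
                throw new IOException("non-retryable failure", e);
            }
        }
        // Give up quickly so the handler thread is released instead of blocking.
        throw new IOException("giving up after bounded retries", last);
    }
}
```

The point of the budget is that a degraded slave causes a quick, visible failure 
that the source can retry later, instead of holding a handler for a long time.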
                
> Replication can overrun .META scans on cluster re-start
> -------------------------------------------------------
>
>                 Key: HBASE-6165
>                 URL: https://issues.apache.org/jira/browse/HBASE-6165
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Elliott Clark
>         Attachments: HBase-6165-v1.patch
>
>
> When restarting a large set of regions on a reasonably small cluster the 
> replication from another cluster tied up every xceiver meaning nothing could 
> be onlined.

