[ 
https://issues.apache.org/jira/browse/HBASE-17114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15677304#comment-15677304
 ] 

Gary Helmling commented on HBASE-17114:
---------------------------------------

bq. Well, if you check the uploaded patch, it is indeed tied to CQTBE only. 
Introducing a new property only makes things more flexible, and of 
course we could use a hard-coded value, such as 5 times the existing pause, for 
CQTBE. But I'd say this is a trade-off: waiting longer on CQTBE could prevent 
the vicious circle but also causes higher latency, and IMHO the user should 
be able to control that trade-off. If they don't want CQTBE to be special, they 
could set hbase.client.pause.special to the same value as hbase.client.pause, 
which gives them more options.

I agree with allowing the user to control the behavior here, but this also 
increases the complexity and knowledge needed for configuration tuning, which we 
already have way too much of.  In general, we should be moving in the direction 
of making the system dynamically tune itself according to load instead of 
forcing all users to grapple with yet another configuration property.  By 
default, the configuration should be simple and provide the best experience for 
all users.  For advanced users who really need to treat CQTBE differently, that 
should be possible by means of an override, but it should not be forced on 
everyone.
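
A minimal sketch of what such an optional override could look like, using plain 
java.util.Properties rather than the real HBase Configuration class.  The class 
and method names are hypothetical, and the fallback to hbase.client.pause when 
hbase.client.pause.special is unset is an assumption for illustration, not 
what the attached patch does.

```java
import java.util.Properties;

// Hypothetical helper, not part of HBase: resolves the retry pause for CQTBE.
final class PauseConfigSketch {
    static final String PAUSE_KEY = "hbase.client.pause";
    // Property name proposed in this issue.
    static final String SPECIAL_PAUSE_KEY = "hbase.client.pause.special";
    // Default implied by the issue description (500ms is 5 times this value).
    static final long DEFAULT_PAUSE_MS = 100L;

    // Regular retry pause, falling back to the default when unset.
    static long basePause(Properties conf) {
        return Long.parseLong(
            conf.getProperty(PAUSE_KEY, String.valueOf(DEFAULT_PAUSE_MS)));
    }

    // Pause used for CQTBE: the override if present, else the regular pause
    // (assumed fallback, so that nobody is forced to set the new property).
    static long pauseForCqtbe(Properties conf) {
        String override = conf.getProperty(SPECIAL_PAUSE_KEY);
        return override == null ? basePause(conf) : Long.parseLong(override);
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        System.out.println("no override: " + pauseForCqtbe(conf) + " ms");
        conf.setProperty(SPECIAL_PAUSE_KEY, "500");
        System.out.println("with override: " + pauseForCqtbe(conf) + " ms");
    }
}
```

With this shape, users who never touch the new property see no change in 
behavior, and only those who set it get a distinct pause for CQTBE.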

bq. Sorry, but I don't see any difference between "should not clear the client meta 
cache" and "should not retry so frequently"; both try to resolve some 
problem and make things better.

These are two completely different things.  I don't see the equivalence.  We 
don't clear the meta cache because we don't have an indication that the region 
has moved, so there is no need to go back to meta.  The meta cache handling is 
completely independent of what is appropriate in terms of retries.

bq. No offense, but I'm even thinking of making throwing CQTBE optional, because 
in some cases users would prefer that the request dead-wait in RpcServer until 
it times out rather than receive an exception, retry, and fail again, but 
obviously this is another topic (Smile).

Blocking the RpcServer Reader threads indefinitely when the queue is full, 
making the server completely unresponsive and spilling overflow back into the 
OS networking buffers, is pretty poor behavior.  CQTBE is a crude mechanism for 
back-pressure to the client, but at least it gets the client a response and 
allows it to make an informed decision about how to proceed.  In the case where 
the application implements its own retries, the client may want to simply fail 
and kick the exception back up the stack, allowing other layers to retry.  Or 
the client could decide to retry for a fixed duration.  But in either case I 
think CQTBE provides a very clear improvement in overall server behavior.  
Another part of the puzzle is the CoDel scheduler, which will allow more useful 
work to get done in overloaded situations.
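
To make the two client choices concrete, here is a rough sketch in plain Java.  
QueueFullException is a hypothetical stand-in for CallQueueTooBigException, and 
callWithBudget is an illustrative helper, not an HBase client API.  A retry 
budget of 0 corresponds to failing immediately and letting a higher layer 
retry; a positive budget corresponds to retrying for a fixed duration.

```java
import java.util.concurrent.Callable;

// Hypothetical sketch, not HBase client code.
final class CqtbeRetrySketch {
    /** Stand-in for the server's "call queue too big" response. */
    static final class QueueFullException extends Exception {
        QueueFullException(String msg) { super(msg); }
    }

    /**
     * Runs the call, retrying on QueueFullException until the retry budget
     * is used up. maxRetryMillis = 0 means fail fast on the first rejection.
     */
    static <T> T callWithBudget(Callable<T> call, long maxRetryMillis,
                                long pauseMillis) throws Exception {
        long deadline = System.nanoTime() + maxRetryMillis * 1_000_000L;
        while (true) {
            try {
                return call.call();
            } catch (QueueFullException e) {
                // Give up if sleeping again would reach or pass the deadline,
                // and hand the exception to the caller's own retry layer.
                if (System.nanoTime() + pauseMillis * 1_000_000L >= deadline) {
                    throw e;
                }
                Thread.sleep(pauseMillis);
            }
        }
    }
}
```

Either way the client gets to decide, which is not possible when the server 
simply stops reading requests.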

I'm all for improving the client/server interactions in these scenarios, and 
what I first outlined in this issue was one idea for how to do that more 
effectively.  However, I would also like us to avoid surprises for our users 
and regressions in server behavior.

I'm not sure of the exact symptoms you're trying to solve, but if you're seeing 
issues with meta being overloaded, then I'd suggest tuning the configuration 
for the number of priority handlers and size of the priority queues.  You could 
also evaluate running with meta hosted on master, which together with zk-less 
assignment can make region assignment much more stable.

> Add an option to set special retry pause when encountering 
> CallQueueTooBigException
> -----------------------------------------------------------------------------------
>
>                 Key: HBASE-17114
>                 URL: https://issues.apache.org/jira/browse/HBASE-17114
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Yu Li
>            Assignee: Yu Li
>         Attachments: HBASE-17114.patch
>
>
> As titled, after HBASE-15146 we throw {{CallQueueTooBigException}} 
> instead of dead-waiting. This is good for performance in most cases but might 
> have a side effect: if too many clients connect to a busy RS, retry 
> requests may arrive over and over again so that the RS never gets a chance to 
> recover. The issue becomes especially critical when the target 
> region is META.
> So in this JIRA we propose to supply a special retry pause for CQTBE 
> named {{hbase.client.pause.special}}, which by default will be 500ms (5 
> times the {{hbase.client.pause}} default value).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
