[
https://issues.apache.org/jira/browse/HBASE-17114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15677304#comment-15677304
]
Gary Helmling commented on HBASE-17114:
---------------------------------------
bq. Well, checking the uploaded patch, it's indeed tied to CQTBE only.
Introducing a new property is only for making things more flexible, and of
course we could use a hard-coded value, like 5 times the existing pause, for
CQTBE. But I'd say this is a trade-off: waiting longer for CQTBE could prevent
the vicious circle but also causes higher latency, and IMHO the user should be
able to control such a trade-off. If they don't want CQTBE to be special, they
could set hbase.client.pause.special to the same value as hbase.client.pause,
which gives them more options.
I agree with allowing the user to control the behavior here, but this also
increases the complexity and knowledge needed for configuration tuning, of
which we already have way too much. In general, we should be moving in the
direction of making the system dynamically tune itself according to load
instead of forcing all users to grapple with yet another configuration
property. By default, the configuration should be simple, so that it provides
the best experience for all users. For advanced users who really need to treat
CQTBE differently, that should be possible by means of an override, but it
should not be forced on everyone.
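For reference, this is roughly what that override would amount to on the
client side. A sketch only: it assumes the property name
{{hbase.client.pause.special}} and the 500ms default from the attached patch,
both of which are still up for discussion.
{code:java}
// Sketch of the proposed knob, assuming the property name and default from
// the attached patch (hbase.client.pause.special, 500ms); not a committed API.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class ClientPauseExample {
  public static void main(String[] args) {
    Configuration conf = HBaseConfiguration.create();

    // Base pause between retries for ordinary retriable failures (default 100ms).
    conf.setLong("hbase.client.pause", 100);

    // Proposed longer pause used only when the server answers with CQTBE.
    conf.setLong("hbase.client.pause.special", 500);

    // A user who does not want CQTBE treated specially would simply align the two:
    // conf.setLong("hbase.client.pause.special", 100);
  }
}
{code}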
bq. Sorry but I don't see any difference between "should not clear the client
meta cache" and "should not retry so frequently"; both are trying to resolve
some problem and make things better.
These are two completely different things. I don't see the equivalence. We
don't clear the meta cache because we don't have an indication that the region
has moved, so there is no need to go back to meta. The meta cache handling is
completely independent of what is appropriate in terms of retries.
bq. No offense, but I'm even thinking of making throwing CQTBE optional,
because in some cases dead-waiting for the request to be executed in the
RpcServer until time-out is preferable for the user to receiving an exception,
retrying, and failing again, but obviously this is another topic (Smile).
Blocking the RpcServer Reader threads indefinitely when the queue is full,
making the server completely unresponsive and spilling overflow back into the
OS networking buffers, is pretty poor behavior. CQTBE is a crude mechanism for
back-pressure to the client, but at least it gets the client a response and
allows it to make an informed decision about how to proceed. In the case where
the application implements its own retries, the client may want to simply fail
and kick the exception back up the stack, allowing other layers to retry. Or
the client could decide to retry for a fixed duration. But in either case I
think CQTBE provides a very clear improvement in overall server behavior.
Another part of the puzzle is the CoDel scheduler which will allow more useful
work to get done in overloaded situations.
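To make the application-side option concrete, here's a rough sketch of a
bounded retry around CQTBE. Assumptions: the built-in client retries are
disabled or exhausted so the exception actually reaches the application
(possibly wrapped, e.g. in RetriesExhaustedWithDetailsException, so real code
would unwrap accordingly); the helper name and back-off numbers are made up
for illustration.
{code:java}
// Hedged sketch only: one way an application that owns its retry policy could
// react to CQTBE, either failing fast or retrying within a fixed time budget.
import java.io.IOException;

import org.apache.hadoop.hbase.CallQueueTooBigException;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;

public class BoundedRetryOnCqtbe {
  // Retry a single put for at most maxMillis, backing off between attempts.
  static void putWithBoundedRetry(Table table, Put put, long maxMillis)
      throws IOException, InterruptedException {
    long deadline = System.currentTimeMillis() + maxMillis;
    long pause = 100;  // initial back-off; doubled after each CQTBE
    while (true) {
      try {
        table.put(put);
        return;
      } catch (CallQueueTooBigException e) {
        if (System.currentTimeMillis() + pause > deadline) {
          // Out of budget: kick the exception back up the stack and let the
          // caller (or another layer) decide what to do.
          throw e;
        }
        Thread.sleep(pause);
        pause = Math.min(pause * 2, 5000);  // cap the back-off
      }
    }
  }
}
{code}
Whether that retry budget lives in the application or inside the HBase client
is exactly the knob this issue is debating.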
I'm all for improving the client/server interactions in these scenarios, and
what I first outlined in this issue was one idea for how to do that more
effectively. However, I would also like us to avoid surprises for our users
and regressions in server behavior.
I'm not sure of the exact symptoms you're trying to solve, but if you're seeing
issues with meta being overloaded, then I'd suggest tuning the configuration
for the number of priority handlers and size of the priority queues. You could
also evaluate running with meta hosted on master, which together with zk-less
assignment can make region assignment much more stable.
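Something along these lines, for example. The key names are from memory:
{{hbase.regionserver.metahandler.count}} should be right, but please verify
{{hbase.ipc.server.priority.max.callqueue.length}} against the version you're
running.
{code:java}
// Rough sketch of the server-side tuning suggested above. In practice these
// would go into the region server's hbase-site.xml; they are shown
// programmatically here just to name the keys. Key names are from memory and
// should be verified against the deployed HBase version.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class PriorityRpcTuning {
  public static void main(String[] args) {
    Configuration conf = HBaseConfiguration.create();

    // More handlers dedicated to priority (e.g. meta) requests on each RS.
    conf.setInt("hbase.regionserver.metahandler.count", 30);

    // A deeper priority call queue so bursts of meta traffic don't hit CQTBE
    // immediately (assumed key name; confirm it exists in your version).
    conf.setInt("hbase.ipc.server.priority.max.callqueue.length", 200);
  }
}
{code}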
> Add an option to set special retry pause when encountering
> CallQueueTooBigException
> -----------------------------------------------------------------------------------
>
> Key: HBASE-17114
> URL: https://issues.apache.org/jira/browse/HBASE-17114
> Project: HBase
> Issue Type: Bug
> Reporter: Yu Li
> Assignee: Yu Li
> Attachments: HBASE-17114.patch
>
>
> As titled, after HBASE-15146 we will throw {{CallQueueTooBigException}}
> instead of dead-waiting. This is good for performance in most cases, but it
> might cause a side-effect: if too many clients connect to the busy RS, the
> retry requests may come over and over again and the RS never gets a chance
> to recover, and the issue becomes especially critical when the target region
> is META.
> So in this JIRA we propose to supply a special retry pause for CQTBE under
> the name of {{hbase.client.pause.special}}, and by default it will be 500ms
> (5 times the default value of {{hbase.client.pause}}).
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)