[
https://issues.apache.org/jira/browse/HBASE-23600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17153849#comment-17153849
]
Michael Stack commented on HBASE-23600:
---------------------------------------
Follow-up: I didn't get far w/ this patch and was unable to see much difference.
It needs more work. Perhaps a better route would be to put up a dedicated port
for metadata only, so that Master writes to hbase:meta always land?
> Improve chances of edits landing into hbase:meta even when high load
> --------------------------------------------------------------------
>
> Key: HBASE-23600
> URL: https://issues.apache.org/jira/browse/HBASE-23600
> Project: HBase
> Issue Type: Improvement
> Components: rpc
> Reporter: Michael Stack
> Priority: Major
> Attachments: priority.rpc.patch
>
>
> Of late I've been testing clusters under high load to study failures and to
> figure out how to effect recovery if the cluster is unable to recover on its
> own. One interesting case is a RS that is struggling, mostly because writes to
> HDFS are backed up and sync calls are taking a long time to complete. The RPC
> queue backs up with waiting requests and eventually goes over one or more
> bounds. The RS then starts throwing CallQueueTooBigExceptions (CQTBEs).
> This struggling state can last a good while. We throw CQTBEs whatever the
> priority of the incoming request.
> We throw CQTBE in two places: first, on the original parse of the request,
> before we dispatch it to a handler -- here we check the size of all queues
> and, if over the threshold (default 1G), throw the exception -- and then
> later, when we dispatch the request to the internal queues, we count the
> items in the queue and, if over the limit in any one queue (default is
> 10 * handler count), we fail the dispatch and again throw CQTBE.
> We shouldn't be running w/ big queues. We should be rejecting requests we
> know we'll never process in time, before the client loses interest (See the
> CoDel thesis and the implementations added a good while back. See the
> splitting meta project so all requests don't end up on one server). TODO.
> Meantime, I was looking to see what would happen if, having read a
> high-priority request, I let it through even when above thresholds rather
> than dropping it on the floor. My main concern is edits to hbase:meta.
> Under sustained, saturated load on the RS carrying hbase:meta, edits may not
> land. The result is incomplete Procedures and a disorientated Master. I was
> experimenting, trying to put off the corruption as long as possible
> (CoDel doesn't do priority at first blush; we probably want to add this).
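
The two rejection points described in the quoted issue, and the experiment
of letting high-priority calls through, can be modeled roughly as below. This
is only a sketch under stated assumptions, not HBase's RPC server code and
not the attached patch: the class name CallQueueSketch, the HIGH_PRIORITY
cutoff, the admitHighPriority toggle, and the single-queue simplification are
all hypothetical, and the exception is a local stand-in that merely borrows
the name CallQueueTooBigException.

```java
import java.util.ArrayDeque;
import java.util.Queue;

/** Toy model of the two CQTBE checks; illustrative only, not HBase code. */
class CallQueueSketch {
    /** Local stand-in for HBase's CallQueueTooBigException. */
    static class CallQueueTooBigException extends RuntimeException {
        CallQueueTooBigException(String msg) { super(msg); }
    }

    /** Hypothetical cutoff: calls at or above this count as high priority. */
    static final int HIGH_PRIORITY = 100;

    private final long maxTotalBytes;        // bound on queued bytes (issue: default 1G)
    private final int maxCallsPerQueue;      // bound on queued calls (issue: 10 * handler count)
    private final boolean admitHighPriority; // hypothetical toggle for the experiment
    private final Queue<Integer> queue = new ArrayDeque<>(); // one queue of call sizes
    private long totalBytes = 0;

    CallQueueSketch(long maxTotalBytes, int handlerCount, boolean admitHighPriority) {
        this.maxTotalBytes = maxTotalBytes;
        this.maxCallsPerQueue = 10 * handlerCount;
        this.admitHighPriority = admitHighPriority;
    }

    /** Enqueue a call, or throw if a bound is exceeded and no bypass applies. */
    void offer(int sizeBytes, int priority) {
        boolean bypass = admitHighPriority && priority >= HIGH_PRIORITY;

        // Check 1: made when the request is first parsed, on total queued bytes.
        if (!bypass && totalBytes + sizeBytes > maxTotalBytes) {
            throw new CallQueueTooBigException("total queued bytes over limit");
        }
        // Check 2: made at dispatch time, on the number of calls in the queue.
        if (!bypass && queue.size() >= maxCallsPerQueue) {
            throw new CallQueueTooBigException("queue length over limit");
        }
        queue.add(sizeBytes);
        totalBytes += sizeBytes;
    }

    int queued() { return queue.size(); }
}
```

With admitHighPriority set to false, every call over a bound is rejected
whatever its priority, which is the behavior the issue complains about. With
it set to true, a high-priority call is queued even when the bounds are
exceeded, at the cost of letting the queue grow past its configured limits.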