[ 
https://issues.apache.org/jira/browse/HBASE-23600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17153849#comment-17153849
 ] 

Michael Stack commented on HBASE-23600:
---------------------------------------

Follow-up: I didn't get far w/ this patch; I was unable to see much difference. 
It needs more work. Perhaps a better route would be to open a dedicated port for 
metadata only, so that Master writes to hbase:meta always land?

> Improve chances of edits landing into hbase:meta even when high load
> --------------------------------------------------------------------
>
>                 Key: HBASE-23600
>                 URL: https://issues.apache.org/jira/browse/HBASE-23600
>             Project: HBase
>          Issue Type: Improvement
>          Components: rpc
>            Reporter: Michael Stack
>            Priority: Major
>         Attachments: priority.rpc.patch
>
>
> Of late I've been testing clusters under high load to study failures and to 
> figure out how to effect recovery if the cluster is unable to recover on its own.
> One interesting case is an RS that is struggling mostly because writes to HDFS 
> are backed up and sync calls are taking a long time to complete. The RPC 
> queues back up with waiting requests and eventually exceed one or more 
> bounds. The RS then starts throwing CallQueueTooBigExceptions (CQTBEs). 
> This struggling state can last a good while. We throw CQTBEs whatever the 
> priority of the incoming request.
> We throw CQTBE in two places. First, on the original parse of the request, 
> before we dispatch it to a handler: here we check the size of all queues and, 
> if it is over the threshold (default 1G), throw the exception. Second, later, 
> when we dispatch the request to the internal queues: we count the items in 
> the queue and, if any one queue is over its limit (default is 
> 10 * handler count), we fail the dispatch and again throw CQTBE.
> We shouldn't be running w/ big queues. We should be rejecting requests we 
> know we'll never process in time, before the client loses interest (see the 
> CoDel thesis and the implementations added a good while back; see also the 
> splitting meta project, so all requests don't end up on one server). TODO.
> In the meantime, I was looking to see what would happen if, having read a 
> high-priority request, I let it through even when above thresholds rather 
> than dropping it on the floor. My main concern is edits to hbase:meta. 
> Under sustained, saturated load on the RS carrying hbase:meta, edits may not 
> land. The result is incomplete Procedures and a disorientated Master. I was 
> experimenting, trying to put off the corruption for as long as possible 
> (CoDel doesn't do priority at first blush; we probably want to add this).
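
For illustration only, the two rejection points in the quoted description, plus 
the experiment of letting high-priority requests through, could be modeled 
roughly as below. This is a simplified sketch and not HBase's actual code or the 
attached patch: the class AdmissionSketch, the Call type, the HIGH_PRIORITY 
cutoff, and the admitHighPriority switch are all hypothetical names, and the 
exception is a local stand-in for CallQueueTooBigException. It also models a 
single queue, where the description talks about several.

```java
import java.util.ArrayDeque;
import java.util.Queue;

class AdmissionSketch {
    /** Local stand-in for HBase's CallQueueTooBigException (hypothetical type). */
    static class CallQueueTooBigException extends RuntimeException {
        CallQueueTooBigException(String msg) { super(msg); }
    }

    /** Minimal request model: only a priority and a size in bytes. */
    static final class Call {
        final int priority;
        final long sizeBytes;
        Call(int priority, long sizeBytes) {
            this.priority = priority;
            this.sizeBytes = sizeBytes;
        }
    }

    // Assumption: calls at or above this value count as "high priority".
    static final int HIGH_PRIORITY = 200;

    final long maxQueuedBytes;        // bound on total bytes queued (description: default 1G)
    final int maxQueueLength;         // bound on items in a queue (description: 10 * handler count)
    final boolean admitHighPriority;  // the experiment: let high-priority calls bypass both bounds
    final Queue<Call> queue = new ArrayDeque<>();
    long queuedBytes = 0;

    AdmissionSketch(long maxQueuedBytes, int handlerCount, boolean admitHighPriority) {
        this.maxQueuedBytes = maxQueuedBytes;
        this.maxQueueLength = 10 * handlerCount;
        this.admitHighPriority = admitHighPriority;
    }

    void submit(Call call) {
        boolean bypass = admitHighPriority && call.priority >= HIGH_PRIORITY;

        // Check 1, at request parse time: total size of everything already queued.
        if (!bypass && queuedBytes + call.sizeBytes > maxQueuedBytes) {
            throw new CallQueueTooBigException("queued bytes over threshold");
        }
        // Check 2, at dispatch time: number of items waiting in the queue.
        if (!bypass && queue.size() >= maxQueueLength) {
            throw new CallQueueTooBigException("queue length over threshold");
        }
        queue.add(call);
        queuedBytes += call.sizeBytes;
    }
}
```

With admitHighPriority set to false, both checks apply to every call whatever 
its priority, which is the behavior the description complains about. Note that 
the bypass only gets a call into the queue; it does not make a saturated RS 
process it any sooner.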



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
