[ 
https://issues.apache.org/jira/browse/KUDU-1395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15227289#comment-15227289
 ] 

Todd Lipcon commented on KUDU-1395:
-----------------------------------

A couple ideas to solve this:

1) we could make KeepAlive retry against its original deadline, like other 
RPCs do, so that each retry has less time remaining and thus higher priority.
Pro: no server-side changes needed
Con: typically, a client would like a KeepAlive call to be lightweight and fast.
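
A minimal sketch of what option 1 might look like on the client side, assuming 
a hypothetical send_keep_alive(deadline) callable and a hypothetical 
ServerTooBusy exception standing in for a SERVER_TOO_BUSY rejection (neither 
is the actual Kudu client API). The deadline is computed once and reused for 
every attempt, so later attempts carry less remaining time:

```python
import time


class ServerTooBusy(Exception):
    """Hypothetical stand-in for a SERVER_TOO_BUSY rejection."""


def keep_alive_with_retry(send_keep_alive, timeout_s, backoff_s=0.01,
                          clock=time.monotonic, sleep=time.sleep):
    """Retry a KeepAlive until it succeeds or the original deadline passes.

    send_keep_alive is an assumed callable taking the absolute deadline.
    The deadline is fixed up front, so an earliest-deadline-first server
    would rank each retry higher as the time remaining shrinks.
    """
    deadline = clock() + timeout_s
    attempts = 0
    while True:
        if deadline - clock() <= 0:
            raise TimeoutError(f"KeepAlive timed out after {attempts} attempts")
        attempts += 1
        try:
            return send_keep_alive(deadline)
        except ServerTooBusy:
            # Back off, but never sleep past the deadline.
            sleep(min(backoff_s, max(0.0, deadline - clock())))
            backoff_s *= 2
```

The cost noted in the "Con" above shows up here directly: a call that was 
meant to be cheap can now block the caller for up to timeout_s.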

2) we could add a new RPC system feature such that certain RPCs are allowed in 
a "fast lane"
- fast-lane RPCs would be limited to those that we know consume very few 
resources and won't block on locks (e.g. keepalives or liveness heartbeats)
- these RPCs would take higher priority over all other RPCs regardless of 
deadline.
- we would probably start with a server-side annotation of which RPCs are 
fast-lane, rather than trusting clients to prioritize.
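
To make option 2 concrete, here is a rough sketch of a bounded 
earliest-deadline-first queue with a fast lane. Everything in it is 
illustrative: RpcQueue, FAST_LANE_METHODS and the method names are 
assumptions, not the Kudu RPC implementation, and the fast lane is left 
unbounded only to keep the example short.

```python
import collections
import heapq
import itertools

# Hypothetical server-side annotation of which RPC methods are fast-lane.
FAST_LANE_METHODS = frozenset({"ScannerKeepAlive"})


class RpcQueue:
    """Bounded earliest-deadline-first queue with a fast lane (sketch)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._fast = collections.deque()  # fast-lane calls, served FIFO
        self._normal = []                 # heap of (deadline, seq, method)
        self._seq = itertools.count()     # tie-breaker for equal deadlines

    def put(self, method, deadline):
        """Enqueue a call; return the name of a rejected call, or None."""
        if method in FAST_LANE_METHODS:
            # The server decides this from the method, not from the client.
            self._fast.append(method)
            return None
        heapq.heappush(self._normal, (deadline, next(self._seq), method))
        if len(self._normal) > self.capacity:
            # Overloaded: reject the call with the latest deadline.
            victim = max(self._normal)
            self._normal.remove(victim)
            heapq.heapify(self._normal)
            return victim[2]
        return None

    def get(self):
        """Return the next call to run, or None if the queue is empty."""
        if self._fast:
            return self._fast.popleft()
        if self._normal:
            return heapq.heappop(self._normal)[2]
        return None
```

With this shape, a keepalive is served ahead of queued work even when the 
normal lane is full, which is why the fast lane would need to be limited to 
cheap, non-blocking RPCs.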

3) some fancier scheduler which tries to estimate and take into account RPC 
costs, and not just deadlines
- I'm aware of some research going on around this idea (unfortunately can't 
reference it yet since it's a pre-print). This can help both with multitenant 
fairness and better scheduling within a tenant. I'll ping the folks working on 
this research and see what the plans are for publication of the idea, since it 
might be a good fit.


> Scanner KeepAlive requests can get starved on an overloaded server
> ------------------------------------------------------------------
>
>                 Key: KUDU-1395
>                 URL: https://issues.apache.org/jira/browse/KUDU-1395
>             Project: Kudu
>          Issue Type: Bug
>          Components: impala, rpc, tserver
>    Affects Versions: 0.8.0
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>
> As of 0.8.0, the RPC system schedules RPCs on an earliest-deadline-first 
> basis, rejecting those with later deadlines. This works well for RPCs which 
> are retried on SERVER_TOO_BUSY errors, since the retries maintain the 
> original deadline and thus get higher and higher priority as they get closer 
> to timing out.
> We don't, however, do any retries on scanner KeepAlive RPCs. So, if a 
> keepalive RPC arrives at a heavily overloaded tserver, it will likely get 
> rejected, and won't retry. This means that Impala queries or other long scans 
> that rely on KeepAlives will likely fail on overloaded clusters since the 
> KeepAlive never gets through.



