[ 
https://issues.apache.org/jira/browse/KUDU-2955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17447192#comment-17447192
 ] 

Alexey Serbin commented on KUDU-2955:
-------------------------------------

Thank you for taking a look at this, [~redriver]!

Looking at the root cause of the issue and the proposal above, I think that in
the scope of this JIRA item we might address the root cause directly (the
wrong placement of the RPC methods) without changing the underlying logic of
dispatching the RPCs and registering the services.  The rationale is that from
the design perspective it's better to keep methods of different nature/domain
in separate services, so the fact that admin and non-admin RPCs are all served
by the {{MasterService}} service is an issue we will need to address in the
long run anyway.  Introducing an extra layer in the RPC lookup and RPC
registration to work around the wrong placement of the RPC methods might be a
path forward as well, but I think we need to explore some other alternatives
before starting implementation.

I think that we might explore an alternative way to address the issue without
introducing any extra mapping layer or extra dynamics for the registration of
RPC methods, and still without breaking backwards compatibility with existing
versions of the Kudu client API.  Given that the versions of the
{{kudu-master}} and {{kudu-tserver}} binaries (and even the binary of the
{{kudu}} CLI tool) are usually kept in sync in a Kudu cluster, there is an
alternative: introduce a new RPC interface (say, {{MasterAdminService}}) to run
along with {{MasterService}} and expose all those admin methods in that
interface as well.  At some point (maybe at Kudu 2.0, which doesn't have to be
compatible with Kudu 1.x versions since we use semantic versioning), those
admin methods can be removed from {{MasterService}} (so only the non-admin
methods will be there in Kudu 2.0) and be present only in
{{MasterAdminService}}.  Meanwhile, Kudu 1.x masters will have those methods
in two interfaces: {{MasterService}} and {{MasterAdminService}}.
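
To make the idea a bit more concrete, here is a rough Python sketch (a toy
model, not Kudu code).  It assumes each registered service gets its own bounded
queue; {{ToyMaster}}, {{ServiceQueue}} and the queue capacity are hypothetical
names and values made up for the illustration, and only the service and method
names come from the discussion above.

```python
from collections import deque


class ServiceQueue:
    """Bounded queue of pending calls for a single RPC service (toy model)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pending = deque()

    def offer(self, call):
        # Reject the call when the queue is full, as an overloaded service would.
        if len(self.pending) >= self.capacity:
            return False
        self.pending.append(call)
        return True


class ToyMaster:
    """Hypothetical master: one queue per registered service."""

    def __init__(self, queue_capacity):
        self.queue_capacity = queue_capacity
        self.methods = {}  # service name -> set of method names
        self.queues = {}   # service name -> ServiceQueue

    def register_service(self, service, methods):
        self.methods[service] = set(methods)
        self.queues[service] = ServiceQueue(self.queue_capacity)

    def submit(self, service, method):
        if method not in self.methods.get(service, ()):
            raise KeyError(f"{service}.{method} is not registered")
        return self.queues[service].offer((service, method))


CLIENT_METHODS = ["ConnectToMaster", "GetTableLocations", "GetTableSchema"]
ADMIN_METHODS = ["TSHeartbeat"]

master = ToyMaster(queue_capacity=2)
# 1.x layout: admin methods stay in MasterService for compatibility
# and are also exposed via the new MasterAdminService.
master.register_service("MasterService", CLIENT_METHODS + ADMIN_METHODS)
master.register_service("MasterAdminService", ADMIN_METHODS)

# A burst of heartbeats overflows only the admin queue.
heartbeat_results = [
    master.submit("MasterAdminService", "TSHeartbeat") for _ in range(5)
]
# A client request still gets in, since MasterService has its own queue.
client_result = master.submit("MasterService", "GetTableSchema")
# An older tablet server can still heartbeat via MasterService.
legacy_result = master.submit("MasterService", "TSHeartbeat")
```

In this toy model the heartbeat burst fills only the {{MasterAdminService}}
queue, while the client request and the legacy heartbeat sent via
{{MasterService}} are both accepted.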

For Kudu clusters of newer versions, the tablet servers and the masters are to
be updated to use those methods as a part of the newly introduced
{{MasterAdminService}} (the only caveat is that it will be necessary to upgrade
masters first, and then upgrade tablet servers, but not vice versa).  With
that, Kudu clients of prior and later versions will still send non-admin
requests via the {{MasterService}}, but masters and tablet servers will start
sending admin RPCs via the {{MasterAdminService}}, and that should be good
enough already: those RPCs will be placed into separate queues.  During
rolling upgrades, the older Kudu tablet servers will still be able to
communicate with newer Kudu masters, and once all the binaries in the cluster
are updated, all tablet servers will use only the {{MasterAdminService}} to
send admin requests to masters.
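
The upgrade-order caveat can be spelled out as a small compatibility matrix.
This is only a sketch under the assumptions stated above: "old" and "new" are
hypothetical labels for binaries before and after the change, new tablet
servers send {{TSHeartbeat}} only via {{MasterAdminService}}, and new masters
serve both interfaces.

```python
# Services exposed by a master of each (hypothetical) version label.
MASTER_SERVICES = {
    "old": {"MasterService"},
    "new": {"MasterService", "MasterAdminService"},
}

# Service a tablet server of each version label sends TSHeartbeat to.
TSERVER_HEARTBEAT_SERVICE = {
    "old": "MasterService",
    "new": "MasterAdminService",
}


def heartbeat_accepted(tserver_version, master_version):
    """True if the master exposes the service the tablet server targets."""
    target = TSERVER_HEARTBEAT_SERVICE[tserver_version]
    return target in MASTER_SERVICES[master_version]


# Keys are (tserver version, master version).
compat = {
    (ts, m): heartbeat_accepted(ts, m)
    for ts in ("old", "new")
    for m in ("old", "new")
}
```

The only failing combination in {{compat}} is a new tablet server talking to
an old master, which is why masters would have to be upgraded first.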

Dealing with the kudu CLI tool is a bit more nuanced: we want to make sure the
kudu CLI tool of newer versions is able to work with Kudu clusters of older
versions.  IIRC, we already have code in the kudu CLI that takes into account
the version of the servers it's working with (e.g., check {{FetchFlags()}} and
the handling of the 3-2-3 vs 3-4-3 replica management scheme).
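
As a rough illustration of what version-aware selection in the CLI might look
like, here is a sketch in Python (not the actual kudu CLI code).  The threshold
{{ADMIN_SERVICE_SINCE}} is a placeholder, since the release that would
introduce {{MasterAdminService}} is not decided, and the helper names are
hypothetical.  It also assumes plain "X.Y.Z" version strings.

```python
# Placeholder: first server version assumed to expose MasterAdminService.
ADMIN_SERVICE_SINCE = (1, 99, 0)


def parse_version(text):
    """Parse a plain 'X.Y.Z' version string into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def admin_service_for(server_version):
    """Pick the service to use for admin RPCs against a given server version."""
    if parse_version(server_version) >= ADMIN_SERVICE_SINCE:
        return "MasterAdminService"
    # Older masters expose admin methods only via MasterService.
    return "MasterService"
```

With that, a newer CLI would keep sending admin RPCs via {{MasterService}} to
masters older than the threshold and switch to {{MasterAdminService}}
otherwise.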

What do you think?

> kudu-master: separate RPC service queues for TSHeartbeat from client-facing 
> RPCs
> --------------------------------------------------------------------------------
>
>                 Key: KUDU-2955
>                 URL: https://issues.apache.org/jira/browse/KUDU-2955
>             Project: Kudu
>          Issue Type: Improvement
>          Components: master, rpc
>            Reporter: Alexey Serbin
>            Priority: Major
>
> As of now, all client-related RPCs like {{ConnectToMaster}}, 
> {{GetTabletLocations}}, {{GetTableLocations}}, {{GetTableSchema}}, etc., 
> tserver-related RPC {{TSHeartbeat}}, and other administrative RPCs are all 
> put into the same RPC service queue upon arrival.  In some cases of 
> congestion (e.g., full tablet reports from all tablet servers in a cluster 
> upon the change in the master leadership) and aggravating factors such as 
> slow master's WAL, that might lead to dropping requests sent from Kudu 
> clients to master, even if the inflow of client requests isn't high and the 
> client request might be served in parallel with processing {{TSHeartbeat}} 
> sent from tablet servers.
> It would be nice to put all {{TSHeartbeat}} requests and other administrative 
> requests into one service queue, and all the client-originated requests into 
> another.  That way spikes of RPC inflow from clients would not affect 
> internal cluster bookkeeping and vice versa.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
