I encountered a weird behavior in gRPC.


*The symptom* - an active RPC stream is signaled as cancelled on the server 
side (it happens from time to time; I couldn't find any correlation with 
other events in the environment) although the client is active and the 
stream shouldn't be closed.

It happens for streams initialized as response streams in RPC calls from 
both C++ and NodeJS clients. *It happened on gRPC v1.3.6 and still happens 
on gRPC v1.6.0*.

The problem does not reproduce easily - the system runs under heavy load 
for many hours before it happens.


In my code, I have 2 main types of streams:


   1. Control stream (C++→C#) - the client initiates an RPC call to the 
   server, which keeps the RPC's response stream open.
   Those streams are used as *control channels* with the *C++ clients* and 
   are kept open to allow server-to-client requests. When they are closed, 
   both client and server clean up all data related to the connection, so 
   the control stream is critical to the session.
   The server registers for the call's cancellation notification (a fuller 
   sketch of this handler appears after this list):
     ServerCallContext context; // received as a parameter of the RPC method
     // ...
     context.CancellationToken.Register(() =>
         System.Threading.ThreadPool.QueueUserWorkItem(
             async obj => { handle_disconnection(...); }));
   
   The total number of open control streams (i.e., the number of connected 
   C++ clients) is ~1200. 
   2. Command stream (NodeJS→C#) - there are many other streams for 
   server-to-client command-response communication, which are kept open in 
   parallel by the server with *NodeJS clients*. The total number of open 
   streams is 20K-30K. 
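
To make the control-stream mechanics concrete, here is a minimal sketch of 
the handler from point 1 above, assuming Grpc.Core. ControlService, 
ControlRequest, ControlMessage, OpenControlStream and HandleDisconnection 
are simplified placeholders, not my actual API:

    using System.Threading;
    using System.Threading.Tasks;
    using Grpc.Core;

    public class ControlServiceImpl : ControlService.ControlServiceBase
    {
        public override async Task OpenControlStream(
            ControlRequest request,
            IServerStreamWriter<ControlMessage> responseStream,
            ServerCallContext context)
        {
            var closed = new TaskCompletionSource<object>();

            // The registration from point 1: this callback fires when the
            // server-side call is cancelled.
            context.CancellationToken.Register(() =>
            {
                ThreadPool.QueueUserWorkItem(_ => HandleDisconnection(context));
                closed.TrySetResult(null);
            });

            // Park here so the RPC (and its response stream) stays open,
            // letting the server push requests to the client at any time.
            await closed.Task;
        }

        private void HandleDisconnection(ServerCallContext context)
        {
            // Placeholder: cleanup of all per-connection data, including
            // intentionally closing the related command streams.
        }
    }

In the failure scenario, the Register callback above is exactly what fires, 
even though the client hasn't closed anything.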

The problem is noticeable when the control streams get disconnected.

*Further investigation of the client (C++) and server (C#) logs of a 
control stream disconnection revealed the following*:


   1. For some reason, the server's cancellation token (the one registered 
   above) is signaled - and the server does its cleanup 
   (`handle_disconnection`, which also closes many command streams 
   intentionally). *According to the client, the connection should have 
   remained open.*
   2. After some time, the client realizes the connection was closed 
   unexpectedly and does its cleanup - throwing the error I posted here 
   <https://github.com/grpc/grpc/issues/12425#issuecomment-329958701> 
   (NodeJS in that case). *The client disconnects itself only after the 
   server disconnects the connection and control stream.*

Another note - I set the servers' RequestCallTokensPerCompletionQueue value 
to 32768 (32K) per completion queue, for both the C++ and NodeJS client 
interfaces.

I have 2 server interfaces (for the NodeJS clients and the C++ clients, 
which have different APIs) and 4 completion queues (for an 8-core machine). 
I don't really know whether the 4 completion queues are global or 
per-server.
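
For reference, the server setup looks roughly like this - a simplified 
sketch assuming Grpc.Core, where the port and the service binding are 
placeholders (in reality there are two server interfaces, one per client 
type):

    using System;
    using Grpc.Core;

    class ServerSetup
    {
        static void Main()
        {
            // Completion queue count is a process-wide setting on
            // GrpcEnvironment and must be set before creating the servers.
            GrpcEnvironment.SetCompletionQueueCount(4);

            var server = new Server
            {
                Services = { ControlService.BindService(new ControlServiceImpl()) },
                Ports = { new ServerPort("0.0.0.0", 50051, ServerCredentials.Insecure) },
                // 32K pending call tokens per completion queue.
                RequestCallTokensPerCompletionQueue = 32768
            };
            server.Start();

            Console.ReadKey();
            server.ShutdownAsync().Wait();
        }
    }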

*Do you think this might cause those streams to be closed under heavy 
load?*

 

In any case, my suspicion falls on the C# server behavior - the 
CancellationToken is signaled for no apparent reason.

I *didn't* rule out network instability yet - although both the clients and 
the server are located on the same ESX host with 10-gig virtual adapters 
between them, so this is quite a long shot.

 

Do you have any idea how to solve this?

Thanks!
