The "serverCallContext.CancellationToken" is triggered any time the
server-side call terminates prematurely with an error (it doesn't
necessarily mean there was a cancellation, it happens anytime there's some
error). Based on what you're saying (it takes hours of heavy load to
reproduce), it might well just be that this is a bug in C# or C-core stack,
that rarely gets triggered - but it's hard to tell with certainty unless
you're able to come up with some evidence (like traces showing what exactly
went wrong).
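
If it helps with collecting that evidence, here is a minimal sketch of the
kind of logging I mean, assuming the Grpc.Core server API (the service,
handler and message names and the Log() helper are placeholders, not your
actual code). Pairing it with C-core tracing (GRPC_VERBOSITY=DEBUG,
GRPC_TRACE=api,op_failure) should show what actually terminated the call:

using System;
using System.Threading.Tasks;
using Grpc.Core;

// Fragment of the generated service implementation (names are made up).
public class ControlService : Control.ControlBase
{
    public override async Task ControlStream(ControlRequest request,
        IServerStreamWriter<ControlMessage> responseStream,
        ServerCallContext context)
    {
        context.CancellationToken.Register(() =>
        {
            // Fires on any premature termination of the call, not only on
            // an explicit cancel from the client.
            // Log() stands in for whatever logging you already use.
            Log($"{context.Method} from {context.Peer} terminated at " +
                $"{DateTime.UtcNow:o} (deadline was {context.Deadline:o})");
        });

        // ... the rest of the handler awaits writes on responseStream ...
    }
}

Correlating those timestamps with the client-side logs and the C-core
traces should at least tell you which side initiated the termination.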
Can you reproduce this on a single machine (that would rule out network
problems)? Can you reproduce it with a much lower number of concurrently
open streams? Can you reproduce it when only using C++ clients and servers
(that would point to a potential problem in C-core)?
Have you checked the values of the relevant channel arguments? (e.g.
https://github.com/grpc/grpc/blob/master/include/grpc/impl/codegen/grpc_types.h#L146)
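
For reference, in C# those channel arguments can be passed as ChannelOptions
when constructing the server and the channels. A sketch, assuming Grpc.Core
(the host/port and the keepalive values below are placeholders, not a
recommendation):

using Grpc.Core;

class ChannelArgsExample
{
    static void Main()
    {
        // Channel arguments from grpc_types.h are passed by their string names.
        var options = new[]
        {
            new ChannelOption("grpc.keepalive_time_ms", 30000),
            new ChannelOption("grpc.keepalive_timeout_ms", 10000),
            new ChannelOption("grpc.http2.max_pings_without_data", 0),
        };

        // The same options collection can be applied on both sides.
        var server = new Server(options);   // add Services/Ports as usual
        var channel = new Channel("serverhost", 50051,
            ChannelCredentials.Insecure, options);
    }
}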
On Monday, September 18, 2017 at 3:25:18 PM UTC+2, Amit Waisel wrote:
>
> I encountered a weird behavior in gRPC.
>
>
> *The symptom* - an active RPC stream is signaled as cancelled on the
> server side (this happens from time to time; I couldn't find any
> correlation with other events in the environment), although the client is
> active and the stream shouldn't be closed.
>
> It happens to streams opened as response streams of RPC calls from both
> C++ and NodeJS clients. *It happened on gRPC 1.3.6 and still happens on
> gRPC v1.6.0*.
>
> The problem does not reproduce easily - the system has to run under heavy
> load for many hours before this happens.
>
>
> In my code, I have 2 main types of streams:
>
>
> 1. Control stream (C++→C#) - the client initiates an RPC call to the
> server, which keeps the RPC's response stream open.
> Those streams are used as *control channels* with the *C++ clients* and
> are kept open to allow server-to-client requests. When they are closed,
> both client and server clean up all data related to the connection, so
> the control stream is critical to the session.
> The server registers for the call cancellation notification:
> ServerCallContext context; // received as a parameter of the RPC handler
> // ...
> context.CancellationToken.Register(() =>
>     System.Threading.ThreadPool.QueueUserWorkItem(
>         async obj => { handle_disconnection(...); }));
>
> The total number of open control streams (i.e. the number of connected
> C++ clients) is ~1200.
> 2. Command stream (NodeJS→C#) - there are many other streams for
> server-to-client command-response communication, which are kept open in
> parallel by the server with *NodeJS clients*. The total number of open
> streams is 20K-30K.
>
> The problem is noticeable when the control streams get disconnected.
>
> *Further investigation of the client (C++) and server (C#) logs around a
> control stream disconnection revealed the following*:
>
>
> 1. For some reason, the server's cancellation token (the one
> registered above) is signaled - and the server does its cleanup
> (`handle_disconnection`, which also intentionally closes many command
> streams). *According to the client, the connection should have
> remained open.*
> 2. After some time, the client realizes the connection was closed
> unexpectedly and does its own cleanup - throwing the error I posted here
> <https://github.com/grpc/grpc/issues/12425#issuecomment-329958701>
> (NodeJS in that case). *The client disconnects itself only after the
> server closes the connection and the control stream.*
>
> Another note - I set the server's RequestCallTokensPerCompletionQueue
> value to 32768 (32K) per completion queue, for both the C++ and NodeJS
> client interfaces.
>
> I have 2 server interfaces (for the NodeJS clients and the C++ clients,
> which have different APIs) and 4 completion queues (on an 8-core
> machine). I don't really know whether the 4 completion queues are global
> or per-server.
>
> *Do you think it might cause those streams to be closed under heavy load*?
>
>
>
> In any case, my suspicion is about the C# server's behavior - the
> CancellationToken is signaled for no apparent reason.
>
> I *didn't* rule out network instability yet - although both the clients
> and the server are located on the same ESX host with 10-gig virtual
> adapters between them, so this is quite a long shot.
>
>
>
> Do you have any idea how to solve this?
>
> Thanks!
>