Hey! Thank you for thinking along! Basically, they do execute the same code as far as possible. With 2 queues, they inevitably have to be handled in 2 different threads, as cq->Next is blocking. Of course there are differences between normal request/response events and stream events. I've written the sample code separately to clearly show the issue.
Server side, the choice of queue is made at startup:

1. Request/response calls are queued on cq1. The gRPC-generated async methods (in *.grpc.pb.h), e.g. Request*Call*(context, request, response, new_call_cq, notification_cq, tag), are called with new_call_cq=cq1 and notification_cq=cq1.
2. Stream calls are queued on cq2. Those gRPC-generated async methods are called with new_call_cq=cq2 and notification_cq=cq2.

This choice was made because calls on queue 1 are allowed to block and should not interfere with events on the stream queue (2).

To give you an idea: the whole concept has been functional for some months now and works beautifully under nominal behavior. However, when stressed (in a test scenario), the startup/reconnect behavior shows the described inconsistencies. Using 1 cq would work 100% of the time, as the issue only reproduces with (at least) 2 queues. I do not see any issues with queue draining and server shutdown as of yet. Yes, the examples could use some improvements!

I've managed to reproduce it in GDB, which shows me the thread for cq1 is sleeping (10 s), while the second thread is somewhere deep inside cq2.Next. I hope to find an answer, or pointers for debugging why cq2.Next does not return until cq1.Next is called again, even though the events for cq2 arrived on the socket well before. I can make the sleep before cq1.Next as long as I want.

I could try to create a separate minimal reproduction project to demonstrate the issue, but gRPC requires quite a lot of handling before things start to work. I will work on this in parallel.

On Tuesday, November 13, 2018 at 3:33:08 PM UTC+1, [email protected] wrote:
>
> Hi,
>
> (I am not a gRPC specialist in any way.)
>
> But I would say no, you are not correct. You can have multiple CQs if
> you want, but if you do, they must all execute the exact same code
> (that would be HandleRpcs in the samples).
>
> Back to your code: how would a client request "choose" the server queue
> to be processed on? It is going to be processed by one of the queues
> "randomly" (not really, but I don't know the algorithm gRPC uses for
> this).
>
> How are you setting up your routes? If you really want two separate CQs
> (i.e. with a different base class, e.g. CallData vs StreamData), they
> should be started with separate routes (maybe that's already the case?).
>
> In general, you should start by having a non-threaded version that is
> functional. Once that's done, you can start threading hot spots, with
> the help of a profiler. So if I were you, I would:
> - start with only 1 cq
> - write test client(s), and make sure it works 100% of the time
> - later: add more cqs
>
> You can use MemorySanitizer and ThreadSanitizer to check your threaded
> code. Warning: proper server shutdown & queue draining is NOT shown in
> the gRPC samples.
>
> On Monday, November 12, 2018 at 3:00:31 PM UTC+1, BeyondDefinition wrote:
>>
>> The following minimized (pseudo)code describes the reproduction
>> scenario. The two threads run these loops in parallel:
>>
>> // Thread 1 (request/response queue):
>> void* got_tag;
>> bool ok;
>> while (cq1->Next(&got_tag, &ok))
>> {
>>     static_cast<CallData*>(got_tag)->Update(ok);
>>     printf("Sleeping 10s\n");
>>     std::this_thread::sleep_for(10s);
>> }
>>
>> // Thread 2 (stream queue):
>> void* got_tag;
>> bool ok;
>> printf("Waiting for stream events\n");
>> while (cq2->Next(&got_tag, &ok))
>> {
>>     printf("New stream event\n");
>>     static_cast<StreamData*>(got_tag)->Update(ok);
>>     printf("Stream event finished\n");
>> }
>>
>> In the error situation, the following output is generated:
>>
>> Waiting for stream events
>> (...)
>> Sleeping 10s
>> (10 seconds delay)
>> New stream event
>> Stream event finished
>> New stream event
>> Stream event finished
>>
>> On Monday, November 12, 2018 at 2:30:52 PM UTC+1, [email protected] wrote:
>>>
>>> I am working on an asynchronous server-side integration of gRPC in
>>> C++. I have already solved quite some mistakes and misunderstandings,
>>> and overall it is very stable.
>>> Just one issue in the startup behavior is making my life difficult for
>>> the time being.
>>>
>>> *Introduction:*
>>> I wrote a test that starts 1 client and restarts the server side
>>> multiple times. By restarting I mean shutting down the completion
>>> queues with their attached threads, including the grpc::Server, and
>>> re-creating them. The client is never restarted and just reconnects.
>>> This consistently happens without any lockups or complaints from gRPC.
>>>
>>> Server-side there are 2 CompletionQueues, handled in 2 separate threads:
>>> 1. accepts requests from the client and responds using ServerAsyncResponseWriter.
>>> 2. accepts streams from the client and sends updates from server to client using ServerAsyncWriter.
>>>
>>> Client-side there is 1 CompletionQueue handling ClientAsyncReader
>>> events in its own thread. Requests to the server are implemented
>>> synchronously. The backoff algorithm is configured to reconnect to the
>>> server within 1 s +/- 0.2 s. The client monitors the channel status
>>> using (async) NotifyOnStateChange with a timeout of 2 seconds and
>>> sends the stream requests as soon as the channel is up.
>>>
>>> I've separated the client and server implementations into 2 separate
>>> processes to ensure there is no interference whatsoever.
>>>
>>> *The issue:*
>>> Sometimes, the server seems to block all events in the 'stream
>>> CompletionQueue' (thread 2) while blocked in the request thread (1).
>>> More specifically: thread 2 is blocked until grpc::CompletionQueue::Next
>>> is called in thread 1. I've deliberately added a long sleep just before
>>> calling cq1->Next in thread 1 to check whether the issue still
>>> reproduces, and it does. The printf in thread 2 just after cq2->Next
>>> returns is not triggered until the sleep finishes.
>>>
>>> While sleeping, multiple stream connection attempts arrive from the
>>> client (supposedly on the second CompletionQueue). I verified this by
>>> capturing the TCP stream.
>>> These events arrive directly after the request message. As soon as
>>> Next is called in thread 1, these connection attempts in thread 2 are
>>> handled immediately.
>>>
>>> For me it reproduces about every 5-10 cycles. Is there a proper way to
>>> debug this behavior in gRPC? Which verbosity flags should I enable? Am
>>> I making any incorrect assumptions about multiple CompletionQueues?

--
To view this discussion on the web visit https://groups.google.com/d/msgid/grpc-io/c7ab69c8-d89e-4768-bbc3-c7110dfb7fa2%40googlegroups.com.
