Hail gRPC experts (;D),

I'm trying to build an image/video object detection server (as one of the 
reusable pieces in a benchmark suite) with low RTT requirements 
(near real-time, say ~60-90ms RTT)...
I've used gRPC and protobuf (built from git master; hashes below in case 
that is relevant) for the serialization and transport.
_________________________________
grpc: 
commit dbc1e27e2e1a81b61eb064eb036ec6a267f88cb6
Merge: 9bc6cd1 5d24ab9
Author: Jiangtao Li <email redacted by me>
Date:   Fri Jul 20 17:00:18 2018 -0700 

protobuf:
commit b5fbb742af122b565925987e65c08957739976a7
Author: Bo Yang <email redacted by me>
Date:   Mon Mar 5 19:54:18 2018 -0800
_________________________________

gRPC seems to add an insane amount of overhead -- ~160ms (~2x the server's 
processing time)!
For now I'm running on a single machine (a pretty beefy machine, so 
contention isn't an issue...) operating over localhost (loopback).
The amount of data being transferred is considerable, but not unheard of 
(~4MiB per request).

Server-side timing measurements:
doDetection: new request 0x7ffc77f16920
0x7ffc77f16920: GPU processing took 24.045 milliseconds
0x7ffc77f16920: Server took *72.206 milliseconds*

Client-side measurements:
10 objects detected.
This request took *234.825 milliseconds*

*Client RTT - Server processing time = 234.825 - 72.206 = 162.619ms (!??!)*
I've pinned the server and client to separate cores using taskset.
There isn't anything else running on the server and it's a beefy 48 core 
(Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz) machine with ample RAM 
(128GiB), etc....

As a start, I instrumented the implementation of the synchronous call 
in include/grpcpp/impl/codegen/client_unary_call.h:
BlockingUnaryCallImpl(ChannelInterface* channel, const RpcMethod& method,
                      ClientContext* context, const InputMessage& request,
                      OutputMessage* result)

and found that the vast majority of the time is spent spinning on a 
completion queue:
line 107:   if (cq.Pluck(&ops)) {

I wonder if I need to configure gRPC differently (perhaps the default 
configuration is geared more towards latency-insensitive batching?)...
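Concretely, I was thinking of something along these lines on the client side (a sketch only; I haven't verified that these knobs help -- the message-size limits are needed anyway for ~4MiB payloads, and GRPC_ARG_HTTP2_BDP_PROBE is my guess at a latency-relevant setting):

```cpp
#include <grpcpp/grpcpp.h>

// Sketch: create the channel with explicit arguments instead of defaults.
std::shared_ptr<grpc::Channel> MakeTunedChannel(const std::string& target) {
  grpc::ChannelArguments args;
  args.SetMaxReceiveMessageSize(16 * 1024 * 1024);     // room for ~4MiB replies
  args.SetMaxSendMessageSize(16 * 1024 * 1024);        // room for ~4MiB requests
  args.SetInt(GRPC_ARG_HTTP2_BDP_PROBE, 1);            // dynamic flow-control window
  return grpc::CreateCustomChannel(
      target, grpc::InsecureChannelCredentials(), args);
}
```

Is this the right direction, or are the defaults already sane for this workload?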

Any help understanding these numbers would be appreciated.
Server code: 
https://github.com/aakshintala/darknet/blob/master/server/server.cpp
Client code: 
https://github.com/aakshintala/darknet/blob/master/server/client.cpp
Proto file: 
https://github.com/aakshintala/darknet/blob/master/server/darknetserver.proto

Thanks in advance,
Amogh Akshintala
aakshintala.com
