I am designing a neural network inference server, and I have built my server 
and client using a synchronous gRPC model with a unary RPC design. For 
reference, the protobuf formats are based on the NVIDIA Triton Inference 
Server formats: https://github.com/NVIDIA/triton-inference-server. My design 
expects a large batch of inputs (16384 inputs, about 1 MB in total) to be 
received by the server, the inference to be run, and then the result to be 
returned to the client. I send these inputs in a repeated bytes field in my 
protobuf. However, even if I make my server-side function simply return an 
OK status (no actual processing), I find that the server can only handle 
~1500-2000 batches of inputs per second (both server and client run on the 
same machine, so network limitations should not be relevant). For comparison, 
I know that my inference processing can handle throughputs closer to 
10000 batches/second.
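
To make the setup concrete, here is a rough Python sketch of what my benchmark 
does. My actual code may differ; the module, service, RPC, and field names 
below (inference_pb2, GRPCService, Infer, raw_input) are placeholders standing 
in for my generated code, not the exact Triton definitions.

# No-op benchmark sketch: a unary RPC server that immediately returns OK,
# and a client that sends ~1 MB requests in a loop and measures throughput.
import sys
import time
from concurrent import futures

import grpc

import inference_pb2        # placeholder generated module
import inference_pb2_grpc   # placeholder generated module


class NoOpServicer(inference_pb2_grpc.GRPCServiceServicer):
    def Infer(self, request, context):
        # No actual inference: return an empty response with an OK status.
        return inference_pb2.InferResponse()


def run_server():
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=8))
    inference_pb2_grpc.add_GRPCServiceServicer_to_server(NoOpServicer(), server)
    server.add_insecure_port("localhost:8001")
    server.start()
    server.wait_for_termination()


def run_client(num_requests=10000):
    channel = grpc.insecure_channel("localhost:8001")
    stub = inference_pb2_grpc.GRPCServiceStub(channel)

    # One request carries 16384 inputs of 64 bytes each (~1 MB total) in a
    # repeated bytes field.
    request = inference_pb2.InferRequest()
    request.raw_input.extend([b"\x00" * 64] * 16384)

    start = time.time()
    for _ in range(num_requests):
        stub.Infer(request)
    elapsed = time.time() - start
    print(f"{num_requests / elapsed:.0f} requests/second")


if __name__ == "__main__":
    if sys.argv[1:] == ["server"]:
        run_server()
    else:
        run_client()

Running the server in one process and the client loop in another on the same 
machine is how I measured the ~1500-2000 requests/second figure above.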

Is there an inherent limitation to the number of requests that a gRPC 
server can handle per second? Is there a server setting or design change I 
can make to increase this maximum throughput?

I am happy to provide more information if it would help in understanding my 
issue.

Thanks for your help,

-Dylan
