I am designing a neural network inference server, and I have built the server and client using a synchronous gRPC model with a unary RPC design. For reference, the protobuf formats are based on those of the NVIDIA Triton Inference Server (https://github.com/NVIDIA/triton-inference-server). My design expects the server to receive a large batch of inputs (16384 inputs, about 1 MB in total) in a single request, run inference, and return the result to the client. The inputs are sent in a repeated bytes field in my protobuf.

However, even if I make my server-side function simply return an OK status (no actual processing), the server can only handle roughly 1500-2000 of these batched requests per second. Both server and client run on the same machine, so network limitations should not be a factor. Meanwhile, I know my inference processing itself can sustain throughputs closer to 10000 batches per second.
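For illustration, here is a minimal sketch of the kind of no-op handler I am describing. This is a simplified Python stand-in rather than my actual code; the module, service, and message names are placeholders, not the exact names generated from the Triton-derived protos:

    import grpc
    from concurrent import futures

    # Placeholder imports standing in for the Triton-derived generated code;
    # the real module, service, and message names come from the actual protos.
    import inference_pb2
    import inference_pb2_grpc

    class InferenceServicer(inference_pb2_grpc.InferenceServiceServicer):
        def Infer(self, request, context):
            # The request carries the 16384 inputs in a repeated bytes field.
            # This handler ignores them entirely and returns an empty
            # response; a normal return is reported to the client as OK.
            return inference_pb2.InferResponse()

    def serve():
        server = grpc.server(futures.ThreadPoolExecutor(max_workers=8))
        inference_pb2_grpc.add_InferenceServiceServicer_to_server(
            InferenceServicer(), server)
        server.add_insecure_port("[::]:8001")
        server.start()
        server.wait_for_termination()

    if __name__ == "__main__":
        serve()

Even with the handler reduced to this, the measured throughput stays in the ~1500-2000 requests/second range.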
Is there an inherent limitation to the number of requests a gRPC server can handle per second? Is there a server setting or design change I can make to increase this maximum throughput? I am happy to provide more information if it can help in understanding my issue.

Thanks for your help,
-Dylan
