Hey folks,

I’ve been using TVM RPC to test models on remote devices (like a Raspberry Pi), 
and it’s great for single-stream debugging and benchmarking.
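
For context, this is roughly the single-stream flow I’m using today (host, port, library name, and input name below are just placeholders from my setup):

```python
import numpy as np
import tvm
from tvm import rpc
from tvm.contrib import graph_executor

# Connect to the RPC server running on the Pi (address/port are placeholders)
remote = rpc.connect("192.168.1.42", 9090)

# Upload the cross-compiled library and load it on the device
remote.upload("net.tar")
rlib = remote.load_module("net.tar")

dev = remote.cpu(0)
gmod = graph_executor.GraphModule(rlib["default"](dev))

# One request at a time: set input, run, read output
x = np.random.rand(1, 3, 224, 224).astype("float32")
gmod.set_input("data", tvm.nd.array(x, dev))
gmod.run()
out = gmod.get_output(0).numpy()

# time_evaluator gives nice latency numbers, but only for a single stream
ftimer = gmod.module.time_evaluator("run", dev, number=10, repeat=3)
print("mean latency: %.2f ms" % (ftimer().mean * 1000))
```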

However, I hit a wall when trying to simulate a realistic server scenario. 
Imagine an inference server that needs to handle multiple concurrent requests 
(e.g., for an LLM API). The current RPC server seems to process requests 
sequentially in a single thread/queue. This makes it impossible to saturate the 
device’s compute (CPU/GPU cores) or measure true throughput under load.

My question is: Has anyone else run into this? Is there a known pattern or 
workaround to use TVM RPC for concurrent, multi-client inference on a single 
remote device?

Some thoughts:

1. Would launching multiple RPC server processes on different ports and 
connecting to them via a client-side pool be the right approach? (Rough 
sketch of what I mean after this list.) It feels a bit hacky.
2. Or is the intended production path to skip RPC entirely and embed the TVM 
runtime directly into a concurrent server (FastAPI/gRPC) on the remote device 
itself? (Second sketch below.)
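
To make option 1 concrete, here’s the kind of client-side pool I have in mind. It assumes I start one RPC server per port on the device; the host, ports, module name, and input name are all made up for illustration:

```python
# On the device, start one RPC server per port, e.g.:
#   python -m tvm.exec.rpc_server --host 0.0.0.0 --port 9090
#   python -m tvm.exec.rpc_server --host 0.0.0.0 --port 9091
#   ...
from concurrent.futures import ThreadPoolExecutor
from queue import Queue

import numpy as np
import tvm
from tvm import rpc
from tvm.contrib import graph_executor

HOST = "192.168.1.42"              # placeholder
PORTS = [9090, 9091, 9092, 9093]   # one RPC server process per port

def make_session(port):
    remote = rpc.connect(HOST, port)
    remote.upload("net.tar")
    rlib = remote.load_module("net.tar")
    dev = remote.cpu(0)
    return graph_executor.GraphModule(rlib["default"](dev)), dev

# One session per server; hand them out through a thread-safe pool so each
# in-flight request has exclusive use of a session.
sessions = Queue()
for port in PORTS:
    sessions.put(make_session(port))

def infer(x):
    gmod, dev = sessions.get()              # borrow a session
    try:
        gmod.set_input("data", tvm.nd.array(x, dev))
        gmod.run()
        return gmod.get_output(0).numpy()
    finally:
        sessions.put((gmod, dev))           # return it to the pool

with ThreadPoolExecutor(max_workers=len(PORTS)) as executor:
    requests = [np.random.rand(1, 3, 224, 224).astype("float32") for _ in range(16)]
    outputs = list(executor.map(infer, requests))
```

This does give me N independent runtimes on the device, but memory is duplicated per server process and there’s no shared batching or scheduling across them, which is why it feels hacky to me.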
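
And for option 2, I picture something like this running directly on the device, with the natively compiled library loaded once and a small pool of graph executors. Again just a sketch: the module path, input name, and pool size are placeholders, and whether this fights with TVM’s own thread pool is exactly the kind of thing I’m unsure about:

```python
# app.py on the device, served with e.g.:  uvicorn app:app --host 0.0.0.0 --port 8000
from queue import Queue

import numpy as np
import tvm
from fastapi import FastAPI
from tvm.contrib import graph_executor

lib = tvm.runtime.load_module("net.so")  # compiled natively for the device

# Each GraphModule keeps its own input/output buffers, so give every
# in-flight request exclusive use of one instance via a small pool.
NUM_RUNTIMES = 4  # rough guess; tuning this against TVM's thread pool is an open question
runtimes = Queue()
for _ in range(NUM_RUNTIMES):
    dev = tvm.cpu(0)
    runtimes.put((graph_executor.GraphModule(lib["default"](dev)), dev))

app = FastAPI()

@app.post("/infer")
def infer(payload: list[float]):
    # FastAPI runs sync endpoints in a worker thread pool, so requests overlap here
    x = np.asarray(payload, dtype="float32").reshape(1, -1)
    gmod, dev = runtimes.get()
    try:
        gmod.set_input("data", tvm.nd.array(x, dev))
        gmod.run()
        return {"output": gmod.get_output(0).numpy().tolist()}
    finally:
        runtimes.put((gmod, dev))
```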

I’m curious about the community’s experience and any plans to make the RPC 
layer more “server-friendly” in the future. Thanks!




