Hey folks,

I’ve been using TVM RPC to test models on remote devices (like a Raspberry Pi), 
and it’s great for single-stream debugging and benchmarking.
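
For context, this is roughly the single-stream flow I’m using today (host, port, library name, and input name below are just placeholders from my setup):

```python
import numpy as np
import tvm
from tvm import rpc
from tvm.contrib import graph_executor

# Connect to the RPC server running on the Pi (address/port are placeholders)
remote = rpc.connect("192.168.1.42", 9090)

# Upload the cross-compiled library and load it on the device
remote.upload("net.tar")
rlib = remote.load_module("net.tar")

dev = remote.cpu(0)
gmod = graph_executor.GraphModule(rlib["default"](dev))

# One request at a time: set input, run, read output
x = np.random.rand(1, 3, 224, 224).astype("float32")
gmod.set_input("data", tvm.nd.array(x, dev))
gmod.run()
out = gmod.get_output(0).numpy()

# time_evaluator gives nice latency numbers, but only for a single stream
ftimer = gmod.module.time_evaluator("run", dev, number=10, repeat=3)
print("mean latency: %.2f ms" % (ftimer().mean * 1000))
```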

However, I hit a wall when trying to simulate a realistic server scenario. 
Imagine an inference server that needs to handle multiple concurrent requests 
(e.g., for an LLM API). The current RPC server seems to process requests 
sequentially in a single thread/queue. This makes it impossible to saturate the 
device’s compute (CPU/GPU cores) or measure true throughput under load.

My question is: Has anyone else run into this? Is there a known pattern or 
workaround to use TVM RPC for concurrent, multi-client inference on a single 
remote device?

Some thoughts:

1. Would launching multiple RPC server processes on different ports and 
connecting to them via a client-side pool be the right approach? (Rough 
sketch of what I mean after this list.) It feels a bit hacky.
2. Or is the intended production path to skip RPC entirely and embed the TVM 
runtime directly into a concurrent server (FastAPI/gRPC) on the remote device 
itself? (Second sketch below.)
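
To make option 1 concrete, here’s the kind of client-side pool I have in mind. It assumes I start one RPC server per port on the device; the host, ports, module name, and input name are all made up for illustration:

```python
# On the device, start one RPC server per port, e.g.:
#   python -m tvm.exec.rpc_server --host 0.0.0.0 --port 9090
#   python -m tvm.exec.rpc_server --host 0.0.0.0 --port 9091
#   ...
from concurrent.futures import ThreadPoolExecutor
from queue import Queue

import numpy as np
import tvm
from tvm import rpc
from tvm.contrib import graph_executor

HOST = "192.168.1.42"              # placeholder
PORTS = [9090, 9091, 9092, 9093]   # one RPC server process per port

def make_session(port):
    remote = rpc.connect(HOST, port)
    remote.upload("net.tar")
    rlib = remote.load_module("net.tar")
    dev = remote.cpu(0)
    return graph_executor.GraphModule(rlib["default"](dev)), dev

# One session per server; hand them out through a thread-safe pool so each
# in-flight request has exclusive use of a session.
sessions = Queue()
for port in PORTS:
    sessions.put(make_session(port))

def infer(x):
    gmod, dev = sessions.get()              # borrow a session
    try:
        gmod.set_input("data", tvm.nd.array(x, dev))
        gmod.run()
        return gmod.get_output(0).numpy()
    finally:
        sessions.put((gmod, dev))           # return it to the pool

with ThreadPoolExecutor(max_workers=len(PORTS)) as executor:
    requests = [np.random.rand(1, 3, 224, 224).astype("float32") for _ in range(16)]
    outputs = list(executor.map(infer, requests))
```

This does give me N independent runtimes on the device, but memory is duplicated per server process and there’s no shared batching or scheduling across them, which is why it feels hacky to me.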
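
And for option 2, I picture something like this running directly on the device, with the natively compiled library loaded once and a small pool of graph executors. Again just a sketch: the module path, input name, and pool size are placeholders, and whether this fights with TVM’s own thread pool is exactly the kind of thing I’m unsure about:

```python
# app.py on the device, served with e.g.:  uvicorn app:app --host 0.0.0.0 --port 8000
from queue import Queue

import numpy as np
import tvm
from fastapi import FastAPI
from tvm.contrib import graph_executor

lib = tvm.runtime.load_module("net.so")  # compiled natively for the device

# Each GraphModule keeps its own input/output buffers, so give every
# in-flight request exclusive use of one instance via a small pool.
NUM_RUNTIMES = 4  # rough guess; tuning this against TVM's thread pool is an open question
runtimes = Queue()
for _ in range(NUM_RUNTIMES):
    dev = tvm.cpu(0)
    runtimes.put((graph_executor.GraphModule(lib["default"](dev)), dev))

app = FastAPI()

@app.post("/infer")
def infer(payload: list[float]):
    # FastAPI runs sync endpoints in a worker thread pool, so requests overlap here
    x = np.asarray(payload, dtype="float32").reshape(1, -1)
    gmod, dev = runtimes.get()
    try:
        gmod.set_input("data", tvm.nd.array(x, dev))
        gmod.run()
        return {"output": gmod.get_output(0).numpy().tolist()}
    finally:
        runtimes.put((gmod, dev))
```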

I’m curious about the community’s experience and any plans to make the RPC 
layer more “server-friendly” in the future. Thanks!




