Hey folks,
I’ve been using TVM RPC to test models on remote devices (like a Raspberry Pi), and it’s great for single-stream debugging and benchmarking. However, I hit a wall when trying to simulate a realistic server scenario.

Imagine an inference server that needs to handle multiple concurrent requests (e.g., for an LLM API). The current RPC server seems to process requests sequentially in a single thread/queue, which makes it impossible to saturate the device’s compute (CPU/GPU cores) or measure true throughput under load.

My question: has anyone else run into this? Is there a known pattern or workaround for using TVM RPC for concurrent, multi-client inference on a single remote device?

Some thoughts:

1. Would launching multiple RPC server processes on different ports and connecting to them via a client-side pool be the right approach? It feels a bit hacky (rough sketch of what I mean at the end of this post).
2. Or is the intended production path to skip RPC entirely and embed the TVM runtime directly into a concurrent server (FastAPI/gRPC) on the remote device itself?

I’m curious about the community’s experience and any plans to make the RPC layer more “server-friendly” in the future.

Thanks!
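
For concreteness, here’s roughly the client-side pool I had in mind for option 1. This is an untested sketch: it assumes two RPC server processes are already running on the Pi (`python -m tvm.exec.rpc_server --host 0.0.0.0 --port 909X`), that the compiled module was exported as `net.tar`, and that the graph input is named `"data"` — those names are just placeholders for my setup.

```python
# Untested sketch: one RPC session per worker thread, fed from a shared queue.
# Assumes one rpc_server process per port is already running on the Pi, and
# that "net.tar" was produced by lib.export_library("net.tar") on the host.
import queue
import threading

import numpy as np
import tvm
from tvm import rpc
from tvm.contrib import graph_executor

PI_HOST = "192.168.1.42"      # placeholder address of the Pi
PORTS = [9090, 9091]          # one RPC server process per port
LIB_PATH = "net.tar"          # compiled module exported on the host

requests_q = queue.Queue()    # holds (request_id, input array)
results = {}                  # request_id -> output array
results_lock = threading.Lock()


def worker(port):
    # Each worker owns its own session and executor, so no RPC state is
    # shared across threads.
    sess = rpc.connect(PI_HOST, port)
    sess.upload(LIB_PATH)
    rlib = sess.load_module("net.tar")
    dev = sess.cpu(0)
    mod = graph_executor.GraphModule(rlib["default"](dev))

    while True:
        item = requests_q.get()
        if item is None:      # shutdown sentinel
            break
        req_id, data = item
        mod.set_input("data", tvm.nd.array(data, dev))  # "data" is my input name
        mod.run()
        out = mod.get_output(0).numpy()
        with results_lock:
            results[req_id] = out
        requests_q.task_done()


threads = [threading.Thread(target=worker, args=(p,), daemon=True) for p in PORTS]
for t in threads:
    t.start()

# Enqueue a burst of fake requests to measure throughput under load.
for i in range(32):
    requests_q.put((i, np.random.rand(1, 3, 224, 224).astype("float32")))
requests_q.join()

for _ in threads:
    requests_q.put(None)
```

Each worker keeps its own RPCSession and GraphModule, so nothing RPC-related is shared across threads. It seems to work for throughput experiments, but hand-managing N server processes per device is exactly the part that feels hacky, hence the question.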
