Hey folks, while I've been waiting on cherry-picks, I put together a
proposal to let users load more than one copy of a model in
RunInference.

Basically, today RunInference allows loading models in 3 different ways
(a quick code sketch of each follows the list):
1. (Default) Load one copy of the model per worker process.
2. Load a single copy of your model per VM (even if there are multiple
worker processes) by overriding the share_model_across_processes
lifecycle method on your model handler.
3. Use a KeyedModelHandler to load/offload multiple different models,
with the max_models_per_worker_hint parameter limiting total memory
consumption.
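
To make those options concrete, here's a rough sketch of each (the
sklearn handler and the gs:// paths are placeholders I picked for
illustration, not anything from the proposal):

import apache_beam as beam
import numpy as np
from apache_beam.ml.inference.base import (
    KeyModelMapping, KeyedModelHandler, RunInference)
from apache_beam.ml.inference.sklearn_inference import SklearnModelHandlerNumpy

# Option 1 (default): every worker process loads its own copy of the model.
per_process_handler = SklearnModelHandlerNumpy(
    model_uri='gs://my-bucket/model_a.pkl')

# Option 2: one copy per VM, shared across worker processes, by overriding
# the share_model_across_processes lifecycle method on the model handler.
class SharedSklearnHandler(SklearnModelHandlerNumpy):
  def share_model_across_processes(self) -> bool:
    return True

shared_handler = SharedSklearnHandler(model_uri='gs://my-bucket/model_a.pkl')

# Option 3: a KeyedModelHandler that routes keys to different models and
# uses max_models_per_worker_hint to cap how many are held in memory at once.
keyed_handler = KeyedModelHandler(
    [
        KeyModelMapping(
            ['model_a'],
            SklearnModelHandlerNumpy(model_uri='gs://my-bucket/model_a.pkl')),
        KeyModelMapping(
            ['model_b'],
            SklearnModelHandlerNumpy(model_uri='gs://my-bucket/model_b.pkl')),
    ],
    max_models_per_worker_hint=1)

# All three handlers plug into the same transform; keyed inputs for option 3:
with beam.Pipeline() as p:
  _ = (
      p
      | beam.Create([('model_a', np.array([1.0, 2.0])),
                     ('model_b', np.array([3.0, 4.0]))])
      | RunInference(keyed_handler))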

The doc linked below proposes an additional option that builds on
option 2: instead of loading a single copy of the model per VM, users
will be able to configure the number of model copies loaded per VM.
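
To give a feel for what that could look like, here's a purely
hypothetical sketch; the model_copies name and the override style are
just my illustration, the actual interface is spelled out in the doc:

from apache_beam.ml.inference.sklearn_inference import SklearnModelHandlerNumpy

# Hypothetical only: model_copies is an illustrative name, not an agreed API.
class MultiCopySklearnHandler(SklearnModelHandlerNumpy):
  def share_model_across_processes(self) -> bool:
    # Still share loaded models across worker processes on the VM
    # (option 2 behavior)...
    return True

  def model_copies(self) -> int:
    # ...but ask the runner to keep 4 copies loaded per VM so several
    # worker processes can run inference concurrently.
    return 4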

Here's the design doc, please take a look!
https://docs.google.com/document/d/1FmKrBHkb8YTYz_Dcec7JlTqXwy382ar8Gxicr_s13c0/edit?usp=sharing

Thanks,
Danny
