Hey folks,

While I've been waiting on cherry-picks, I put together a proposal for allowing users to load more than one copy of a model in RunInference.
Basically, today, RunInference allows loading models in three different ways (rough code sketches of these are in the P.S. below):

1. (Default) Load one copy of the model per worker process.
2. Load a single copy of the model per VM (even if there are multiple worker processes) by using the share_model_across_processes lifecycle method on your model handler.
3. Use a KeyedModelHandler to load/offload multiple different models, with the max_models_per_worker_hint parameter to limit total memory consumption.

This document proposes an additional option building on top of option 2: instead of loading a single copy of the model per VM, we will allow users to configure the number of model copies loaded per VM.

Here's the design doc, please take a look! https://docs.google.com/document/d/1FmKrBHkb8YTYz_Dcec7JlTqXwy382ar8Gxicr_s13c0/edit?usp=sharing

Thanks,
Danny
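
P.S. For anyone who hasn't played with these options before, here's a rough sketch of what the three existing approaches look like in the Python SDK. It uses SklearnModelHandlerNumpy purely as an illustrative handler with placeholder model paths, and the KeyModelMapping/argument names are from my reading of the current API, so please double-check against the docs rather than copy-pasting.

    import apache_beam as beam
    import numpy as np
    from apache_beam.ml.inference.base import (KeyedModelHandler, KeyModelMapping,
                                               RunInference)
    from apache_beam.ml.inference.sklearn_inference import SklearnModelHandlerNumpy

    # Option 1 (default): each worker process loads its own copy of the model.
    per_process_handler = SklearnModelHandlerNumpy(model_uri='gs://my-bucket/model.pkl')

    # Option 2: a single shared copy per VM. Overriding the
    # share_model_across_processes() lifecycle method opts the handler into sharing.
    class SharedModelHandler(SklearnModelHandlerNumpy):
        def share_model_across_processes(self) -> bool:
            return True

    shared_handler = SharedModelHandler(model_uri='gs://my-bucket/model.pkl')

    # Option 3: multiple different models behind one RunInference, keyed by model id,
    # with max_models_per_worker_hint bounding how many are held in memory at once.
    keyed_handler = KeyedModelHandler(
        [
            KeyModelMapping(['model_a'],
                            SklearnModelHandlerNumpy(model_uri='gs://my-bucket/a.pkl')),
            KeyModelMapping(['model_b'],
                            SklearnModelHandlerNumpy(model_uri='gs://my-bucket/b.pkl')),
        ],
        max_models_per_worker_hint=1)

    # Minimal pipeline using the shared (option 2) handler on placeholder examples.
    with beam.Pipeline() as pipeline:
        _ = (
            pipeline
            | beam.Create([np.array([1.0, 2.0]), np.array([3.0, 4.0])])
            | RunInference(shared_handler))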