Late last year, I added support for vLLM in RunInference. I was able to go from prototyping to checked-in code quickly enough that I didn't put together/share a full design, but in retrospect I thought it might be helpful to have a record of what I did, since others might want to do similar things with other serving systems (e.g. NIM, Triton, etc.). In the process of writing that document, I ended up adding a lot of context about how memory management works in Beam ML, so I decided to include that in the document as well.
The end result is https://docs.google.com/document/d/1UB4umrtnp1Eg45fiUB3iLS7kPK3BE6pcf0YRDkA289Q/edit?usp=sharing with the following goals:

- Describe how model sharing works in Beam today
- Describe how we used those primitives to build out the vLLM Model Handler
- Describe how others can add similar model handlers

I hope this helps someone :) If you're interested, please give it a read, and please let me know if you have any questions/feedback/ideas on how we can keep improving our memory management story. I'll be adding this to https://cwiki.apache.org/confluence/display/BEAM/Design+Documents to serve as a general reference on this topic moving forward.

Thanks,
Danny