Late last year, I added support for vLLM in RunInference. I was able to go
from prototyping to checked-in code quickly enough that I never put
together or shared a full design, but in retrospect I thought it would be
helpful to have a record of what I did, since others might want to do
similar things with other serving systems (e.g. NIM, Triton). While writing
that document, I found myself adding a lot of context about how memory
management works in Beam ML, so I decided to include that in the document
as well.

The end result is
https://docs.google.com/document/d/1UB4umrtnp1Eg45fiUB3iLS7kPK3BE6pcf0YRDkA289Q/edit?usp=sharing
with
the following goals:

   - Describe how model sharing works in Beam today
   - Describe how we used those primitives to build out the vLLM Model Handler
   - Describe how others can add similar model handlers
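For anyone who wants a quick feel for the first goal before opening the doc: the core idea behind Beam's model sharing is that a worker loads a model once and hands the same instance to every concurrent bundle, rather than loading a copy per thread. Below is a minimal, stdlib-only sketch of that pattern; the names (`SharedModelHandle`, `acquire`) are illustrative, not Beam's actual API (Beam's real implementations live in `apache_beam.utils.shared` and `apache_beam.utils.multi_process_shared`).

```python
import threading

class SharedModelHandle:
    """Illustrative sketch of per-process model sharing: load once,
    return the same instance to every caller."""

    def __init__(self, loader):
        self._loader = loader          # zero-arg callable that loads the model
        self._lock = threading.Lock()
        self._model = None
        self.load_count = 0            # tracks how many real loads happened

    def acquire(self):
        # Double-checked locking so concurrent bundles don't each load a copy.
        if self._model is None:
            with self._lock:
                if self._model is None:
                    self._model = self._loader()
                    self.load_count += 1
        return self._model

# Usage: four "bundles" asking for the model still trigger exactly one load.
handle = SharedModelHandle(loader=lambda: {"weights": [0.1, 0.2]})
models = [handle.acquire() for _ in range(4)]
assert all(m is models[0] for m in models)
assert handle.load_count == 1
```

The doc goes further than this sketch, covering sharing across processes (important for large models like the ones vLLM serves, where even one copy per process is too expensive).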

I hope this helps someone :) If you're interested, please give it a read,
and let me know if you have any questions, feedback, or ideas on how we can
keep improving our memory management story. I'll also add the document to
https://cwiki.apache.org/confluence/display/BEAM/Design+Documents so it can
serve as a general reference on this topic going forward.

Thanks,
Danny
