damccorm opened a new pull request, #39074:
URL: https://github.com/apache/beam/pull/39074

   ## Problem
   Currently, `ADKAgentModelHandler` only supports deploying and calling remote 
model endpoints. This limits its usability in scenarios where the user wants to 
deploy a local model (e.g., using a local PyTorch or HuggingFace model) on the 
Beam workers.
   
   ## Solution
   This PR extends `ADKAgentModelHandler` to allow deploying a local model 
serving process instead of relying on a remote model endpoint.
   
   Key changes:
   1.  **Generic SubProcess Wrapper**: Introduced `SubProcessModel` (inherits 
from `SubprocessModelHandler`) in `apache_beam/ml/inference/base.py`. This 
wrapper can wrap *any* standard `ModelHandler` and serve it via a FastAPI 
server running in a subprocess.
   2.  **FastAPI Subprocess Server**: Added `subprocess_server.py` which runs 
in the subprocess. It exposes:
       *   OpenAI-compatible `/v1/chat/completions` and `/v1/completions` 
endpoints (for compatibility with ADK's `LiteLlm` client).
       *   A transparent pickle-based binary `/v1/beam/inference` endpoint to 
preserve exact input/output types when used directly in Beam pipelines.
   3.  **ADK Integration**: Updated `ADKAgentModelHandler` to:
       *   Accept an `underlying_model_handler` (which must be a 
`SubprocessModelHandler`, such as `SubProcessModel` or 
`VLLMCompletionsModelHandler`).
       *   Manage the lifecycle of the local server process.
       *   Propagate the local model configuration (endpoint base URL with 
dynamically selected port) to the root agent, subagents, and tools.
       *   Handle dynamic port updates if the server restarts.
       *   Implement recovery/connectivity checks on inference failures.
   4.  **Refactoring**: Refactored `vllm_inference.py` to use a new generic 
`_VLLMBaseModelHandler` to eliminate code duplication between completions and 
chat handlers.
   5.  **Robustness & Safety**:
       *   Improved temporary directory lifecycle management to prevent 
premature cleanup by GC.
       *   Bound the subprocess server to `127.0.0.1` instead of `0.0.0.0` for 
security.
       *   Improved error propagation and logging for both client and server 
sides.
   
   ## Verification
   *   Added comprehensive unit tests in `base_test.py` (`SubProcessModelTest`) 
and `agent_development_kit_test.py` (`TestLocalModelIntegration`).
   *   Verified all tests pass.
   
   TAG=agy
   CONV=b2f450cf-8b63-49a9-8ee8-430ed676e116
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to