joeyutong opened a new issue, #859:
URL: https://github.com/apache/flink-agents/issues/859

   ### Search before asking
   
   - [x] I searched in the 
[issues](https://github.com/apache/flink-agents/issues) and found nothing 
similar.
   
   ### Description
   
   Action-scoped metrics can be recorded under the wrong action when a cached 
resource is reused across actions.
   
   Flink Agents injects the current action metric group into a resource when 
`RunnerContext.get_resource(...)` / `RunnerContext.getResource(...)` returns 
it. However, resources are cached and shared. If action A gets a chat model 
resource, then yields or waits across an async/durable boundary, action B can 
later get the same cached resource and overwrite its metric group with B's 
action scope. When action A resumes and records token metrics by reading the 
metric group from the resource field, those metrics may be recorded under B's 
action scope.
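
The rebinding described above can be sketched with plain Python objects. This is a simplified model, not the actual Flink Agents code: `MetricGroup`, `ChatModel`, and this `RunnerContext` are hypothetical stand-ins, and the counter name `promptTokens` is made up for illustration.

```python
# Hypothetical stand-ins for illustration; not the real Flink Agents classes.
class MetricGroup:
    """Holds counters for a single action scope."""

    def __init__(self, action_name):
        self.action_name = action_name
        self.counters = {}

    def inc(self, name, value):
        self.counters[name] = self.counters.get(name, 0) + value


class ChatModel:
    """A cached resource with one mutable metric-group field."""

    def __init__(self):
        self.metric_group = None


class RunnerContext:
    _cache = {}  # resource cache shared by every action

    def __init__(self, action_name):
        self.metric_group = MetricGroup(action_name)

    def get_resource(self, name):
        if name not in RunnerContext._cache:
            RunnerContext._cache[name] = ChatModel()
        resource = RunnerContext._cache[name]
        # Rebinding happens on every lookup, even for an already cached object.
        resource.metric_group = self.metric_group
        return resource


ctx_a = RunnerContext("action_a")
ctx_b = RunnerContext("action_b")

model_seen_by_a = ctx_a.get_resource("chat_model")  # A binds its group, then waits
model_seen_by_b = ctx_b.get_resource("chat_model")  # B gets the same object and rebinds it

# A resumes and records tokens by reading the field on the shared resource.
model_seen_by_a.metric_group.inc("promptTokens", 12)

print(model_seen_by_a is model_seen_by_b)  # True
print(ctx_a.metric_group.counters)         # {}
print(ctx_b.metric_group.counters)         # {'promptTokens': 12}
```

In this model, A's tokens end up in B's group because both actions hold the same object and the last lookup wins.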
   
   This can affect any code path that records metrics after the request 
returns, rather than at the moment the action obtains the resource. For example:
   
   - Python chat token metrics after `durable_execute` / 
`durable_execute_async`.
   - Java chat token metrics after the chat response is returned.
   - Shared cached resources used by multiple actions.
   - Cross-language wrappers and provider resources, where the wrapper and 
underlying provider may have separate metric group state.
   
   The expected behavior is that token metrics are recorded under the action 
scope that initiated the request. Delayed metric recording should not read the 
metric group from mutable state on a cached resource, because another action 
may have rebound it in the meantime.
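
One possible way to get the expected behavior is to capture the metric group when the request starts and record against that snapshot later. The sketch below only illustrates the idea and is not a proposed patch: `PendingRequest`, `CachedChatModel`, and the counter names are hypothetical.

```python
# Hypothetical stand-ins; names and shapes are assumptions, not the project's API.
class MetricGroup:
    def __init__(self, scope):
        self.scope = scope
        self.counters = {}

    def inc(self, name, value):
        self.counters[name] = self.counters.get(name, 0) + value


class CachedChatModel:
    def __init__(self):
        # Mutable field, rebound by whichever action looked the resource up last.
        self.metric_group = None


class PendingRequest:
    """Carries the metric group captured when the request was started."""

    def __init__(self, model):
        self.metric_group = model.metric_group  # snapshot of the initiating scope

    def record_token_metrics(self, prompt_tokens, completion_tokens):
        # Uses the snapshot, never the field on the shared resource.
        self.metric_group.inc("promptTokens", prompt_tokens)
        self.metric_group.inc("completionTokens", completion_tokens)


group_a, group_b = MetricGroup("action_a"), MetricGroup("action_b")
model = CachedChatModel()

model.metric_group = group_a         # action A obtains the resource
request = PendingRequest(model)      # A starts its request and captures its scope
model.metric_group = group_b         # action B rebinds the shared resource
request.record_token_metrics(12, 5)  # A resumes and records

print(group_a.counters)  # {'promptTokens': 12, 'completionTokens': 5}
print(group_b.counters)  # {}
```

The design choice here is that the scope travels with the request instead of living on the cached resource, so a later rebind cannot change where A's metrics go.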
   
   ### How to reproduce
   
   A minimal reproduction looks like this:
   
   1. Define two actions that share the same chat model resource.
   2. Let action A obtain the chat model resource and start a chat request.
   3. Before action A records token metrics, let action B obtain the same 
cached chat model resource, causing the resource metric group to be rebound to 
B's action scope.
   4. Resume action A and record token metrics from the chat response.
   5. Observe that the token counters can be registered under B's action metric 
scope instead of A's.
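
The five steps above can be simulated with `asyncio`, using an `asyncio.Event` to force action B to run while action A's request is still in flight. All names here (`SharedChatModel`, `get_resource`, `totalTokens`) are hypothetical, and the real async/durable boundary in Flink Agents is assumed to behave like the `await` in this sketch.

```python
import asyncio


class MetricGroup:
    def __init__(self, scope):
        self.scope = scope
        self.counters = {}

    def inc(self, name, value):
        self.counters[name] = self.counters.get(name, 0) + value


class SharedChatModel:
    def __init__(self):
        self.metric_group = None


RESOURCE = SharedChatModel()  # the single cached instance (step 1)


def get_resource(action_group):
    # Mimics the lookup that binds the caller's metric group to the cached resource.
    RESOURCE.metric_group = action_group
    return RESOURCE


async def action_a(group_a, b_has_rebound):
    model = get_resource(group_a)              # step 2: A obtains the resource
    await b_has_rebound.wait()                 # A's request is in flight
    model.metric_group.inc("totalTokens", 30)  # step 4: A records after resuming


async def action_b(group_b, b_has_rebound):
    get_resource(group_b)                      # step 3: B rebinds the resource
    b_has_rebound.set()


async def main():
    group_a, group_b = MetricGroup("action_a"), MetricGroup("action_b")
    b_has_rebound = asyncio.Event()
    await asyncio.gather(
        action_a(group_a, b_has_rebound),
        action_b(group_b, b_has_rebound),
    )
    return group_a, group_b


group_a, group_b = asyncio.run(main())
print(group_a.counters)  # {}
print(group_b.counters)  # {'totalTokens': 30}  (step 5)
```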
   
   A unit-level reproduction can simulate the same condition by:
   
   1. Creating a setup whose connection wraps metric groups with provider 
dimensions.
   2. Binding the setup to action A's metric group.
   3. Rebinding the same setup/resource to action B's metric group before 
recording token metrics.
   4. Recording token metrics through the setup.
   5. Verifying that the counter lands in the most recently bound metric group 
(action B's) rather than in action A's group, which initiated the request.
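
A unit-level check along these lines might look like the sketch below. It assumes a setup whose connection wraps each bound metric group in a provider sub-group; `Connection`, `ChatModelSetup`, `sub_group`, and the provider name `example-provider` are all hypothetical and do not mirror the project's real classes.

```python
class MetricGroup:
    def __init__(self, path):
        self.path = path
        self.counters = {}
        self._children = {}

    def sub_group(self, key, value):
        # Returns a stable child group for a (key, value) dimension.
        child_key = (key, value)
        if child_key not in self._children:
            self._children[child_key] = MetricGroup(self.path + (f"{key}={value}",))
        return self._children[child_key]

    def inc(self, name, value):
        self.counters[name] = self.counters.get(name, 0) + value


class Connection:
    provider = "example-provider"  # made-up provider dimension

    def wrap(self, group):
        return group.sub_group("provider", self.provider)


class ChatModelSetup:
    def __init__(self, connection):
        self.connection = connection
        self.metric_group = None

    def bind_metric_group(self, group):
        # Step 1: the connection wraps the group with provider dimensions.
        self.metric_group = self.connection.wrap(group)

    def record_token_metrics(self, prompt_tokens, completion_tokens):
        self.metric_group.inc("promptTokens", prompt_tokens)
        self.metric_group.inc("completionTokens", completion_tokens)


group_a = MetricGroup(("action_a",))
group_b = MetricGroup(("action_b",))
setup = ChatModelSetup(Connection())

setup.bind_metric_group(group_a)   # step 2: bind to action A
setup.bind_metric_group(group_b)   # step 3: rebind to action B before recording
setup.record_token_metrics(12, 5)  # step 4: record on behalf of action A

provider_a = group_a.sub_group("provider", "example-provider")
provider_b = group_b.sub_group("provider", "example-provider")
print(provider_a.counters)  # {}
print(provider_b.counters)  # {'promptTokens': 12, 'completionTokens': 5}
```

In this sketch the last two lines correspond to step 5: the counters appear under B's provider sub-group, and A's stays empty.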
   
   ### Version and environment
   
   Observed in the current main-branch code paths:
   
   - Python `FlinkRunnerContext.get_resource(...)` binds the current action 
metric group to the cached resource.
   - Java `RunnerContextImpl.getResource(...)` binds the current action metric 
group to the cached resource.
   - Python and Java chat token metrics are recorded after the chat response 
returns and read the metric group from the chat model resource.
   
   This is independent of a specific deployment mode; it is a resource 
lifecycle / metric binding issue.
   
   ### Are you willing to submit a PR?
   
   - [ ] I'm willing to submit a PR!
   

