GitHub user da-daken added a comment to the discussion: Parallel Tool Call 
Execution

Thank you for your detailed feedback.

**1. Recovery state machine**  
You mentioned that the proposal introduces separate PENDING slots and per-tool 
reconcilers within the same batch, while the current state machine advances 
only one cursor position at a time, which could cause conflicts.  

My understanding is that cached results are fetched on the mailbox thread. On 
that thread we reserve a slot for each tool call ahead of time, then submit the 
tasks to the thread pool. After all executions complete, we merge the results 
back into the pre‑reserved slots, matching on `functionId` and arguments. 
Recovery can therefore still reconstruct the batch in the order returned by the 
LLM, which keeps it compatible with the existing slot mechanism. The new 
`executeAllAsync` API will lift the PENDING‑recording logic out of the current 
condition (`callable.reconciler() != null`), so that batch‑execution state is 
recorded consistently.
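
To make the reserve / submit / merge flow concrete, here is a rough sketch in 
plain Java. It is not the actual flink-agents code: `SlotReservationSketch`, 
`ToolCall`, `Completed`, `Slot`, and `executeAll` are hypothetical names, the 
calling thread stands in for the mailbox thread, and the blocking `join()` is 
only there to keep the example short.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.function.Function;

// Illustrative only: none of these types are flink-agents classes.
final class SlotReservationSketch {

    /** A tool call as returned by the LLM. */
    record ToolCall(String functionId, String arguments) {}

    /** A finished execution, tagged with the call that produced it. */
    record Completed(ToolCall call, String result) {}

    /** A pre-reserved slot; result == null means the slot is still PENDING. */
    static final class Slot {
        final ToolCall call;
        String result;
        Slot(ToolCall call) { this.call = call; }
    }

    static List<String> executeAll(
            List<ToolCall> calls, Function<ToolCall, String> tool, ExecutorService pool) {
        // 1. On the calling thread (standing in for the mailbox thread),
        //    reserve one slot per call, in the order the LLM returned them.
        List<Slot> slots = new ArrayList<>();
        for (ToolCall call : calls) {
            slots.add(new Slot(call));
        }

        // 2. Submit every call to the pool; results arrive in completion order.
        Queue<Completed> done = new ConcurrentLinkedQueue<>();
        List<CompletableFuture<Void>> futures = new ArrayList<>();
        for (ToolCall call : calls) {
            futures.add(CompletableFuture.runAsync(
                    () -> done.add(new Completed(call, tool.apply(call))), pool));
        }
        CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();

        // 3. Back on the calling thread, merge each result into the first
        //    still-pending slot with the same functionId and arguments.
        for (Completed c : done) {
            for (Slot slot : slots) {
                if (slot.result == null && slot.call.equals(c.call())) {
                    slot.result = c.result();
                    break;
                }
            }
        }

        // The slot list preserves the LLM order regardless of completion order.
        List<String> ordered = new ArrayList<>();
        for (Slot slot : slots) {
            ordered.add(slot.result);
        }
        return ordered;
    }
}
```

Because slots are created before anything is submitted, the output order 
depends only on the LLM's order, even when a later call finishes first or two 
calls share the same `functionId` and arguments.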

---

**2. Failure semantics**  
In this version, we keep `durableExecuteAllAsync`'s failure behavior consistent 
with the existing `durableExecuteAsync` (i.e., single‑call failure semantics).

---

**3. Side effects and duplicate calls**  
You rightly pointed out that parallel execution increases the number of 
in‑flight external tool calls. During a failover of an ongoing parallel batch, 
there may be multiple submitted but not yet persisted external calls, raising 
the risk of duplicate tool invocations after recovery.  

I agree. We plan to clearly document this side effect in the configuration 
guide for enabling parallel tool execution, and remind users to ensure their 
reconcilers are idempotent or otherwise handle deduplication properly to avoid 
correctness issues on recovery.
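
For the documentation, a minimal illustration of what "handle deduplication" 
could look like on the user side is sketched below. Everything in it is an 
assumption for illustration: `IdempotentToolSketch`, `CallKey`, and `callOnce` 
are hypothetical names, and the in-memory map stands in for an external store 
that would have to survive the failover.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Illustrative only: a user-side wrapper, not a flink-agents API.
final class IdempotentToolSketch {

    /**
     * Idempotency key. A real key would likely also carry a batch or call id,
     * so that two intentionally identical calls are not merged into one.
     */
    record CallKey(String functionId, String arguments) {}

    // Stand-in for a durable external store; an in-memory map would be lost
    // on failover and is used here only to keep the sketch self-contained.
    private final Map<CallKey, String> completed = new ConcurrentHashMap<>();

    /** Runs the side effect at most once per key; replays return the stored result. */
    String callOnce(String functionId, String arguments, Supplier<String> sideEffect) {
        return completed.computeIfAbsent(
                new CallKey(functionId, arguments), key -> sideEffect.get());
    }
}
```

With a wrapper like this, a call that is replayed after recovery returns the 
previously stored result instead of triggering the external side effect again.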

---

**4. Concurrency limits**  
You noted that `num-async-threads` is global, and a `ToolRequestEvent` with 
many tool calls could exhaust the pool and affect other keys/operations.  

I agree with this concern. At least in the first version, we will introduce a 
separately configurable thread pool dedicated to tool execution, with options 
to limit concurrency per batch or globally, rather than relying solely on the 
global async pool.
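
One possible shape for the per-batch limit is sketched below, assuming a 
dedicated `ExecutorService` for tools. `ToolPoolSketch`, `runBatch`, and 
`maxPerBatch` are hypothetical names, not existing configuration options. 
Instead of submitting one task per tool, the batch starts at most `maxPerBatch` 
workers that pull the next tool index from a shared counter, so a large batch 
never occupies more than that many pool threads.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReferenceArray;
import java.util.function.Supplier;

// Illustrative only: not the planned flink-agents implementation.
final class ToolPoolSketch {

    /**
     * Runs all tools on the dedicated pool, using at most maxPerBatch threads
     * for this batch. Results are returned in the original tool order.
     */
    static <T> List<T> runBatch(
            List<Supplier<T>> tools, ExecutorService toolPool, int maxPerBatch) {
        if (maxPerBatch < 1) {
            throw new IllegalArgumentException("maxPerBatch must be >= 1");
        }
        int n = tools.size();
        AtomicReferenceArray<T> results = new AtomicReferenceArray<>(n);
        AtomicInteger next = new AtomicInteger();

        // Start min(maxPerBatch, n) workers; each claims tool indices in turn.
        int workers = Math.min(maxPerBatch, n);
        List<CompletableFuture<Void>> running = new ArrayList<>();
        for (int w = 0; w < workers; w++) {
            running.add(CompletableFuture.runAsync(() -> {
                int i;
                while ((i = next.getAndIncrement()) < n) {
                    results.set(i, tools.get(i).get());
                }
            }, toolPool));
        }
        // Blocking here only keeps the sketch short; a real API would be async.
        CompletableFuture.allOf(running.toArray(new CompletableFuture[0])).join();

        List<T> out = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            out.add(results.get(i));
        }
        return out;
    }
}
```

In this sketch the global limit is simply the size of the dedicated pool, while 
`maxPerBatch` bounds how much of that pool a single `ToolRequestEvent` can 
take.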

---

**5. Trace / event visibility**  
You suggested that we should leave a clear path for per‑tool‑call events in the 
future, as parallel execution makes visibility into each tool's 
start/end/status/latency more important for debugging slow or failing batches.  

Our design can record each tool's start time during `buildCallables` in 
`processToolRequest`, and after all tools finish, merge end times and statuses 
into a single event containing detailed execution metrics per tool. Thus, the 
current architecture already has extension points for tool‑level observability 
and won't block future monitoring or debugging enhancements.
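
As a rough sketch of that extension point (hypothetical names throughout: 
`ToolMetricsSketch`, `NamedTool`, `ToolMetric`, `BatchEvent`, and 
`runAndMeasure` are not existing classes), the start time is captured when each 
callable is built, and the end time and status are merged into one event after 
the whole batch finishes:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;

// Illustrative only: not the flink-agents event model.
final class ToolMetricsSketch {

    /** A tool to run, identified by its function id. */
    record NamedTool(String functionId, Runnable body) {}

    /** Per-tool timing and outcome. */
    record ToolMetric(String functionId, long startNanos, long endNanos, String status) {
        long latencyNanos() { return endNanos - startNanos; }
    }

    /** A single event carrying the metrics of every tool in the batch. */
    record BatchEvent(List<ToolMetric> perTool) {}

    static BatchEvent runAndMeasure(List<NamedTool> tools, ExecutorService pool) {
        List<CompletableFuture<ToolMetric>> futures = new ArrayList<>();
        for (NamedTool tool : tools) {
            // Start time is taken when the callable is built, so the measured
            // latency includes any time spent waiting in the pool's queue.
            long start = System.nanoTime();
            futures.add(CompletableFuture.supplyAsync(() -> {
                String status = "SUCCESS";
                try {
                    tool.body().run();
                } catch (RuntimeException e) {
                    status = "FAILED";
                }
                return new ToolMetric(tool.functionId(), start, System.nanoTime(), status);
            }, pool));
        }

        // After all tools finish, merge the per-tool metrics into one event,
        // keeping the original tool order.
        List<ToolMetric> merged = new ArrayList<>();
        for (CompletableFuture<ToolMetric> future : futures) {
            merged.add(future.join());
        }
        return new BatchEvent(merged);
    }
}
```

Splitting `BatchEvent` into separate per‑tool‑call events later would only 
change how the `ToolMetric` entries are emitted, not how they are collected.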

---

Thanks again for your thoughtful questions – they are very helpful for refining 
the design. Happy to discuss further if you have additional comments.

GitHub link: 
https://github.com/apache/flink-agents/discussions/855#discussioncomment-17495361
