GitHub user da-daken added a comment to the discussion: Parallel Tool Call Execution
Thank you for your detailed feedback.

**1. Recovery state machine**

You mentioned that the proposal introduces separate PENDING slots and per-tool reconcilers within the same batch, while the current state machine advances only one cursor position at a time, which could cause conflicts.

My understanding is that when a tool fetches cached results, execution is still on the mailbox thread. We will reserve a slot for each tool ahead of time, then submit the tasks to the thread pool. After all executions complete, we merge the results back into the pre-reserved slots based on `functionId` and arguments. This way, recovery can still reconstruct the batch in the order returned by the LLM, and the design remains compatible with the existing slot mechanism.

The new `executeAllAsync` API will lift the PENDING-recording logic out of the current condition (`callable.reconciler() != null`), so that batch-execution state is recorded consistently.

---

**2. Failure semantics**

In this version, we keep the failure behavior of `durableExecuteAllAsync` consistent with the existing `durableExecuteAsync` (i.e., single-call failure semantics).

---

**3. Side effects and duplicate calls**

You rightly pointed out that parallel execution increases the number of in-flight external tool calls. If a failover happens while a parallel batch is running, several external calls may have been submitted but not yet persisted, which raises the risk of duplicate tool invocations after recovery.

I agree. We plan to document this side effect clearly in the configuration guide for enabling parallel tool execution, and to remind users to make their reconcilers idempotent or otherwise handle deduplication, so that recovery does not cause correctness issues.

---

**4. Concurrency limits**

You noted that `num-async-threads` is global, and that a `ToolRequestEvent` with many tool calls could exhaust the pool and affect other keys/operations. I agree with this concern.
We will introduce a separately configurable thread pool specifically for tool execution, with configuration options to limit concurrency per batch or globally, at least in the first version, rather than relying solely on the global async pool.

---

**5. Trace / event visibility**

You suggested that we should leave a clear path for per-tool-call events in the future, since parallel execution makes visibility into each tool's start/end/status/latency more important for debugging slow or failing batches.

Our design can record each tool's start time during `buildCallables` in `processToolRequest`, and after all tools finish, merge end times and statuses into a single event containing detailed execution metrics per tool. The current architecture therefore already has extension points for tool-level observability and won't block future monitoring or debugging enhancements.

---

Thanks again for your thoughtful questions. They are very helpful for refining the design. Happy to discuss further if you have additional comments.

GitHub link: https://github.com/apache/flink-agents/discussions/855#discussioncomment-17495361
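
To make points 1 and 4 concrete, here is a minimal, hypothetical Java sketch of the reserve-then-merge flow running on a dedicated, bounded tool pool. It is not the Flink Agents implementation: `ParallelToolBatchSketch`, `ToolCall`, `executeAll`, and `maxConcurrency` are illustrative names, the calling thread stands in for the mailbox thread, and the slot index stands in for the match on `functionId` and arguments described above. Failure handling (point 2) is not modeled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Supplier;

/** Hypothetical sketch; none of these names come from Flink Agents. */
final class ParallelToolBatchSketch {

    /** Marker for a slot that is reserved but has no result yet. */
    static final String PENDING = "PENDING";

    /** One tool call in the order returned by the LLM (assumed shape). */
    static final class ToolCall {
        final String functionId;
        final String arguments;
        final Supplier<String> body;

        ToolCall(String functionId, String arguments, Supplier<String> body) {
            this.functionId = functionId;
            this.arguments = arguments;
            this.body = body;
        }
    }

    /** Runs all calls in parallel and returns results in the original order. */
    static List<String> executeAll(List<ToolCall> calls, int maxConcurrency) {
        // Step 1: on the calling thread, reserve one PENDING slot per call.
        List<String> slots =
                new ArrayList<>(Collections.nCopies(calls.size(), PENDING));

        // Step 2: submit to a pool dedicated to tools, capped per batch,
        // so a large batch cannot exhaust a shared global pool.
        ExecutorService toolPool = Executors.newFixedThreadPool(maxConcurrency);
        try {
            List<CompletableFuture<String>> futures = new ArrayList<>();
            for (ToolCall call : calls) {
                futures.add(CompletableFuture.supplyAsync(call.body, toolPool));
            }

            // Step 3: wait for the whole batch. A failed call would surface
            // here as a CompletionException.
            CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();

            // Step 4: back on the calling thread, merge each result into its
            // pre-reserved slot. Slot order matches the LLM's order, no matter
            // which call finished first.
            for (int i = 0; i < slots.size(); i++) {
                slots.set(i, futures.get(i).join());
            }
        } finally {
            toolPool.shutdown();
        }
        return slots;
    }

    public static void main(String[] args) {
        List<String> results = executeAll(Arrays.asList(
                new ToolCall("search", "{\"q\":\"flink\"}", () -> "search-result"),
                new ToolCall("fetch", "{\"id\":2}", () -> "fetch-result")), 2);
        System.out.println(results);
    }
}
```

Because results are written only after the whole batch completes and only by the calling thread, the slots need no extra synchronization in this sketch. A real implementation would also need to persist the PENDING state before submitting, which is omitted here.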
