[D] Q2 roadmap [tooling-gofannon]

via GitHub Tue, 12 May 2026 15:07:44 -0700


GitHub user andrewmusselman created a discussion: Q2 roadmap


## Already shipped

1. **Rename sandbox → runs** — frontend rename, history list, form pre-fill, 
re-run button. Component files renamed; URL `/agent/:agentId/sandbox` → 
`/agent/:agentId/runs` with new nested `/agent/:agentId/runs/:runId`. Backend 
identifiers (`RunCodeRequest`, `sandbox_run` log strings, `sandbox_agent` 
placeholder) intentionally kept. ✓ merged to main.

2. **Runs page reads from saved agent doc, not stale agentFlowContext** — the 
silent-staleness bug that caused max_tokens overrides to vanish, output schema 
to collapse to `{outputText: string}`, and list/json input fields to drop. 
Always-refetch when agentId is in URL. ✓ merged to main.

3. **DB perf — bulk ops, optimistic writes, deferred access tracking** — 
`save_many`/`delete_many`/`get_many` on the backend (CouchDB uses `_bulk_docs` 
and `_all_docs?keys=`). Service-layer rewrites of `set_many` (2 round trips 
regardless of N), `clear_namespace` (2 round trips), `set()` optimistic with 
conflict retry. Access tracking deferred to `AccessAccumulator` background 
flush every 10s. 21 new tests; `data_store_service.py` to 91% coverage. ✓ 
merged to main.

4. **Per-model llm_settings** — `LlmSettings` carries a `perModel` map keyed by 
`<provider>/<model>` so a Sonnet call gets Sonnet's overrides instead of 
`invokableModels[0]`'s. ✓ merged to main.

5. **ModelConfigDialog UX** — `max_tokens` as numeric TextField (was an 
unusably-precise Slider over 1..128000); inline Alert for mutex conflicts 
(temperature/top_p) with one-click resolution buttons; cleared mutex partners 
stay cleared across dialog re-open. ✓ merged to main.

6. **Log redaction** — common credential shapes (PATs, 
OpenAI/Anthropic/AWS/Google/Slack/Stripe/JWT, Authorization headers, BEGIN 
PRIVATE KEY markers, generic api_key=/secret=/password= patterns) stripped from 
log/trace events. 61 tests with synthesized fixtures (no literal tokens on disk 
to satisfy GitHub's secret scanner). ✓ merged to main.
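
As an illustration of the pattern-based approach in item 6, here is a minimal sketch covering only two of the listed shapes (generic `api_key=`/`secret=`/`password=` pairs and `Authorization` headers). The function name `redact` and both regular expressions are assumptions made for this example, not the merged implementation.

```python
import re

# Hypothetical patterns for illustration; the shipped redactor covers many more shapes.
_GENERIC_PAIR = re.compile(r"(?i)\b(api_key|secret|password)=[^\s&]+")
_AUTH_HEADER = re.compile(r"(?i)(authorization:\s*(?:bearer|basic)\s+)\S+")


def redact(text: str) -> str:
    """Replace credential-shaped substrings with a fixed marker."""
    text = _GENERIC_PAIR.sub(r"\1=[REDACTED]", text)
    text = _AUTH_HEADER.sub(r"\1[REDACTED]", text)
    return text
```

Keeping the key name and dropping only the value leaves log lines searchable while removing the sensitive part.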

## Open backlog — to file as issues

### Issue: Multi-tenant non-blocking runtime (run registry + persistent run state)

**Problem.** Today an agent runs inside the request thread. The streaming SSE 
response holds the connection open for the run's full duration. Two 
consequences:

- The browser tab is the run. Refresh / navigate / close the tab and the SSE 
connection drops; the run actually keeps executing on the backend (asyncio.Task 
isn't cancelled on disconnect — the `event_generator` `finally` block awaits 
the task), but the user has no way to see results when they come back.
- `uvicorn --reload` kills in-flight runs. We've felt this in practice.

These are also the things that block "true multi-tenancy" — multiple users 
running multiple agents at the same time and all seeing their work survive a 
tab close, a disconnect, or a brief deploy. The multi-process worker model 
isn't a prerequisite (single-process is fine for now); what's missing is the 
run state being independent of the request that started it.

**Proposed solution.** Run registry at the user-service level. A dict keyed by 
`run_id`, value is a `RunRecord`:

```python
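from __future__ import annotations  # annotations stay unevaluated strings

import asyncio
from dataclasses import dataclass
from datetime import datetime
from typing import Literal, Optional

# Trace and CancelToken are project types defined elsewhere.
@dataclass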
class RunRecord:
    run_id: str
    user_id: str
    agent_name: str
    started_at: datetime
    status: Literal["running", "success", "error", "stopped"]
    trace: Trace
    queue: asyncio.Queue
    task: asyncio.Task            # the agent's execution
    cancel_token: CancelToken     # for the stop button
    result: Optional[dict]
    error: Optional[str]
    completed_at: Optional[datetime]
    schema_warnings: Optional[list]
    ops_log: Optional[list]
```

Two storage tiers: an in-memory dict for active and recently completed runs 
(last hour), and CouchDB, where a run is persisted on completion (when the 
in-memory entry is evicted). Completed runs survive a uvicorn restart; runs in 
flight at restart still die (full durability would need a worker-process model, 
which is a separate, bigger refactor).
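
A minimal sketch of the two-tier lookup, under stated assumptions: `RunRegistry`, its method names, and the `persist`/`load` callbacks (stand-ins for the CouchDB layer) are hypothetical, and records are plain dicts rather than `RunRecord`. It takes the "persist when evicted" reading of the paragraph above.

```python
import time
from typing import Callable, Dict, List, Optional

RETENTION_SECONDS = 3600  # "last hour" from the proposal


class RunRegistry:
    """In-memory tier that hands completed runs to a persistent tier on eviction."""

    def __init__(self,
                 persist: Callable[[str, dict], None],
                 load: Callable[[str], Optional[dict]],
                 clock: Callable[[], float] = time.monotonic):
        self._active: Dict[str, dict] = {}
        self._persist = persist   # stand-in for a CouchDB write
        self._load = load         # stand-in for a CouchDB read
        self._clock = clock

    def add(self, run_id: str, record: dict) -> None:
        self._active[run_id] = record

    def complete(self, run_id: str, status: str) -> None:
        record = self._active[run_id]
        record["status"] = status
        record["completed_at"] = self._clock()

    def evict_expired(self) -> List[str]:
        """Persist and drop completed runs older than the retention window."""
        now = self._clock()
        expired = [rid for rid, rec in self._active.items()
                   if rec.get("completed_at") is not None
                   and now - rec["completed_at"] >= RETENTION_SECONDS]
        for rid in expired:
            self._persist(rid, self._active.pop(rid))
        return expired

    def get(self, run_id: str) -> Optional[dict]:
        """Check memory first, then fall back to the persistent tier."""
        record = self._active.get(run_id)
        return record if record is not None else self._load(run_id)
```

Under this reading, a run that completed less than an hour before a restart would be lost unless it is also written at completion time, which may be worth settling when the issue is filed.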

**New endpoints:**

```
POST /agents/run-code/start        # NEW: kick off a run, return {run_id} immediately
GET  /runs                         # NEW: list current user's runs
GET  /runs/{run_id}                # NEW: fetch run state
GET  /runs/{run_id}/stream         # NEW: subscribe; if running, live SSE + replay;
                                   #      if complete, bulk replay then close
POST /runs/{run_id}/stop           # NEW: set cancel token; respond 202
DELETE /runs/{run_id}              # NEW: clear from in-memory registry
POST /agents/run-code/stream       # Existing: kept compatible; first frame now
                                   #           includes run_id
```

**Frontend touches.** A `useRun(run_id)` hook subscribing to 
`/runs/{run_id}/stream` and exposing `{status, events, result, error, 
isStreaming}`. Used by the runs page now and by the home-page module later.

**Effort.** ~400 LOC backend + ~350 LOC frontend.

**Auth.** Each run is owned by the user who started it. `GET /runs/{run_id}` 
rejects with 403 if `request.user.uid != run.user_id`. Standard scope check; 
this is what makes multi-tenancy safe.
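
A framework-free sketch of the ownership check, assuming hypothetical names throughout (`Forbidden`, `NotFound`, `RunSummary`, `get_owned_run`, `list_runs_for`). The real handlers would presumably raise the web framework's own HTTP errors, and the same check would apply to the stream, stop, and delete endpoints.

```python
from dataclasses import dataclass
from typing import Dict, List


class Forbidden(Exception):
    """Stand-in for a 403 response."""
    status_code = 403


class NotFound(Exception):
    """Stand-in for a 404 response."""
    status_code = 404


@dataclass
class RunSummary:
    run_id: str
    user_id: str
    status: str


def get_owned_run(registry: Dict[str, RunSummary], run_id: str,
                  request_uid: str) -> RunSummary:
    """Return the run only if the requesting user started it."""
    run = registry.get(run_id)
    if run is None:
        raise NotFound(run_id)
    if run.user_id != request_uid:
        raise Forbidden(run_id)
    return run


def list_runs_for(registry: Dict[str, RunSummary],
                  request_uid: str) -> List[RunSummary]:
    """GET /runs equivalent: filter by owner, never return other users' runs."""
    return [run for run in registry.values() if run.user_id == request_uid]
```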

**Open questions.**
- Persistence of completed runs: CouchDB-backed in the same database as 
everything else? (Recommended.) Or in-memory only, accept they're gone on 
restart? (Honest about today's reality.)
- Single-process assumption: scaling to multiple replicas means the in-memory 
registry becomes per-replica, and a request to `/runs/{run_id}/stream` could 
land on the wrong replica. Fix is sticky sessions or a shared store (Redis). 
Worth knowing now even if not fixing now.

---

### Issue: Runs page subscribes by run_id (not just for the current request)

**Problem.** The runs page today fires `POST /agents/run-code/stream` and reads 
the SSE stream inline. If you navigate away, the stream drops; if you come 
back, you get nothing. The history list shows past runs only via in-memory 
state, so it disappears on refresh.

**Proposed solution.** Refactor the runs page to:

1. Click Run → `POST /agents/run-code/start`, get `{run_id}`.
2. `useRun(run_id)` mounts and subscribes to `/runs/{run_id}/stream`.
3. On navigation away and back, `useRun(run_id)` reconnects to the same 
registry entry — replay events that fired during disconnect, then transition to 
live.
4. `/agent/:agentId/runs/:runId` deep-links to an existing run (live or 
complete).
5. History list reads from `GET /runs?agent_id=X` so it persists across 
sessions.

**Depends on:** run registry above.

**Open question.** Reconnect semantics: when a client reconnects, does it get a 
full replay of events emitted before connection? Recommended yes — replay all 
`RunRecord.trace.events`, then transition to live streaming. Keeps UX simple.
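
The replay-then-live behaviour can be sketched as below. This is an illustration under assumptions: `RunStream` and its methods are hypothetical, and it uses one queue per subscriber (rather than the single `queue` field on `RunRecord`) so that several tabs can follow the same run.

```python
import asyncio
from typing import AsyncIterator, List, Optional


class RunStream:
    """Per-run event log with fan-out to any number of subscribers."""

    def __init__(self) -> None:
        self.events: List[dict] = []
        self.done = False
        self._subscribers: List[asyncio.Queue] = []

    def publish(self, event: dict) -> None:
        self.events.append(event)
        for queue in self._subscribers:
            queue.put_nowait(event)

    def close(self) -> None:
        self.done = True
        for queue in self._subscribers:
            queue.put_nowait(None)  # sentinel: run finished

    async def subscribe(self) -> AsyncIterator[dict]:
        # Snapshot the backlog and register for live events with no await in
        # between, so nothing is missed or delivered twice.
        backlog = list(self.events)
        live: Optional[asyncio.Queue] = None
        if not self.done:
            live = asyncio.Queue()
            self._subscribers.append(live)
        try:
            for event in backlog:          # replay phase
                yield event
            while live is not None:        # live phase
                event = await live.get()
                if event is None:
                    break
                yield event
        finally:
            if live is not None:
                self._subscribers.remove(live)
```

A subscriber that connects after the run has finished gets the bulk replay and then the iterator ends, which matches the "if complete, bulk replay then close" endpoint note.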

---

### Issue: Home page Running Jobs module

**Problem.** A user running multiple agents has no cross-agent view of "what's 
running right now" or "what just finished". To check on something started 
elsewhere they have to navigate to that specific agent's runs page.

**Proposed solution.** New module on HomePage above (or beside) the agents 
list. Lists the user's runs across all agents — status chip, agent name, 
started-at, run_id. Click → `/agent/:agentId/runs/:runId`. Auto-refresh by 
polling `GET /runs` every few seconds. Live status updates by re-fetching, not 
subscribing to SSE per row (overkill).

**Depends on:** run registry, runs page using run_id.

---

### Issue: Stop button

**Problem.** Long-running agents sometimes need to be interrupted. The earlier 
"Cancel" button only aborted the client-side fetch; the server-side task kept 
running until completion. We want the run to actually stop.

**Proposed solution.** Cooperative cancellation, layered:

- **CancelToken via contextvar** (same pattern as Trace). The agent runtime 
checks `should_stop()` between operations.
- **Enforcement at structural boundaries.** Wrap `tools.call_llm`, 
`tools.data_store.*`, and `gofannon_client.call` with an entry check — if 
stopping, raise `AgentStopped` immediately without executing. In-flight LLM 
calls finish naturally; only the next attempt to do anything observable raises.

This avoids `task.cancel()` raising `CancelledError` mid-await inside the LLM 
call's HTTP request — too aggressive.
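
A minimal sketch of the layered approach described above. `CancelToken`, `AgentStopped`, and `should_stop()` come from the proposal; the decorator `stop_checked`, the contextvar name, and the stand-in `call_llm` are assumptions for illustration only.

```python
import asyncio
import contextvars
import functools


class AgentStopped(Exception):
    """Raised at a tool boundary once a stop has been requested."""


class CancelToken:
    def __init__(self) -> None:
        self._stopped = False

    def stop(self) -> None:
        self._stopped = True

    def should_stop(self) -> bool:
        return self._stopped


_current_token = contextvars.ContextVar("cancel_token", default=None)


def stop_checked(fn):
    """Entry check only: a call already in flight is left to finish."""
    @functools.wraps(fn)
    async def wrapper(*args, **kwargs):
        token = _current_token.get()
        if token is not None and token.should_stop():
            raise AgentStopped(fn.__name__)
        return await fn(*args, **kwargs)
    return wrapper


@stop_checked
async def call_llm(prompt: str) -> str:
    await asyncio.sleep(0)  # stand-in for the real HTTP request
    return f"echo:{prompt}"


async def run_agent(token: CancelToken, prompts: list) -> list:
    _current_token.set(token)  # visible to this task and tasks it spawns
    return [await call_llm(p) for p in prompts]
```

Because tasks created from within `run_agent` copy the current context, a chained agent started inside the run sees the same token without any explicit plumbing.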

**UI.** Stop button next to Run; disabled when no run in flight. While stopping 
(after click, before halt), button shows "Stopping… (after current LLM call 
completes)" disabled. Run's outcome becomes a third state `stopped`, neutral 
chip color in the Progress Log (gray with a stop icon, not red).

**Depends on:** run registry (for the cancel token to live somewhere 
addressable by `POST /runs/{run_id}/stop`).

**Open questions.**
- Stop semantics for chained agents: when agent X is stopping and X has called 
Y, does Y stop too? Recommended yes — stop means the whole tree. Contextvar 
makes this trivial.
- UI feedback during "Stopping…" — could take 30+ seconds in the worst case 
(waiting for LLM response). UI should be honest: "Stopping after current LLM 
call completes…" rather than promising instant.

---

### Issue: Per-agent environment variables

**Problem.** Agents have tunable knobs (e.g. `GITHUB_PUSH_CONCURRENCY` in the 
ASVS pipeline) that today have to be hardcoded, threaded through `inputText`, 
or set at the host level (invisible coupling). None of those is right.

**Proposed solution.** Two halves:

- **Editor accordion + persistence.** New `EnvVarsAccordion.jsx` between Data 
Store Configuration and Agent Code in the agent editor. Three columns: Key / 
Value / Description. POSIX-style key validation. Persists as `env_vars: 
List[AgentEnvVar]` on the Agent model.
- **Runtime injection via contextvar-bound environ proxy.** Mutating 
`os.environ` directly under a lock would serialize all agent runs. Instead 
install an `_EnvironProxy` wrapping `os.environ` that consults a contextvar for 
the per-task overlay. Each run sets the contextvar before invoking the agent 
function. asyncio tasks inherit contextvar context, so concurrent runs see 
different overlays without locking.
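
The overlay lookup can be sketched as follows. This is a read-only illustration under assumptions: `EnvironProxy` and `_env_overlay` are hypothetical names, and the sketch deliberately does not replace `os.environ`, which is the part the open questions below leave undecided.

```python
import asyncio
import contextvars
import os
from collections.abc import Mapping

# Per-task overlay; None means "no agent env vars set in this context".
_env_overlay = contextvars.ContextVar("env_overlay", default=None)


class EnvironProxy(Mapping):
    """Read-through view: overlay values win, the base environment is untouched."""

    def __init__(self, base: Mapping) -> None:
        self._base = base

    def _merged(self) -> dict:
        return {**self._base, **(_env_overlay.get() or {})}

    def __getitem__(self, key: str) -> str:
        overlay = _env_overlay.get() or {}
        if key in overlay:
            return overlay[key]
        return self._base[key]

    def __iter__(self):
        return iter(self._merged())

    def __len__(self) -> int:
        return len(self._merged())


env = EnvironProxy(os.environ)


async def run_agent(env_vars: dict) -> str:
    _env_overlay.set(dict(env_vars))   # scoped to this task's context
    await asyncio.sleep(0)             # let the other run interleave
    return env["GITHUB_PUSH_CONCURRENCY"]


async def main() -> list:
    return await asyncio.gather(
        run_agent({"GITHUB_PUSH_CONCURRENCY": "2"}),
        run_agent({"GITHUB_PUSH_CONCURRENCY": "8"}),
    )
```

`asyncio.gather` wraps each coroutine in its own task with a copied context, so the two runs interleave yet each reads its own value and no lock is involved.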

**What this is NOT.** Not for secrets — values are plaintext on the agent doc 
and visible in trace events. The user-profile API Keys feature handles secrets.

**Effort.** ~120 LOC backend + ~280 LOC frontend + ~80 LOC tests.

**Open questions.**
- Subclass `os._Environ` vs. monkey-patch `os.environ.__class__.__getitem__`. 
Both are hacks. Subclass is more explicit; monkey-patch is shorter. Recommend 
subclass.
- UI without runtime is misleading (saves values, agents don't see them). 
Recommend shipping both halves at once.

---

### Issue: Composer-LLM ignores multi-key output schema

**Problem.** When a user declares an output schema with multiple keys, the 
composer LLM frequently generates code that returns `{outputText: ...}` only, 
ignoring the declared schema. The validator at 
`dependencies.py:validate_output_against_schema` correctly flags the mismatch 
as a warning ("missing required output keys" + "unexpected output key 
outputText"), but the agent code itself was generated wrong, so every run 
produces noise.

This is layered on top of shipped item 2 above, which fixed the upstream bug 
where the runs page was sending the *default* `{outputText: "string"}` schema 
as if it were the user's declared schema. With item 2 fixed, the user's actual 
schema does flow into the composer prompt, but the LLM still ignores it.

**Proposed solution.** Iterate on the composer prompt at 
`agent_factory/prompts.py`. The current prompt (around 
`output_schema=output_schema_str`, line 140 area) lists the schema in the 
instructions but the LLM doesn't reliably comply.

Three angles, increasing effort:

- Strengthen the prompt — try few-shot examples of correct vs incorrect 
returns, more emphatic phrasing, place the schema in the user message instead 
of the system prompt for higher signal.
- Validate generated code post-hoc — parse the AST, check the `return` 
statement matches the schema keys; if not, regenerate or surface an error 
before saving.
- Generate the return statement directly from the schema — emit a `return 
{key1: ..., key2: ...}` skeleton the LLM fills in, rather than asking for a 
free-form return.
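
For the post-hoc validation angle, a minimal sketch using the standard `ast` module. The function names are hypothetical, and the check only sees `return` statements whose value is a dict literal with string keys; a return of a variable or a computed dict would go undetected and need a different strategy.

```python
import ast
from typing import Iterable, List, Set


def returned_dict_keys(source: str) -> List[Set[str]]:
    """Collect the literal string keys of every `return {...}` in the source."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Return) and isinstance(node.value, ast.Dict):
            keys = {k.value for k in node.value.keys
                    if isinstance(k, ast.Constant) and isinstance(k.value, str)}
            found.append(keys)
    return found


def schema_mismatches(source: str, schema_keys: Iterable[str]) -> List[Set[str]]:
    """Return the key sets that differ from the declared schema (empty = compliant)."""
    expected = set(schema_keys)
    return [keys for keys in returned_dict_keys(source) if keys != expected]
```

A non-empty result could trigger a regeneration or surface an error before the agent is saved.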

Recommended starting point: prompt iteration. The validator already exists; 
tightening compliance from the prompt side is the smallest change.

---

### Issue (filed): DynamoDB backend bulk APIs

Already filed alongside the DB perf commit. The default loop fallback works, 
but DynamoDB users don't get the bulk-API wins until `BatchWriteItem` (25/req 
chunking) and `BatchGetItem` (100/req chunking) with `UnprocessedItems` retry 
are implemented in `services/database_service/dynamodb.py`. References commit 
`3d88b234c06d2c977f8ca4bb043b529d21f4e9d1`.
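
A sketch of the chunk-and-retry loop for the write side, assuming a client object that exposes `batch_write_item(RequestItems=...)` and returns `UnprocessedItems` in the shape of the boto3 low-level DynamoDB client. The helper names, attempt count, and backoff values are assumptions, not what the filed issue specifies.

```python
import time
from typing import Any, Dict, Iterator, List

BATCH_WRITE_LIMIT = 25  # requests per BatchWriteItem call


def chunked(items: List[Any], size: int) -> Iterator[List[Any]]:
    for start in range(0, len(items), size):
        yield items[start:start + size]


def batch_write_all(client, table: str, requests: List[Dict[str, Any]],
                    max_attempts: int = 5, base_delay: float = 0.05) -> None:
    """Write all requests in chunks of 25, resubmitting UnprocessedItems."""
    for chunk in chunked(requests, BATCH_WRITE_LIMIT):
        pending = {table: chunk}
        for attempt in range(max_attempts):
            response = client.batch_write_item(RequestItems=pending)
            pending = response.get("UnprocessedItems") or {}
            if not pending:
                break
            time.sleep(base_delay * (2 ** attempt))  # exponential backoff
        else:
            # Loop ended without a break: items are still unprocessed.
            raise RuntimeError(f"unprocessed items remain for table {table!r}")
```

The read side would follow the same structure with a chunk size of 100 and `UnprocessedKeys` in place of `UnprocessedItems`.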

---

### Roadmap (separate plan): headless invocation auth

Filed earlier as a follow-up, not as a single issue. The deployed-agent 
endpoint `/rest/<friendly_name>` and the run-code endpoints all gate on 
`get_current_user`, which only knows how to read the `gofannon_sid` cookie. CI 
runners can't get that cookie without the user-driven login flow. Three 
candidate directions:

- Per-agent API keys (smallest scope, most natural CI integration, ~200 LOC).
- Per-user service-account tokens (more flexible, larger blast radius if 
leaked).
- GitHub OIDC (cleanest secret-handling, GHA-specific).

Each is a real feature, not a quick patch. Token security has long-tail 
concerns (rotation, scoping, per-key rate limiting, revocation, audit logging) 
that are easy to underspec. This work sequences naturally after the run 
registry: without async start + polling, headless agent calls of any meaningful 
duration are unworkable.

Recommend filing per-agent API keys (the first option above) first, since it is 
the most tightly scoped and unblocks the immediate ask.

---

## Suggested filing order

The first three open backlog items (run registry, runs page subscribes, home 
page module) are best filed as one tracking issue with the others as sub-tasks, 
since the user-visible payoff of multi-tenancy doesn't land until all three are 
in. Stop button and env vars are independent and can be standalone issues. 
Composer-LLM schema compliance is also independent and should be its own issue. 
Headless auth is its own RFC-style document, not a single issue.

GitHub link: https://github.com/apache/tooling-gofannon/discussions/6
