Hi Dennis,

This proposal looks interesting, but I'm not sure I understand the purpose :) The doc and the PR give a lot of information about what happens, but almost nothing about "why" (at least I could not easily deduce that).
Could you expand your proposal a bit on that aspect? More specifically, what is the "quantitative A/B delta" exactly? How is it envisioned to be used?

Thanks,
Dmitri.

On Thu, May 21, 2026 at 5:13 AM Dennis Huo <[email protected]> wrote:

> Hi all,
>
> Now that agentic development is evolving to be a more fundamental and
> pervasive tool, I wanted to explore ways to address both a "need" and an
> "opportunity" under one umbrella - adding an agentic (meta-)skill to start
> codifying a way for us to bake in quantifiable metrics to the impact of
> "non-functional" changes on repository "health" (in terms of extensibility
> and maintainability).
>
> Basically, if we extrapolate from getting into the habit of formalizing our
> AGENTS.md files towards likely adding well-defined agent "skills" for
> repeatable agentic workflows, and those becoming more ingrained in the
> development process over time, the basic "need" is to standardize our evals
> against the addition of new skills and mdfile documentation, but also to
> recognize the opportunity of addressing three related types of
> nonfunctional changes:
>
> 1. Refactoring code - sometimes subjective, sometimes partially objective
>    (consolidating duplicate code), but the *effects* are rarely quantifiable
> 2. Adding documentation/code comments - Generally regarded as being good,
>    but sometimes verbosity can hurt, and certainly "incorrect" documentation
>    can hurt
> 3. Addition of agent skills or rules - possibly manually tested to some
>    extent when added, but usually not consistently and rarely with
>    reproducible evals
>
> To that end I put together this proposal doc with some lightweight design
> elements for this agentic skill:
>
> https://docs.google.com/document/d/1RE5mGcrMLbmi8sglkHuJKxORVNiuiZ69da1weqwpGjE/edit?tab=t.0
>
> Would love to discuss folks' thoughts here or in comments in the doc.
> Recapping the core concept from the doc:
>
> *Treat any candidate change as an intervention in a measurable A/B.
> Take a baseline ref and a candidate ref, run a fixed set of agent-driven
> sample tasks against both refs, collect a small number of metrics (success
> vs. an oracle, wall-clock, tokens, agent rounds, crash count, etc.), and
> emit a delta report a reviewer can actually interpret.*
>
> And the three component carveouts:
>
> - Static task corpus - hand curated set of initial development tasks
>   (e.g. "Add a new Polaris privilege") that provides basic cross-cutting
>   signal
> - Task synthesizer - More advanced meta-evolution step - the agentic
>   driver of the harness can intelligently synthesize tasks that exercise
>   newly identified segments of coding complexity
> - Eval harness - the overall framework that isolates subagents, sets up
>   the task experiments, collects metrics, etc.
>
> I have an initial v1 available for review:
> https://github.com/apache/polaris/pull/4519
>
> This includes the end-to-end working v1 eval harness and prospective
> initial set of static tasks, no codified task synthesizer yet. I ran an
> initial meta-eval on it with three models (Claude Haiku 4.5, Claude Opus
> 4.7, and Codex GPT 5.4) and just the "add new privilege" task; more
> detailed results posted in the PR, abridged here - we should iterate a bit
> more on the task corpus, but at least it's a proof-of-concept of the
> end-to-end flow.
>
> ## Task & fixture
>
> - **Task**: `tasks/seed/T-priv-add.yaml` - add the enum constant
>   `LIST_NAMESPACE_TABLES_RECURSIVE` to `PolarisAuthorizableOperation`,
>   ensure compile + `*PolarisAuthorizer*` tests pass without modifying
>   any test file. The task is a *probe* of the authorizer SPI: a naive
>   one-file edit (enum only) trips the static initializer in
>   `RbacOperationSemantics.java` and breaks 4 tests; the correct two-file
>   change (enum + register call) passes.
> - **BEFORE ref**: `568a8883` (Polaris main HEAD on 2026-05-16).
> - **AFTER ref**: `c9b37227` (TEMP local fixture: AGENTS.md +100 lines -
>   "Recipes for Common Extension Tasks" section that explicitly tells
>   agents to also edit `RbacOperationSemantics.register(...)`). The
>   fixture only changes `AGENTS.md`; no source code differs between BEFORE
>   and AFTER.
>
> The task's deterministic verifier runs out-of-band from the worker
> agent (separate `bash` subprocess after the worker's transcript is
> captured) so worker self-reports cannot fake a PASS.
>
> ## Headline results
>
> | Cell | Verdict | Wall (s) | Cost (USD) | Tokens out | Turns | Files in diff |
> |------|---------|---------:|-----------:|-----------:|------:|---------------|
> | haiku-base | PASS | 270 | $0.362 | 9374 | 59 | 2 (enum + Rbac) |
> | haiku-after | PASS | 157 | $0.226 | 5657 | 36 | 2 (enum + Rbac) |
> | opus-base | PASS | 204 | $1.481 | 10112 | 24 | 2 (enum + Rbac) |
> | opus-after | PASS | 124 | $0.854 | 5150 | 15 | 2 (enum + Rbac) |
> | codex-base | **FAIL** | 37 | n/a | n/a | n/a | **1 (enum only)** |
> | codex-after | PASS | 39 | n/a | n/a | n/a | 2 (enum + Rbac) |
>
> Per-arm deltas (BEFORE → AFTER, AFTER doc helps):
>
> | Model | Wall Δ | Cost Δ | Turns Δ | Verdict Δ |
> |--------|-------:|--------:|--------:|-----------|
> | haiku | -42% | -38% | -39% | PASS → PASS (soft-improvement) |
> | opus | -39% | -42% | -38% | PASS → PASS (soft-improvement) |
> | codex | +5% | n/a | n/a | **FAIL → PASS** (hard improvement) |
>
> Total: 6 cells, 13m 49s wall, $2.92 spend. One discriminating
> verdict-flip + two consistent ~40% cost reductions on the same
> task - clear, replicable signal that the AGENTS.md recipe addition is
> agent-load-bearing.
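
For reference, the per-arm deltas in the quoted tables can be reproduced from the headline results with a few lines of Python. This is only a minimal sketch of the arithmetic, not the PR's harness code: `CELLS`, `pct_delta` and `verdict_delta` are hypothetical names introduced here, and the numbers are copied from the headline results table quoted above.

```python
from typing import Optional

# (passed, wall seconds, cost USD, turns), copied from the headline table.
# Codex cost and turns are "n/a" in the table, so they are None here.
CELLS = {
    "haiku": {"before": (True, 270, 0.362, 59), "after": (True, 157, 0.226, 36)},
    "opus": {"before": (True, 204, 1.481, 24), "after": (True, 124, 0.854, 15)},
    "codex": {"before": (False, 37, None, None), "after": (True, 39, None, None)},
}


def pct_delta(before, after) -> Optional[int]:
    """Relative change from BEFORE to AFTER as a whole percent (hypothetical helper).

    Returns None when either side was not reported.
    """
    if before is None or after is None:
        return None
    return round(100 * (after - before) / before)


def verdict_delta(before_passed: bool, after_passed: bool) -> str:
    """Render the verdict transition, e.g. 'FAIL -> PASS' (hypothetical helper)."""
    def label(ok: bool) -> str:
        return "PASS" if ok else "FAIL"
    return f"{label(before_passed)} -> {label(after_passed)}"


for model, arms in CELLS.items():
    b, a = arms["before"], arms["after"]
    print(
        model,
        pct_delta(b[1], a[1]),   # wall-clock delta
        pct_delta(b[2], a[2]),   # cost delta
        pct_delta(b[3], a[3]),   # turns delta
        verdict_delta(b[0], a[0]),
    )
```

A negative percentage means the AFTER ref was cheaper or faster than the BEFORE ref on that metric; a verdict transition of `FAIL -> PASS` corresponds to what the quoted table calls a hard improvement.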
