You can basically think of it as unit tests and/or benchmarks for documentation or agent skills (or codebase health). Since the outcomes can't always be pass/fail, we also need a sliding-scale measure of the degree of success or failure.
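
To make the "pass/fail plus sliding scale" idea a bit more concrete, here is a rough Python sketch of how a single before/after comparison might be summarized. The `Run` type, the `compare` function, and the label strings are hypothetical names for illustration, not the harness's actual API; the labels just mirror the "hard" vs. "soft" improvement wording in the results quoted below.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One agent run against one ref (hypothetical shape, not the real schema)."""
    passed: bool   # verdict from the out-of-band verifier
    cost: float    # any lower-is-better metric: seconds, USD, turns, ...


def compare(before: Run, after: Run) -> str:
    """Label a before/after pair: verdict flips dominate, else compare the metric."""
    if not before.passed and after.passed:
        return "hard improvement"      # FAIL -> PASS
    if before.passed and not after.passed:
        return "hard regression"       # PASS -> FAIL
    if not before.passed:
        return "no signal"             # FAIL -> FAIL: failed runs are not comparable
    if after.cost < before.cost:
        return "soft improvement"      # PASS -> PASS, cheaper
    if after.cost > before.cost:
        return "soft regression"       # PASS -> PASS, more expensive
    return "no change"


# Wall-clock seconds taken from two rows of the quoted results.
print(compare(Run(True, 270), Run(True, 157)))   # soft improvement
print(compare(Run(False, 37), Run(True, 39)))    # hard improvement
```

A real version would presumably want repeated runs and some noise threshold before calling a small difference an improvement; this sketch ignores both.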
If we didn't have LLMs, we could in theory still have "tested" documentation by locking new developers who know nothing about the project in a room with a sample coding task. Group A gets the updated docs; Group B gets the old docs. Measure how many of them succeed and how long they take, and ask them how hard the task was. If Group A consistently takes 30 minutes to finish and Group B takes 60 minutes, you have a delta of 30 minutes. The harness measures the same kind of delta, with agents standing in for the new developers.

On Thu, May 21, 2026 at 12:35 PM Dmitri Bourlatchkov <[email protected]> wrote:

> Hi Dennis,
>
> This proposal looks interesting, but I'm not sure I understand the purpose
> :) The doc and the PR give a lot of information about what happens, but
> almost nothing about "why" (at least I could not easily deduce that).
>
> Could you expand your proposal a bit on that aspect?
>
> More specifically, what is the "quantitative A/B delta" exactly? How is it
> envisioned to be used?
>
> Thanks,
> Dmitri.
>
> On Thu, May 21, 2026 at 5:13 AM Dennis Huo <[email protected]> wrote:
>
> > Hi all,
> >
> > Now that agentic development is evolving to be a more fundamental and
> > pervasive tool, I wanted to explore ways to address both a "need" and an
> > "opportunity" under one umbrella - adding an agentic (meta-)skill to
> > start codifying a way for us to bake in quantifiable metrics to the
> > impact of "non-functional" changes on repository "health" (in terms of
> > extensibility and maintainability).
> >
> > Basically, if we extrapolate from getting into the habit of formalizing
> > our AGENTS.md files towards likely adding well-defined agent "skills" for
> > repeatable agentic workflows, and those becoming more ingrained in the
> > development process over time, the basic "need" is to standardize our
> > evals against the addition of new skills and mdfile documentation, but
> > also to recognize the opportunity of addressing three related types of
> > nonfunctional changes:
> >
> > 1. Refactoring code - sometimes subjective, sometimes partially objective
> >    (consolidating duplicate code), but the *effects* are rarely
> >    quantifiable
> > 2. Adding documentation/code comments - Generally regarded as being good,
> >    but sometimes verbosity can hurt, and certainly "incorrect"
> >    documentation can hurt
> > 3. Addition of agent skills or rules - possibly manually tested to some
> >    extent when added, but usually not consistently and rarely with
> >    reproducible evals
> >
> > To that end I put together this proposal doc with some lightweight design
> > elements for this agentic skill:
> >
> > https://docs.google.com/document/d/1RE5mGcrMLbmi8sglkHuJKxORVNiuiZ69da1weqwpGjE/edit?tab=t.0
> >
> > Would love to discuss folks' thoughts here or in comments in the doc.
> > Recapping the core concept from the doc:
> >
> > *Treat any candidate change as an intervention in a measurable A/B. Take
> > a baseline ref and a candidate ref, run a fixed set of agent-driven
> > sample tasks against both refs, collect a small number of metrics
> > (success vs. an oracle, wall-clock, tokens, agent rounds, crash count,
> > etc), and emit a delta report a reviewer can actually interpret.*
> >
> > And the three component carveouts:
> >
> > - Static task corpus - hand curated set of initial development tasks
> >   (e.g. "Add a new Polaris privilege") that provides basic cross-cutting
> >   signal
> > - Task synthesizer - More advanced meta-evolution step - the agentic
> >   driver of the harness can intelligently synthesize tasks that exercise
> >   newly identified segments of coding complexity
> > - Eval harness - the overall framework for isolating subagents, sets up
> >   the task experiments, collects metrics, etc.
> >
> > I have an initial v1 available for review:
> > https://github.com/apache/polaris/pull/4519
> >
> > This includes the end-to-end working v1 eval harness and prospective
> > initial set of static tasks, no codified task synthesizer yet.
> > I ran an initial meta-eval on it with three models (Claude Haiku 4.5,
> > Claude Opus 4.7, and Codex GPT 5.4) and just the "add new privilege"
> > task; more detailed results posted in the PR, abridged here - we should
> > iterate a bit more on the task corpus, but at least it's a
> > proof-of-concept of the end-to-end flow.
> >
> > ## Task & fixture
> >
> > - **Task**: `tasks/seed/T-priv-add.yaml` - add the enum constant
> >   `LIST_NAMESPACE_TABLES_RECURSIVE` to `PolarisAuthorizableOperation`,
> >   ensure compile + `*PolarisAuthorizer*` tests pass without modifying
> >   any test file. The task is a *probe* of the authorizer SPI: a naive
> >   one-file edit (enum only) trips the static initializer in
> >   `RbacOperationSemantics.java` and breaks 4 tests; the correct two-file
> >   change (enum + register call) passes.
> > - **BEFORE ref**: `568a8883` (Polaris main HEAD on 2026-05-16).
> > - **AFTER ref**: `c9b37227` (TEMP local fixture: AGENTS.md +100 lines,
> >   a "Recipes for Common Extension Tasks" section that explicitly tells
> >   agents to also edit `RbacOperationSemantics.register(...)`). The
> >   fixture only changes `AGENTS.md`; no source code differs between BASE
> >   and AFTER.
> >
> > The task's deterministic verifier runs out-of-band from the worker
> > agent (separate `bash` subprocess after the worker's transcript is
> > captured) so worker self-reports cannot fake a PASS.
> >
> > ## Headline results
> >
> > | Cell | Verdict | Wall (s) | Cost (USD) | Tokens out | Turns | Files in diff |
> > |------|---------|---------:|-----------:|-----------:|------:|---------------|
> > | haiku-base | PASS | 270 | $0.362 | 9374 | 59 | 2 (enum + Rbac) |
> > | haiku-after | PASS | 157 | $0.226 | 5657 | 36 | 2 (enum + Rbac) |
> > | opus-base | PASS | 204 | $1.481 | 10112 | 24 | 2 (enum + Rbac) |
> > | opus-after | PASS | 124 | $0.854 | 5150 | 15 | 2 (enum + Rbac) |
> > | codex-base | **FAIL** | 37 | n/a | n/a | n/a | **1 (enum only)** |
> > | codex-after | PASS | 39 | n/a | n/a | n/a | 2 (enum + Rbac) |
> >
> > Per-arm deltas (BEFORE → AFTER, AFTER doc helps):
> >
> > | Model | Wall Δ | Cost Δ | Turns Δ | Verdict Δ |
> > |--------|-------:|--------:|--------:|-----------|
> > | haiku | -42% | -38% | -39% | PASS → PASS (soft-improvement) |
> > | opus | -39% | -42% | -38% | PASS → PASS (soft-improvement) |
> > | codex | +5% | n/a | n/a | **FAIL → PASS** (hard improvement) |
> >
> > Total: 6 cells, 13m 49s wall, $2.92 spend. One discriminating
> > verdict-flip + two consistent ~40% cost reductions on the same
> > task - clear, replicable signal that the AGENTS.md recipe addition is
> > agent-load-bearing.

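For anyone who wants to check the arithmetic: the per-arm percentages in the quoted results are plain relative changes, (after - before) / before. Here is a minimal Python sketch that recomputes them from the headline table. The `CELLS` layout and the `pct_delta` / `fmt` helpers are illustrative names, not part of the harness; the codex cost and turns were reported as n/a, so they are `None` here.

```python
# (wall seconds, cost USD, turns) copied from the headline results table.
# None marks the n/a entries. This layout is illustrative, not the harness's format.
CELLS = {
    "haiku": {"before": (270, 0.362, 59), "after": (157, 0.226, 36)},
    "opus": {"before": (204, 1.481, 24), "after": (124, 0.854, 15)},
    "codex": {"before": (37, None, None), "after": (39, None, None)},
}


def pct_delta(before, after):
    """Relative change in percent, or None when either side is missing."""
    if before is None or after is None:
        return None
    return 100.0 * (after - before) / before


def fmt(delta):
    """Render a delta as a signed whole percentage, or n/a."""
    return "n/a" if delta is None else f"{delta:+.0f}%"


for model, arms in CELLS.items():
    wall, cost, turns = (
        pct_delta(b, a) for b, a in zip(arms["before"], arms["after"])
    )
    print(f"{model:6} wall {fmt(wall)}  cost {fmt(cost)}  turns {fmt(turns)}")
```

This prints -42% / -38% / -39% for haiku, -39% / -42% / -38% for opus, and +5% wall for codex, which matches the per-arm table.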