Hi Dennis,

This proposal looks interesting, but I'm not sure I understand the purpose
:) The doc and the PR give a lot of information about what happens, but
almost nothing about "why" (at least I could not easily deduce that).

Could you expand your proposal a bit on that aspect?

More specifically, what is the "quantitative A/B delta" exactly? How is it
envisioned to be used?

Thanks,
Dmitri.

On Thu, May 21, 2026 at 5:13 AM Dennis Huo <[email protected]> wrote:

> Hi all,
>
> Now that agentic development is evolving to be a more fundamental and
> pervasive tool, I wanted to explore ways to address both a "need" and an
> "opportunity" under one umbrella - adding an agentic (meta-)skill to start
> codifying a way for us to bake in quantifiable metrics to the impact of
> "non-functional" changes on repository "health" (in terms of extensibility
> and maintainability).
>
> Basically, if we extrapolate from getting into the habit of formalizing our
> AGENTS.md files towards likely adding well-defined agent "skills" for
> repeatable agentic workflows, and those becoming more ingrained in the
> development process over time, the basic "need" is to standardize our evals
> against the addition of new skills and mdfile documentation, but also to
> recognize the opportunity of addressing three related types of
> nonfunctional changes:
>
> 1. Refactoring code - sometimes subjective, sometimes partially objective
> (consolidating duplicate code), but the *effects* are rarely quantifiable
> 2. Adding documentation/code comments - Generally regarded as being good,
> but sometimes verbosity can hurt, and certainly "incorrect" documentation
> can hurt
> 3. Addition of agent skills or rules - possibly manually tested to some
> extent when added, but usually not consistently and rarely with
> reproducible evals
>
> To that end I put together this proposal doc with some lightweight design
> elements for this agentic skill:
>
>
> https://docs.google.com/document/d/1RE5mGcrMLbmi8sglkHuJKxORVNiuiZ69da1weqwpGjE/edit?tab=t.0
>
> Would love to discuss folks' thoughts here or in comments in the doc.
> Recapping the core concept from the doc:
>
> *Treat any candidate change as an intervention in a measurable A/B. Take a
> baseline ref and a candidate ref, run a fixed set of agent-driven sample
> tasks against both refs, collect a small number of metrics (success vs. an
> oracle, wall-clock, tokens, agent rounds, crash count, etc), and emit a
> delta report a reviewer can actually interpret.*
>
> And the three component carveouts:
>
>    - Static task corpus - hand-curated set of initial development tasks
>    (e.g. "Add a new Polaris privilege") that provides basic cross-cutting
>    signal
>    - Task synthesizer - More advanced meta-evolution step - the agentic
>    driver of the harness can intelligently synthesize tasks that exercise
>    newly identified segments of coding complexity
>    - Eval harness - the overall framework that isolates subagents, sets up
>    the task experiments, collects metrics, etc.
>
> I have an initial v1 available for review:
> https://github.com/apache/polaris/pull/4519
>
> This includes the end-to-end working v1 eval harness and prospective
> initial set of static tasks, but no codified task synthesizer yet. I ran an
> initial meta-eval on it with three models (Claude Haiku 4.5, Claude Opus
> 4.7, and Codex GPT 5.4) and just the "add new privilege" task; more
> detailed results posted in the PR, abridged here - we should iterate a bit
> more on the task corpus, but at least it's a proof-of-concept of the
> end-to-end flow.
>
> ## Task & fixture
>
> - **Task**: `tasks/seed/T-priv-add.yaml` — add the enum constant
> `LIST_NAMESPACE_TABLES_RECURSIVE` to `PolarisAuthorizableOperation`,
> ensure compile + `*PolarisAuthorizer*` tests pass without modifying
> any test file. The task is a *probe* of the authorizer SPI: a naive
> one-file edit (enum only) trips the static initializer in
> `RbacOperationSemantics.java` and breaks 4 tests; the correct two-file
> change (enum + register call) passes.
> - **BEFORE ref**: `568a8883` (Polaris main HEAD on 2026-05-16).
> - **AFTER ref**: `c9b37227` (TEMP local fixture: AGENTS.md +100 lines —
> "Recipes for Common Extension Tasks" section that explicitly tells
> agents to also edit `RbacOperationSemantics.register(...)`). The
> fixture only changes `AGENTS.md`; no source code differs between BASE
> and AFTER.
>
> The task's deterministic verifier runs out-of-band from the worker
> agent (separate `bash` subprocess after the worker's transcript is
> captured) so worker self-reports cannot fake a PASS.
>
> ## Headline results
>
> | Cell | Verdict | Wall (s) | Cost (USD) | Tokens out | Turns | Files in diff |
> |------|---------|---------:|-----------:|-----------:|------:|---------------|
> | haiku-base | PASS | 270 | $0.362 | 9374 | 59 | 2 (enum + Rbac) |
> | haiku-after | PASS | 157 | $0.226 | 5657 | 36 | 2 (enum + Rbac) |
> | opus-base | PASS | 204 | $1.481 | 10112 | 24 | 2 (enum + Rbac) |
> | opus-after | PASS | 124 | $0.854 | 5150 | 15 | 2 (enum + Rbac) |
> | codex-base | **FAIL** | 37 | n/a | n/a | n/a | **1 (enum only)** |
> | codex-after | PASS | 39 | n/a | n/a | n/a | 2 (enum + Rbac) |
>
> Per-arm deltas (BEFORE → AFTER, AFTER doc helps):
>
> | Model | Wall Δ | Cost Δ | Turns Δ | Verdict Δ |
> |--------|-------:|--------:|--------:|-----------|
> | haiku | -42% | -38% | -39% | PASS → PASS (soft improvement) |
> | opus | -39% | -42% | -38% | PASS → PASS (soft improvement) |
> | codex | +5% | n/a | n/a | **FAIL → PASS** (hard improvement) |
>
> Total: 6 cells, 13m 49s wall, $2.92 spend. One discriminating
> verdict-flip + two consistent ~40% cost reductions on the same
> task — clear, replicable signal that the AGENTS.md recipe addition is
> agent-load-bearing.
>
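
For reference, the per-arm deltas quoted above look like plain relative changes, (AFTER - BEFORE) / BEFORE, rounded to whole percent. The sketch below is only an illustration under that assumption; `Cell`, `pct_delta`, and `arm_delta` are hypothetical names and are not taken from the PR or the harness.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Cell:
    """One (model, ref) run; field names are hypothetical, not the harness's."""
    verdict: str
    wall_s: float
    cost_usd: Optional[float] = None  # None where the table reports n/a
    turns: Optional[int] = None


def pct_delta(before, after):
    """Relative change in whole percent, or None if either side is missing."""
    if before is None or after is None or before == 0:
        return None
    return round((after - before) / before * 100)


def arm_delta(before: Cell, after: Cell) -> dict:
    """Compare the BEFORE and AFTER cells of a single model."""
    return {
        "wall_pct": pct_delta(before.wall_s, after.wall_s),
        "cost_pct": pct_delta(before.cost_usd, after.cost_usd),
        "turns_pct": pct_delta(before.turns, after.turns),
        "verdict": f"{before.verdict} -> {after.verdict}",
    }


# Inputs copied from the "Headline results" table in the quoted message.
haiku = arm_delta(Cell("PASS", 270, 0.362, 59), Cell("PASS", 157, 0.226, 36))
codex = arm_delta(Cell("FAIL", 37), Cell("PASS", 39))
print(haiku)
print(codex)
```

Read this way, a negative percentage means the AFTER ref was cheaper or faster, and a changed verdict string marks a FAIL/PASS flip.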
