Re: [DISCUSS] Proposal - Agentic Eval (Meta-)Skill for Extensibility and Maintainability

Robert Stupp Thu, 28 May 2026 03:22:16 -0700

Hi Dennis,

thanks for writing this up.
I like the general idea of making agent-facing docs/rules less hand-wavy.


One thing I am still struggling with is the model/agent comparison aspect.

A result for "haiku vs opus vs codex" feels hard to interpret, because the
result does not only depend on the model name.
It also depends on the exact model/agent version, context window size,
CLI/tool driving it, injected prompts,
context handling, loaded rules/skills, sandbox/tool permissions, local
Java/Gradle/cache state, timeouts, retries,
network access, operating system/config/binaries, and provider-side model
behavior at a specific point in time.

Even recording all of that only helps explain a result after the fact.
It does not make it reproducible in the same way as a normal coded unit
test is reproducible.
Another contributor usually cannot recreate the exact same
agent/runtime/model state.
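
For illustration, here is a minimal Python sketch of what recording such a
run manifest might look like. The field names, the `build_manifest` and
`manifest_diff` helpers, and the choice of SHA-256 are assumptions made for
this example and are not taken from the proposal or the PR. It captures only
a few of the factors listed above, and it documents a run rather than making
it reproducible.

```python
import hashlib
import json
import platform
import sys


def build_manifest(model, agent_cli, rules_text, timeout_s, network_access):
    """Collect a few locally observable facts about one eval run.

    The schema is hypothetical; a real harness would likely record more.
    """
    return {
        "model": model,
        "agent_cli": agent_cli,
        # Hash the loaded rules so two runs can be compared without storing the text.
        "rules_sha256": hashlib.sha256(rules_text.encode("utf-8")).hexdigest(),
        "timeout_s": timeout_s,
        "network_access": network_access,
        "os": platform.system(),
        "os_release": platform.release(),
        "python": sys.version.split()[0],
    }


def manifest_diff(a, b):
    """Return the sorted keys whose values differ between two manifests."""
    return sorted(k for k in a.keys() | b.keys() if a.get(k) != b.get(k))


if __name__ == "__main__":
    base = build_manifest("model-x", "cli-1.0", "rules v1", 600, False)
    after = build_manifest("model-x", "cli-1.0", "rules v2", 600, False)
    print(json.dumps(base, indent=2))
    # Lists the recorded fields that differ between the two runs.
    print(manifest_diff(base, after))
```

A diff like this can show which recorded factors changed between two runs,
but provider-side model behavior is not observable locally and so cannot
appear in it.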

So I worry that this axis may look more stable/actionable than it really is.
At most, I think it can support an author's local note like "under my exact
setup, this AGENTS.md change helped on this exact task".
That may be useful background while developing the change.

Maybe the first step is to decide whether the result is meant to provide
evidence for accepting a change, or just to support a contributor locally
while developing agent-facing docs.
I think the latter is much easier to defend.

Because of that, I would prefer not to make the model/agent comparison a
central part of the proposal.
If we still want to experiment with this, I think it should stay outside
the normal validation path until we have a much better answer for what is
actually reproducible and actionable for reviewers.

Robert


On Thu, May 28, 2026 at 3:37 AM Yufei Gu <[email protected]> wrote:

> Hi Dennis
>
> Thanks for raising it. It looks like a cool idea, and +1 on experimenting
> in the Polaris repo to make Polaris more agent-friendly.
>
> My main concerns are benchmark coverage and long-term maintenance cost.
>
> For coverage, a small static task corpus may overfit to a few known
> workflows or repository conventions. A model could appear to improve simply
> because the benchmark captures patterns already encoded in AGENTS.md, while
> missing broader extensibility or maintainability issues elsewhere in the
> codebase. The task synthesizer direction may help, but generating
> representative and non-gameable tasks seems challenging on its own.
>
> For maintenance cost, I suspect the benchmark corpus and verifiers could
> gradually become another subsystem we need to maintain alongside the
> codebase itself. As Polaris evolves, tasks, fixtures, assertions, and
> expected outcomes will drift too. Keeping evals deterministic, stable, and
> still representative over time could become expensive.
>
> That said, I still think the direction is interesting, especially as a
> lightweight signal for agent friendliness. I would probably start with a
> very small and highly deterministic scope first.
>
> Ideally the evaluation could run in CI, but getting an LLM sponsor may be
> difficult. In practice, contributors may need to run the evals themselves.
> With that, I suggest integrating with Gradle commands or creating commands
> to make local execution easier.
>
> Yufei
>
>
> On Thu, May 21, 2026 at 2:13 AM Dennis Huo <[email protected]> wrote:
>
> > Hi all,
> >
> > Now that agentic development is evolving to be a more fundamental and
> > pervasive tool, I wanted to explore ways to address both a "need" and an
> > "opportunity" under one umbrella - adding an agentic (meta-)skill to start
> > codifying a way for us to attach quantifiable metrics to the impact of
> > "non-functional" changes on repository "health" (in terms of extensibility
> > and maintainability).
> >
> > Basically, if we extrapolate from getting into the habit of formalizing our
> > AGENTS.md files towards likely adding well-defined agent "skills" for
> > repeatable agentic workflows, and those becoming more ingrained in the
> > development process over time, the basic "need" is to standardize our evals
> > against the addition of new skills and Markdown documentation, but also to
> > recognize the opportunity of addressing three related types of
> > non-functional changes:
> >
> > 1. Refactoring code - sometimes subjective, sometimes partially objective
> > (consolidating duplicate code), but the *effects* are rarely quantifiable
> > 2. Adding documentation/code comments - Generally regarded as being good,
> > but sometimes verbosity can hurt, and certainly "incorrect" documentation
> > can hurt
> > 3. Addition of agent skills or rules - possibly manually tested to some
> > extent when added, but usually not consistently and rarely with
> > reproducible evals
> >
> > To that end I put together this proposal doc with some lightweight design
> > elements for this agentic skill:
> >
> >
> >
> > https://docs.google.com/document/d/1RE5mGcrMLbmi8sglkHuJKxORVNiuiZ69da1weqwpGjE/edit?tab=t.0
> >
> > Would love to discuss folks' thoughts here or in comments in the doc.
> > Recapping the core concept from the doc:
> >
> > *Treat any candidate change as an intervention in a measurable A/B. Take a
> > baseline ref and a candidate ref, run a fixed set of agent-driven sample
> > tasks against both refs, collect a small number of metrics (success vs. an
> > oracle, wall-clock, tokens, agent rounds, crash count, etc.), and emit a
> > delta report a reviewer can actually interpret.*
> >
> > And the three component carveouts:
> >
> >    - Static task corpus - hand curated set of initial development tasks
> >    (e.g. "Add a new Polaris privilege") that provides basic cross-cutting
> >    signal
> >    - Task synthesizer - More advanced meta-evolution step - the agentic
> >    driver of the harness can intelligently synthesize tasks that exercise
> >    newly identified segments of coding complexity
> >    - Eval harness - the overall framework that isolates subagents, sets up
> >    the task experiments, collects metrics, etc.
> >
> > I have an initial v1 available for review:
> > https://github.com/apache/polaris/pull/4519
> >
> > This includes the end-to-end working v1 eval harness and a prospective
> > initial set of static tasks, but no codified task synthesizer yet. I ran an
> > initial meta-eval on it with three models (Claude Haiku 4.5, Claude Opus
> > 4.7, and Codex GPT 5.4) and just the "add new privilege" task; more
> > detailed results are posted in the PR and abridged here. We should iterate
> > a bit more on the task corpus, but at least it's a proof-of-concept of the
> > end-to-end flow.
> >
> > ## Task & fixture
> >
> > - **Task**: `tasks/seed/T-priv-add.yaml` — add the enum constant
> > `LIST_NAMESPACE_TABLES_RECURSIVE` to `PolarisAuthorizableOperation`,
> > ensure compile + `*PolarisAuthorizer*` tests pass without modifying
> > any test file. The task is a *probe* of the authorizer SPI: a naive
> > one-file edit (enum only) trips the static initializer in
> > `RbacOperationSemantics.java` and breaks 4 tests; the correct two-file
> > change (enum + register call) passes.
> > - **BEFORE ref**: `568a8883` (Polaris main HEAD on 2026-05-16).
> > - **AFTER ref**: `c9b37227` (TEMP local fixture: AGENTS.md +100 lines —
> > "Recipes for Common Extension Tasks" section that explicitly tells
> > agents to also edit `RbacOperationSemantics.register(...)`). The
> > fixture only changes `AGENTS.md`; no source code differs between BEFORE
> > and AFTER.
> >
> > The task's deterministic verifier runs out-of-band from the worker
> > agent (separate `bash` subprocess after the worker's transcript is
> > captured) so worker self-reports cannot fake a PASS.
> >
> > ## Headline results
> >
> > | Cell | Verdict | Wall (s) | Cost (USD) | Tokens out | Turns | Files in diff |
> > |------|---------|---------:|-----------:|-----------:|------:|---------------|
> > | haiku-base | PASS | 270 | $0.362 | 9374 | 59 | 2 (enum + Rbac) |
> > | haiku-after | PASS | 157 | $0.226 | 5657 | 36 | 2 (enum + Rbac) |
> > | opus-base | PASS | 204 | $1.481 | 10112 | 24 | 2 (enum + Rbac) |
> > | opus-after | PASS | 124 | $0.854 | 5150 | 15 | 2 (enum + Rbac) |
> > | codex-base | **FAIL** | 37 | n/a | n/a | n/a | **1 (enum only)** |
> > | codex-after | PASS | 39 | n/a | n/a | n/a | 2 (enum + Rbac) |
> >
> > Per-arm deltas (BEFORE → AFTER, AFTER doc helps):
> >
> > | Model | Wall Δ | Cost Δ | Turns Δ | Verdict Δ |
> > |--------|-------:|--------:|--------:|-----------|
> > | haiku | -42% | -38% | -39% | PASS → PASS (soft-improvement) |
> > | opus | -39% | -42% | -38% | PASS → PASS (soft-improvement) |
> > | codex | +5% | n/a | n/a | **FAIL → PASS** (hard improvement) |
> >
> > Total: 6 cells, 13m 49s wall, $2.92 spend. One discriminating
> > verdict-flip + two consistent ~40% cost reductions on the same
> > task — clear, replicable signal that the AGENTS.md recipe addition is
> > agent-load-bearing.
> >
>
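
The per-arm deltas in the quoted results can be recomputed from the headline
table. Below is a minimal Python sketch; the `CELLS` layout and the function
names are illustrative assumptions, not the harness's actual code. Values
reported as n/a are carried as `None`.

```python
def pct_delta(before, after):
    """Relative change from before to after, as a rounded percentage."""
    return round((after - before) / before * 100)


# (wall seconds, cost USD, turns) copied from the headline results table;
# None marks the values reported as n/a.
CELLS = {
    "haiku": {"base": (270, 0.362, 59), "after": (157, 0.226, 36)},
    "opus": {"base": (204, 1.481, 24), "after": (124, 0.854, 15)},
    "codex": {"base": (37, None, None), "after": (39, None, None)},
}


def per_arm_deltas(cells):
    """Map each model to its (wall, cost, turns) percentage deltas."""
    out = {}
    for model, arms in cells.items():
        out[model] = tuple(
            None if b is None or a is None else pct_delta(b, a)
            for b, a in zip(arms["base"], arms["after"])
        )
    return out


if __name__ == "__main__":
    for model, (wall, cost, turns) in per_arm_deltas(CELLS).items():
        print(model, wall, cost, turns)
```

Each cell here is a single run per model and ref, so these percentages
describe one observation per arm rather than a distribution.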
