justinmclean opened a new pull request, #252:
URL: https://github.com/apache/airflow-steward/pull/252

   > **Generated by the spec-driven build loop.** This eval suite and the docs
   > update were produced by an autonomous run of `tools/spec-loop` 
(`./loop.sh` —
   > one work item, one branch, one PR). Authored by Claude (see the 
`Generated-by`
   > commit trailer) and reviewed + tested by a human before submission.
   
   ## What
   
   Adds the missing `intervention` eval suite (8 cases) to the existing
   `pr-management-mentor` skill's eval tree, covering the intervention-selection
   decision: the out-of-scope and maintainer-engaged checks, the four 
intervention
   templates, the multi-trigger (`ask`) path, the no-trigger (`silent`) path, 
and
   the hand-off triggers.
   
   Also syncs `docs/modes.md`: the Mentoring row moves from `proposed / 0 
skills`
   to `experimental / 1 skill`, and the section points at the shipped skill 
rather
   than a forward reference.
   
   ## Why
   
   `pr-management-mentor` shipped without a matching eval suite for its
   intervention-selection step, and the framework treats a skill without evals 
as
   incomplete. This back-fills that coverage so the skill's decision logic is
   pinned by fixtures.
   
   ## Changes
   
   - `tools/skill-evals/evals/pr-management-mentor/intervention/` — 8-case eval
     suite (system prompt, user-prompt template, case fixtures).
   - `docs/modes.md` — Mentoring row + skill table.
   
   ## Testing — and an issue the loop did not detect
   
   The suite assembles cleanly and, after the fix below, an independent agent 
run
   matches all 8 cases against their ground truth.
   
   On the **first** test pass, case-4 (`why-pushback`) failed. Its 
`expected.json`
   is `handoff` — which is correct per the skill: the contributor argues *after*
   the agent already answered the "why" once, firing the skill's **hand-off 
trigger
   2** ("answer the why once, don't argue", defined in `hand-off.md`). But the
   eval's own `system-prompt.md` never encoded the hand-off triggers, so a model
   following the prompt returned `draft / 4` instead. The build loop generated 
an
   eval whose ground truth assumed skill logic its own prompt left out — a
   self-inconsistency **the loop did not catch**. Fixed here by adding the four
   hand-off triggers (documented `4 → 3 → 1 → 2` order) to `system-prompt.md`;
   re-running the independent check then passed 8/8.
   
   ## Notes
   
   - Eval + docs only; no skill behaviour (`SKILL.md`) is changed.
   - The loop-detection gap is called out deliberately: it's a concrete data 
point
     that the build loop needs a self-consistency check between an eval's 
fixtures
     and the prompt it ships.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to