[PR] Add behavioral evals for pr-management-triage, pr-management-stats, and pr-management-mentor; fix two spec bugs [airflow-steward]

via GitHub Sun, 17 May 2026 22:04:20 -0700


justinmclean opened a new pull request, #206:
URL: https://github.com/apache/airflow-steward/pull/206


   ### What
   
   59 new behavioral eval cases across three skills, plus two bug fixes to the 
skills themselves
   that were uncovered during eval authoring and test runs.
   
   ---
   
   ### Evals added
   
   | Skill | Suite | Cases | What it covers |
   |---|---|---|---|
   | `pr-management-triage` | `pre-filter` | 10 | F1–F6, row-6, row-7a 
pre-filters |
   | `pr-management-triage` | `decision-table` | 16 | Rows 7b, 9–16, 18–22 |
   | `pr-management-stats` | `classify` | 6 | Triage marker detection, both 
marker forms, sub-states, staleness, legacy HTML comment |
   | `pr-management-stats` | `pressure-weight` | 7 | All weight tiers: 0 
(collaborator), 0 (draft), 1 (ready label), 1 (fresh untriaged), 2 (stale 
triaged_waiting), 3 (untriaged 7–28 d), 5 (untriaged ≥28 d) |
   | `pr-management-mentor` | `tone-checks` | 15 | Hard-fail rules 1–10, 
soft-fail rules 11–14, clean pass |
   | `pr-management-mentor` | `hand-off` | 5 | All four triggers (priority 
4→3→1→2), no-trigger baseline |
   
   Each suite follows the existing `pr-management-code-review` fixture 
structure:
   `system-prompt.md` + `user-prompt-template.md` + per-case `report.md` / 
`expected.json`.
   READMEs added to all three eval directories.
   
   ---
   
   ### Skill bug fixes
   
   #### 1. `pr-management-stats/classify.md` — staleness rule contradicted 
`triaged_responded`
   
   The triage marker definition required the marker comment to post-date the 
PR's last commit.
   The `triaged_responded` sub-state said a post-triage commit by the author 
counts as a response.
   These two rules were directly contradictory: if an author pushes after 
triage, the staleness
   rule would un-triage the PR before the `triaged_responded` path could apply.
   
   **Fix:** added an explicit exception to the staleness bullet — "if the PR 
author subsequently
   pushes a commit after the triage comment, do not treat the marker as stale; 
classify as
   `triaged_responded` instead." The exception now lives on the staleness rule 
itself rather than
   being a hidden override buried two sections below.
   
   #### 2. `pr-management-triage/classify-and-act.md` — Row 16 missing `author 
NOT first-time` guard (eval only)
   
   Row 16 ("no real CI ran → rebase") in the decision-table eval system prompt 
was missing the
   guard that exists in the real skill: `AND author NOT first-time`. Without 
it, a first-time
   contributor's stale draft would be routed to `rebase` instead of the 
stale-sweep path (Row 21).
   The real `classify-and-act.md` already has this guard; the eval system 
prompt was an incomplete
   transcription.
   
   **Fix:** added `AND authorAssociation NOT IN {FIRST_TIME_CONTRIBUTOR, 
FIRST_TIMER}` to Row 16
   in the eval decision-table system prompt to match the live skill.
   
   ---
   
   ### Checklist
   
   - [x] 59 eval cases load cleanly with `skill-eval` runner
   - [x] All 59 cases pass model evaluation
   - [x] `classify.md` exception is consistent with the existing 
`triaged_responded` prose (lines 50–53)
   - [x] Eval Row 16 guard matches `classify-and-act.md` line 93


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[PR] Add behavioral evals for pr-management-triage, pr-management-stats, and pr-management-mentor; fix two spec bugs [airflow-steward]

Reply via email to