justinmclean opened a new pull request, #206:
URL: https://github.com/apache/airflow-steward/pull/206
### What
59 new behavioral eval cases across three skills, plus two bug fixes to the
skills themselves
that were uncovered during eval authoring and test runs.
---
### Evals added
| Skill | Suite | Cases | What it covers |
|---|---|---|---|
| `pr-management-triage` | `pre-filter` | 10 | F1–F6, row-6, row-7a
pre-filters |
| `pr-management-triage` | `decision-table` | 16 | Rows 7b, 9–16, 18–22 |
| `pr-management-stats` | `classify` | 6 | Triage marker detection, both
marker forms, sub-states, staleness, legacy HTML comment |
| `pr-management-stats` | `pressure-weight` | 7 | All weight tiers: 0
(collaborator), 0 (draft), 1 (ready label), 1 (fresh untriaged), 2 (stale
triaged_waiting), 3 (untriaged 7–28 d), 5 (untriaged ≥28 d) |
| `pr-management-mentor` | `tone-checks` | 15 | Hard-fail rules 1–10,
soft-fail rules 11–14, clean pass |
| `pr-management-mentor` | `hand-off` | 5 | All four triggers (priority
4→3→1→2), no-trigger baseline |
Each suite follows the existing `pr-management-code-review` fixture
structure:
`system-prompt.md` + `user-prompt-template.md` + per-case `report.md` /
`expected.json`.
READMEs added to all three eval directories.
---
### Skill bug fixes
#### 1. `pr-management-stats/classify.md` — staleness rule contradicted
`triaged_responded`
The triage marker definition required the marker comment to post-date the
PR's last commit.
The `triaged_responded` sub-state said a post-triage commit by the author
counts as a response.
These two rules were directly contradictory: if an author pushes after
triage, the staleness
rule would un-triage the PR before the `triaged_responded` path could apply.
**Fix:** added an explicit exception to the staleness bullet — "if the PR
author subsequently
pushes a commit after the triage comment, do not treat the marker as stale;
classify as
`triaged_responded` instead." The exception now lives on the staleness rule
itself rather than
being a hidden override buried two sections below.
#### 2. `pr-management-triage/classify-and-act.md` — Row 16 missing `author
NOT first-time` guard (eval only)
Row 16 ("no real CI ran → rebase") in the decision-table eval system prompt
was missing the
guard that exists in the real skill: `AND author NOT first-time`. Without
it, a first-time
contributor's stale draft would be routed to `rebase` instead of the
stale-sweep path (Row 21).
The real `classify-and-act.md` already has this guard; the eval system
prompt was an incomplete
transcription.
**Fix:** added `AND authorAssociation NOT IN {FIRST_TIME_CONTRIBUTOR,
FIRST_TIMER}` to Row 16
in the eval decision-table system prompt to match the live skill.
---
### Checklist
- [x] 59 eval cases load cleanly with `skill-eval` runner
- [x] All 59 cases pass model evaluation
- [x] `classify.md` exception is consistent with the existing
`triaged_responded` prose (lines 50–53)
- [x] Eval Row 16 guard matches `classify-and-act.md` line 93
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]