justinmclean opened a new pull request, #370:
URL: https://github.com/apache/airflow-steward/pull/370
## Summary
The exact-equality comparator in `tools/skill-evals` was rejecting
candidate answers that carried the right decision but differed in
wording, and was failing whole cases on schema details the model
chose not to emit. This change rewrites comparison around three
ideas, all default-on with `--exact` as the opt-out.
1. Field-aware grading. Decision fields (booleans, enums, counts,
ordering, IDs) stay on exact equality. Free-text fields
(rationale, reason, drop_reason, blockers, notes, summary,
explanation, details, description) are sent to a cheap judge
model with a fixed rubric and parsed back as
`{match, reason}`. Default grader is `claude -p --model haiku`;
override with `--grader-cli "<command>"`. Per-fixtures-dir
`grading-schema.json` overrides the prose-field set.
2. Batched grader call. All prose-field mismatches for a single
case are collected and sent in one rubric prompt, so a case
with N wording diffs costs one Haiku call, not N. Decision-
field failures skip the grader entirely.
3. Intersection-only key matching. Only keys present in both
actual and expected are asserted. Extra keys in the model
output are ignored; keys declared in expected that the model
did not emit are skipped, not failed. expected.json describes
what the answer should say where it speaks, not a required-
keys schema. Suite authors should keep expected.json focused
on the keys that carry the eval's signal.
Non-JSON model output and non-zero CLI exits are now wrapped
silently as `{raw_output, ...}` and fed to the same intersection
comparator. Cases that produce a prose refusal still pass unless
expected.json declares a `raw_output` key. Statuses remain
PASS / FAIL / ERROR / MANUAL; the wrap is an implementation
detail. `--exact` mode preserves the previous strict behavior end
to end.
Test coverage: 40 cases across the new helpers
(`collect_diffs`, `batch_grade_prose_fields`, `compare_with_grader`)
and the end-to-end `--cli` + `--grader-cli` path, plus three small
grader stub scripts under `tests/`. README updated.
## Type of change
<!-- Tick all that apply. -->
- [X] Skill change (`.claude/skills/<name>/`) — eval fixtures updated below
- [ ] Tool / bridge contract (`tools/<system>/*.md`)
- [ ] Python package (`tools/*/` with `pyproject.toml`)
- [ ] Groovy reference impl
- [ ] Cross-cutting (RFC, AGENTS.md, sandbox, privacy-LLM)
- [ ] Documentation (`docs/`, `README.md`, `CONTRIBUTING.md`)
- [ ] Project template (`projects/_template/`)
- [ ] CI / dev loop (`prek`, workflows, validators)
- [ ] Other:
## Test plan
- [X] `prek run --all-files` passes
- [ ] For Python packages touched: `uv run pytest` / `ruff check` / `mypy`
passes
- [ ] For Groovy bridges touched: command-line invocation tested end-to-end
- [X] For skill changes: eval suite passes for the affected skill
(`PYTHONPATH=tools/skill-evals/src python3 -m skill_evals.runner
tools/skill-evals/evals/<skill>/`)
- [ ] For skill *behaviour* changes: a new or updated eval fixture is
included in this PR
(a regression test for the bug fixed / the behaviour added — see
CONTRIBUTING.md)
- [ ] Other:
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]