[PR] skill-evals: field-aware grading, intersection key matching, wrap non-JSON output [airflow-steward]

via GitHub Thu, 28 May 2026 23:52:40 -0700


justinmclean opened a new pull request, #370:
URL: https://github.com/apache/airflow-steward/pull/370


   ## Summary
   
   The exact-equality comparator in `tools/skill-evals` was rejecting
   candidate answers that carried the right decision but differed in
   wording, and was failing whole cases on schema details the model
   chose not to emit. This change rewrites comparison around three
   ideas, all default-on with `--exact` as the opt-out.
   
   1. Field-aware grading. Decision fields (booleans, enums, counts,
      ordering, IDs) stay on exact equality. Free-text fields
      (rationale, reason, drop_reason, blockers, notes, summary,
      explanation, details, description) are sent to a cheap judge
      model with a fixed rubric and parsed back as
      `{match, reason}`. Default grader is `claude -p --model haiku`;
      override with `--grader-cli "<command>"`. Per-fixtures-dir
      `grading-schema.json` overrides the prose-field set.
   
   2. Batched grader call. All prose-field mismatches for a single
      case are collected and sent in one rubric prompt, so a case
      with N wording diffs costs one Haiku call, not N. Decision-
      field failures skip the grader entirely.
   
   3. Intersection-only key matching. Only keys present in both
      actual and expected are asserted. Extra keys in the model
      output are ignored; keys declared in expected that the model
      did not emit are skipped, not failed. expected.json describes
      what the answer should say where it speaks, not a required-
      keys schema. Suite authors should keep expected.json focused
      on the keys that carry the eval's signal.
   
   Non-JSON model output and non-zero CLI exits are now wrapped
   silently as `{raw_output, ...}` and fed to the same intersection
   comparator. Cases that produce a prose refusal still pass unless
   expected.json declares a `raw_output` key. Statuses remain
   PASS / FAIL / ERROR / MANUAL; the wrap is an implementation
   detail. `--exact` mode preserves the previous strict behavior end
   to end.
   
   Test coverage: 40 cases across the new helpers
   (`collect_diffs`, `batch_grade_prose_fields`, `compare_with_grader`)
   and the end-to-end `--cli` + `--grader-cli` path, plus three small
   grader stub scripts under `tests/`. README updated.
   
   ## Type of change
   
   <!-- Tick all that apply. -->
   
   - [X] Skill change (`.claude/skills/<name>/`) — eval fixtures updated below
   - [ ] Tool / bridge contract (`tools/<system>/*.md`)
   - [ ] Python package (`tools/*/` with `pyproject.toml`)
   - [ ] Groovy reference impl
   - [ ] Cross-cutting (RFC, AGENTS.md, sandbox, privacy-LLM)
   - [ ] Documentation (`docs/`, `README.md`, `CONTRIBUTING.md`)
   - [ ] Project template (`projects/_template/`)
   - [ ] CI / dev loop (`prek`, workflows, validators)
   - [ ] Other:
   
   ## Test plan
   
   - [X] `prek run --all-files` passes
   - [ ] For Python packages touched: `uv run pytest` / `ruff check` / `mypy` 
passes
   - [ ] For Groovy bridges touched: command-line invocation tested end-to-end
   - [X] For skill changes: eval suite passes for the affected skill
         (`PYTHONPATH=tools/skill-evals/src python3 -m skill_evals.runner 
tools/skill-evals/evals/<skill>/`)
   - [ ] For skill *behaviour* changes: a new or updated eval fixture is 
included in this PR
         (a regression test for the bug fixed / the behaviour added — see 
CONTRIBUTING.md)
   - [ ] Other:
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[PR] skill-evals: field-aware grading, intersection key matching, wrap non-JSON output [airflow-steward]

Reply via email to