potiuk commented on code in PR #158:
URL: https://github.com/apache/airflow-steward/pull/158#discussion_r3252772160


##########
tools/skill-evals/src/skill_evals/runner.py:
##########
@@ -0,0 +1,251 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+"""
+Skill eval runner.
+
+Loads fixture data from an eval case directory and prints the system prompt
+and user prompt for each case so you can paste them into any model.
+Compare the model's response against expected.json to verify correctness.
+
+Usage:
+    # Print prompts for all cases under a fixtures directory
+    uv run --project tools/skill-evals skill-eval \\
+        evals/security-issue-import/step-2a-semantic-sweep/fixtures/
+
+    # Print prompt for a single case
+    uv run --project tools/skill-evals skill-eval \\
+        
evals/security-issue-import/step-2a-semantic-sweep/fixtures/case-1-clear-duplicate
+"""
+
+from __future__ import annotations
+
+import argparse
+import json
+import sys
+from pathlib import Path
+
+
+# ---------------------------------------------------------------------------
+# Prompt construction
+# ---------------------------------------------------------------------------
+
+USER_PROMPT_TEMPLATE = """\
+## Existing open trackers (corpus)
+
+{corpus}
+
+## Reporter roster (existing trackers mapped to reporter email)
+
+{roster}
+
+## Incoming report
+
+{report}
+
+Apply the semantic sweep and reporter-identity check. Return JSON only.
+"""
+
+
+def build_corpus_text(corpus: list[dict]) -> str:
+    lines = []
+    for item in corpus:
+        lines.append(f"#{item['number']} | {item['title']!r}")
+        lines.append(f"Body (first 300 chars): {item['body']}")
+        lines.append("")
+    return "\n".join(lines)
+
+
+def build_roster_text(roster: dict[str, str]) -> str:
+    if not roster:
+        return "(none)"
+    return "\n".join(f"#{num}: {email}" for num, email in roster.items())
+
+
+def find_repo_root(start: Path) -> Path:
+    """Walk up the directory tree until a .git directory is found."""
+    p = start.resolve()
+    while p != p.parent:
+        if (p / ".git").exists():
+            return p
+        p = p.parent
+    raise RuntimeError(f"Could not find repo root (.git) from {start}")
+
+
+def extract_skill_section(skill_md_path: Path, heading: str) -> str:
+    """Return the section of a SKILL.md that begins with *heading*.
+
+    Extraction ends at the next heading of the same or higher level, or at
+    the end of the file.  Raises ValueError if the heading is not found.
+    """
+    text = skill_md_path.read_text()
+    lines = text.split("\n")
+    heading_stripped = heading.rstrip()
+    heading_level = len(heading) - len(heading.lstrip("#"))
+
+    start = next(
+        (i for i, line in enumerate(lines) if line.rstrip() == 
heading_stripped),
+        None,
+    )
+    if start is None:
+        raise ValueError(f"Heading {heading!r} not found in {skill_md_path}")
+
+    end = len(lines)
+    for i in range(start + 1, len(lines)):
+        depth = len(lines[i]) - len(lines[i].lstrip("#"))
+        if depth > 0 and depth <= heading_level and 
lines[i].lstrip("#").startswith(" "):
+            end = i
+            break
+
+    return "\n".join(lines[start:end]).rstrip()
+
+
+def load_step_config(fixtures_dir: Path) -> tuple[str, str]:
+    """Return (system_prompt, user_prompt_template) for the given fixtures dir.
+
+    Resolution order:
+    1. ``step-config.json`` — extracts the step section live from the skill's
+       SKILL.md, then appends ``output-spec.md`` if present.  This is the
+       preferred path: tests automatically exercise the current skill text.
+    2. ``system-prompt.md`` — a manually maintained prompt used by triage 
steps.
+
+    Raises FileNotFoundError if neither file is present.
+    """
+    user_tmpl_path = fixtures_dir / "user-prompt-template.md"
+    user_prompt_template = user_tmpl_path.read_text() if 
user_tmpl_path.exists() else USER_PROMPT_TEMPLATE
+
+    # 1. step-config.json → live extraction from SKILL.md
+    config_path = fixtures_dir / "step-config.json"
+    if config_path.exists():
+        config = json.loads(config_path.read_text())
+        repo_root = find_repo_root(fixtures_dir)
+        skill_md_path = repo_root / config["skill_md"]
+        section = extract_skill_section(skill_md_path, config["step_heading"])
+        output_spec_path = fixtures_dir / "output-spec.md"
+        if output_spec_path.exists():
+            section += "\n\n" + output_spec_path.read_text()
+        return section, user_prompt_template
+
+    # 2. system-prompt.md → manually maintained (triage steps)
+    sys_prompt_path = fixtures_dir / "system-prompt.md"
+    if sys_prompt_path.exists():
+        return sys_prompt_path.read_text(), user_prompt_template
+
+    raise FileNotFoundError(
+        f"{fixtures_dir} has neither step-config.json nor system-prompt.md. "
+        "Add a step-config.json pointing at the relevant SKILL.md section."
+    )
+
+
+# ---------------------------------------------------------------------------
+# Case loading
+# ---------------------------------------------------------------------------
+
+
+def load_case(case_dir: Path) -> tuple[list[dict], dict, str, dict]:
+    """Return (corpus, roster, report_text, expected).
+
+    ``corpus.json`` is optional — steps that do not need a tracker corpus
+    (e.g. Step 3 classification) simply omit it and get an empty list.
+    """
+    fixtures_dir = case_dir.parent
+    corpus_path = fixtures_dir / "corpus.json"
+    roster_path = fixtures_dir / "reporter-roster.json"
+
+    corpus = json.loads(corpus_path.read_text()) if corpus_path.exists() else 
[]
+    roster = json.loads(roster_path.read_text()) if roster_path.exists() else 
{}
+    report = (case_dir / "report.md").read_text()
+    expected = json.loads((case_dir / "expected.json").read_text())
+    return corpus, roster, report, expected
+
+
+# ---------------------------------------------------------------------------
+# CLI
+# ---------------------------------------------------------------------------
+
+
+def find_cases(path: Path) -> list[tuple[Path, Path]]:
+    """Return (case_dir, fixtures_dir) pairs under path.
+
+    Handles three levels of granularity:
+      - single case dir     (contains report.md)
+      - fixtures dir        (contains case-* subdirs)
+      - skill/step dir      (contains fixtures/ subdirs recursively)
+    """
+    if (path / "report.md").exists():
+        return [(path, path.parent)]
+    # Direct fixtures dir — all cases share the same fixtures dir.
+    direct = sorted(p for p in path.iterdir() if p.is_dir() and (p / 
"report.md").exists())
+    if direct:
+        return [(p, path) for p in direct]
+    # Recursive search — e.g. skill dir spanning multiple steps.
+    results = []
+    for fixtures_dir in sorted(path.rglob("fixtures")):

Review Comment:
   Minor: `rglob("fixtures")` matches every `fixtures` directory under `path`. 
Today the layout has one level, but a stray nested `fixtures/` (someone 
copy-pasting a sub-case tree) would be picked up as an extra suite — its cases 
would run against whatever step config the nested dir happens to contain. 
Worth either restricting the search depth or validating each matched 
`fixtures` dir before collecting cases from it.



##########
tools/skill-evals/src/skill_evals/runner.py:
##########
@@ -0,0 +1,251 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+"""
+Skill eval runner.
+
+Loads fixture data from an eval case directory and prints the system prompt
+and user prompt for each case so you can paste them into any model.
+Compare the model's response against expected.json to verify correctness.
+
+Usage:
+    # Print prompts for all cases under a fixtures directory
+    uv run --project tools/skill-evals skill-eval \\
+        evals/security-issue-import/step-2a-semantic-sweep/fixtures/
+
+    # Print prompt for a single case
+    uv run --project tools/skill-evals skill-eval \\
+        
evals/security-issue-import/step-2a-semantic-sweep/fixtures/case-1-clear-duplicate
+"""
+
+from __future__ import annotations
+
+import argparse
+import json
+import sys
+from pathlib import Path
+
+
+# ---------------------------------------------------------------------------
+# Prompt construction
+# ---------------------------------------------------------------------------
+
+USER_PROMPT_TEMPLATE = """\
+## Existing open trackers (corpus)
+
+{corpus}
+
+## Reporter roster (existing trackers mapped to reporter email)
+
+{roster}
+
+## Incoming report
+
+{report}
+
+Apply the semantic sweep and reporter-identity check. Return JSON only.
+"""
+
+
+def build_corpus_text(corpus: list[dict]) -> str:
+    lines = []
+    for item in corpus:
+        lines.append(f"#{item['number']} | {item['title']!r}")
+        lines.append(f"Body (first 300 chars): {item['body']}")
+        lines.append("")
+    return "\n".join(lines)
+
+
+def build_roster_text(roster: dict[str, str]) -> str:
+    if not roster:
+        return "(none)"
+    return "\n".join(f"#{num}: {email}" for num, email in roster.items())
+
+
+def find_repo_root(start: Path) -> Path:
+    """Walk up the directory tree until a .git directory is found."""
+    p = start.resolve()
+    while p != p.parent:
+        if (p / ".git").exists():
+            return p
+        p = p.parent
+    raise RuntimeError(f"Could not find repo root (.git) from {start}")
+
+
+def extract_skill_section(skill_md_path: Path, heading: str) -> str:
+    """Return the section of a SKILL.md that begins with *heading*.
+
+    Extraction ends at the next heading of the same or higher level, or at
+    the end of the file.  Raises ValueError if the heading is not found.
+    """
+    text = skill_md_path.read_text()
+    lines = text.split("\n")
+    heading_stripped = heading.rstrip()
+    heading_level = len(heading) - len(heading.lstrip("#"))
+
+    start = next(
+        (i for i, line in enumerate(lines) if line.rstrip() == 
heading_stripped),
+        None,
+    )
+    if start is None:
+        raise ValueError(f"Heading {heading!r} not found in {skill_md_path}")
+
+    end = len(lines)
+    for i in range(start + 1, len(lines)):
+        depth = len(lines[i]) - len(lines[i].lstrip("#"))
+        if depth > 0 and depth <= heading_level and 
lines[i].lstrip("#").startswith(" "):

Review Comment:
   This depth check has a confirmed bug. `lstrip("#")` strips *all* leading 
`#`s, so a python/shell comment like `# Leading passes twice…` satisfies 
`depth=1, lstrip("#").startswith(" ")` and is misread as a heading that 
terminates the section.
   
   Concrete repro on this branch: 
`.claude/skills/security-cve-allocate/SKILL.md` line 309 (`# Leading passes 
twice — strip order reveals nested tags.`) falls inside Step 2 (heading at line 
267). When `step-2-title-normalize/fixtures/step-config.json` runs through 
`load_step_config`, the extracted section is silently truncated at 309 — 
`case-4-over-strip-warning` is very likely passing against a half-prompt rather 
than the real Step 2 text.
   
   Fix: track fenced-code-block state and use `re.match(r"^#{1,6} ", line)` for 
"this is a markdown heading" instead of the `lstrip` trick.



##########
tools/skill-evals/README.md:
##########
@@ -0,0 +1,86 @@
+# skill-evals
+
+Behavioral eval harness for Apache Steward skills. Each eval suite tests a 
skill pipeline step by step, verifying that the model produces the correct 
structured JSON output for a fixed set of fixture cases.
+
+Nine suites are currently implemented (206 cases total):
+
+- **security-issue-import** — 32 cases across 8 steps
+- **security-issue-triage** — 33 cases across 9 steps
+- **security-issue-deduplicate** — 18 cases across 6 steps (steps 1, 2, 3, 4, 
5, 6)
+- **security-cve-allocate** — 20 cases across 6 steps (steps 1, 2, 3, 4, 5, 7)
+- **security-issue-sync** — 25 cases across 7 steps (1f, 2a, 2b, 2c, 3, 6, 
guardrails)
+- **security-issue-fix** — 30 cases across 10 steps (2, 4a, 4b, 4c, 4d, 4e, 
4f, 4g, 5, 10)
+- **security-issue-invalidate** — 24 cases across 9 steps (2, 3, 4, 5a, 5b, 
5d, 5e, 5f, 7)
+- **security-issue-import-from-md** — 11 cases across 4 steps (1, 2, 4, 6)
+- **security-issue-import-from-pr** — 13 cases across 4 steps (2, 3, 6, 8)
+
+## Run
+
+```bash
+# All cases for a skill
+uv run --project tools/skill-evals skill-eval \

Review Comment:
   This invocation needs a `pyproject.toml` under `tools/skill-evals/` 
declaring the src-layout `skill_evals` package and the `skill-eval` console 
script. None is shipped in this PR — `uv run --project tools/skill-evals 
skill-eval ...` will fail with "package not found" today. Same applies to the 
`python3 -m skill_evals.eval_runner ...` invocation in `eval_runner.py`'s 
docstring.



##########
tools/skill-evals/src/skill_evals/runner.py:
##########
@@ -0,0 +1,251 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+"""
+Skill eval runner.
+
+Loads fixture data from an eval case directory and prints the system prompt
+and user prompt for each case so you can paste them into any model.
+Compare the model's response against expected.json to verify correctness.
+
+Usage:
+    # Print prompts for all cases under a fixtures directory
+    uv run --project tools/skill-evals skill-eval \\
+        evals/security-issue-import/step-2a-semantic-sweep/fixtures/
+
+    # Print prompt for a single case
+    uv run --project tools/skill-evals skill-eval \\
+        
evals/security-issue-import/step-2a-semantic-sweep/fixtures/case-1-clear-duplicate
+"""
+
+from __future__ import annotations
+
+import argparse
+import json
+import sys
+from pathlib import Path
+
+
+# ---------------------------------------------------------------------------
+# Prompt construction
+# ---------------------------------------------------------------------------
+
+USER_PROMPT_TEMPLATE = """\
+## Existing open trackers (corpus)
+
+{corpus}
+
+## Reporter roster (existing trackers mapped to reporter email)
+
+{roster}
+
+## Incoming report
+
+{report}
+
+Apply the semantic sweep and reporter-identity check. Return JSON only.
+"""
+
+
+def build_corpus_text(corpus: list[dict]) -> str:
+    lines = []
+    for item in corpus:
+        lines.append(f"#{item['number']} | {item['title']!r}")
+        lines.append(f"Body (first 300 chars): {item['body']}")
+        lines.append("")
+    return "\n".join(lines)
+
+
+def build_roster_text(roster: dict[str, str]) -> str:
+    if not roster:
+        return "(none)"
+    return "\n".join(f"#{num}: {email}" for num, email in roster.items())
+
+
+def find_repo_root(start: Path) -> Path:
+    """Walk up the directory tree until a .git directory is found."""
+    p = start.resolve()
+    while p != p.parent:
+        if (p / ".git").exists():
+            return p
+        p = p.parent
+    raise RuntimeError(f"Could not find repo root (.git) from {start}")
+
+
+def extract_skill_section(skill_md_path: Path, heading: str) -> str:
+    """Return the section of a SKILL.md that begins with *heading*.
+
+    Extraction ends at the next heading of the same or higher level, or at
+    the end of the file.  Raises ValueError if the heading is not found.
+    """
+    text = skill_md_path.read_text()
+    lines = text.split("\n")
+    heading_stripped = heading.rstrip()
+    heading_level = len(heading) - len(heading.lstrip("#"))
+
+    start = next(
+        (i for i, line in enumerate(lines) if line.rstrip() == 
heading_stripped),
+        None,
+    )
+    if start is None:
+        raise ValueError(f"Heading {heading!r} not found in {skill_md_path}")
+
+    end = len(lines)
+    for i in range(start + 1, len(lines)):
+        depth = len(lines[i]) - len(lines[i].lstrip("#"))
+        if depth > 0 and depth <= heading_level and 
lines[i].lstrip("#").startswith(" "):
+            end = i
+            break
+
+    return "\n".join(lines[start:end]).rstrip()
+
+
+def load_step_config(fixtures_dir: Path) -> tuple[str, str]:
+    """Return (system_prompt, user_prompt_template) for the given fixtures dir.
+
+    Resolution order:
+    1. ``step-config.json`` — extracts the step section live from the skill's
+       SKILL.md, then appends ``output-spec.md`` if present.  This is the
+       preferred path: tests automatically exercise the current skill text.
+    2. ``system-prompt.md`` — a manually maintained prompt used by triage 
steps.
+
+    Raises FileNotFoundError if neither file is present.
+    """
+    user_tmpl_path = fixtures_dir / "user-prompt-template.md"
+    user_prompt_template = user_tmpl_path.read_text() if 
user_tmpl_path.exists() else USER_PROMPT_TEMPLATE
+
+    # 1. step-config.json → live extraction from SKILL.md
+    config_path = fixtures_dir / "step-config.json"
+    if config_path.exists():
+        config = json.loads(config_path.read_text())
+        repo_root = find_repo_root(fixtures_dir)
+        skill_md_path = repo_root / config["skill_md"]
+        section = extract_skill_section(skill_md_path, config["step_heading"])
+        output_spec_path = fixtures_dir / "output-spec.md"
+        if output_spec_path.exists():
+            section += "\n\n" + output_spec_path.read_text()
+        return section, user_prompt_template
+
+    # 2. system-prompt.md → manually maintained (triage steps)
+    sys_prompt_path = fixtures_dir / "system-prompt.md"
+    if sys_prompt_path.exists():
+        return sys_prompt_path.read_text(), user_prompt_template
+
+    raise FileNotFoundError(
+        f"{fixtures_dir} has neither step-config.json nor system-prompt.md. "
+        "Add a step-config.json pointing at the relevant SKILL.md section."
+    )
+
+
+# ---------------------------------------------------------------------------
+# Case loading
+# ---------------------------------------------------------------------------
+
+
+def load_case(case_dir: Path) -> tuple[list[dict], dict, str, dict]:
+    """Return (corpus, roster, report_text, expected).
+
+    ``corpus.json`` is optional — steps that do not need a tracker corpus
+    (e.g. Step 3 classification) simply omit it and get an empty list.
+    """
+    fixtures_dir = case_dir.parent
+    corpus_path = fixtures_dir / "corpus.json"
+    roster_path = fixtures_dir / "reporter-roster.json"
+
+    corpus = json.loads(corpus_path.read_text()) if corpus_path.exists() else 
[]
+    roster = json.loads(roster_path.read_text()) if roster_path.exists() else 
{}
+    report = (case_dir / "report.md").read_text()
+    expected = json.loads((case_dir / "expected.json").read_text())
+    return corpus, roster, report, expected
+
+
+# ---------------------------------------------------------------------------
+# CLI
+# ---------------------------------------------------------------------------
+
+
+def find_cases(path: Path) -> list[tuple[Path, Path]]:
+    """Return (case_dir, fixtures_dir) pairs under path.
+
+    Handles three levels of granularity:
+      - single case dir     (contains report.md)
+      - fixtures dir        (contains case-* subdirs)
+      - skill/step dir      (contains fixtures/ subdirs recursively)
+    """
+    if (path / "report.md").exists():
+        return [(path, path.parent)]
+    # Direct fixtures dir — all cases share the same fixtures dir.
+    direct = sorted(p for p in path.iterdir() if p.is_dir() and (p / 
"report.md").exists())
+    if direct:
+        return [(p, path) for p in direct]
+    # Recursive search — e.g. skill dir spanning multiple steps.
+    results = []
+    for fixtures_dir in sorted(path.rglob("fixtures")):
+        if not fixtures_dir.is_dir():
+            continue
+        for case_dir in sorted(fixtures_dir.iterdir()):
+            if case_dir.is_dir() and (case_dir / "report.md").exists():
+                results.append((case_dir, fixtures_dir))
+    return results
+
+
+def main() -> None:
+    parser = argparse.ArgumentParser(
+        description="Print eval prompts for skill cases. Paste into any model 
and compare against expected.json."
+    )
+    parser.add_argument(
+        "path",
+        type=Path,
+        help="Path to a single case directory or a fixtures directory 
containing multiple cases.",
+    )
+    args = parser.parse_args()
+
+    cases = find_cases(args.path)
+    if not cases:
+        print(f"No eval cases found under {args.path}", file=sys.stderr)
+        sys.exit(1)
+
+    # Cache loaded step configs so we don't re-read prompts for every case in
+    # the same fixtures dir (common when running a whole skill at once).
+    _step_config_cache: dict[Path, tuple[str, str]] = {}
+
+    for case_dir, fixtures_dir in cases:
+        if fixtures_dir not in _step_config_cache:
+            _step_config_cache[fixtures_dir] = load_step_config(fixtures_dir)
+        system_prompt, user_prompt_template = _step_config_cache[fixtures_dir]
+
+        corpus, roster, report, expected = load_case(case_dir)
+        user_prompt = user_prompt_template.format(

Review Comment:
   `.format()` on a user-supplied template is fragile — if anyone ever puts a 
literal `{` in a future `user-prompt-template.md` (e.g. a JSON schema 
fragment), this raises `ValueError` for an unmatched brace, or `KeyError` for 
an unknown `{placeholder}`. Current templates are clean, but worth either 
documenting the contract in `user-prompt-template.md` files (escape braces as 
`{{`/`}}`) or switching to `string.Template`, which uses `$name` and leaves 
stray braces alone.



##########
tools/skill-evals/src/skill_evals/eval_runner.py:
##########
@@ -0,0 +1,195 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+"""
+Automated eval runner — calls Claude for each case and diffs against 
expected.json.
+
+Usage:
+    export ANTHROPIC_API_KEY=sk-ant-...
+    python3 -m skill_evals.eval_runner 
tools/skill-evals/evals/security-issue-sync/step-2c-next-step/fixtures/
+    python3 -m skill_evals.eval_runner tools/skill-evals/evals/          # all 
suites
+
+Options:
+    --model MODEL   Model to use (default: claude-haiku-4-5-20251001)
+    --filter GLOB   Only run cases whose path matches GLOB
+    --fail-fast     Stop on first failure
+"""
+
+from __future__ import annotations
+
+import argparse
+import json
+import re
+import sys
+import time
+from pathlib import Path
+
+import anthropic
+
+from skill_evals.runner import find_cases, load_case, load_step_config, 
build_corpus_text, build_roster_text
+
+# ---------------------------------------------------------------------------
+# JSON extraction
+# ---------------------------------------------------------------------------
+
+def extract_json(text: str) -> dict:
+    """Extract the first JSON object from a model response."""
+    # Strip markdown code fences
+    text = re.sub(r"^```(?:json)?\s*", "", text.strip(), flags=re.MULTILINE)

Review Comment:
   `re.MULTILINE` makes `^` and `$` match every line — every ``` triplet 
anywhere in the response gets stripped, not just the outer fence pair. The 
fallback `{...}` brace-counter then runs against a mangled string. Drop 
`MULTILINE` (strip only at string ends), or extract the first balanced 
```...``` block deliberately.



##########
tools/skill-evals/src/skill_evals/eval_runner.py:
##########
@@ -0,0 +1,195 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+"""
+Automated eval runner — calls Claude for each case and diffs against 
expected.json.
+
+Usage:
+    export ANTHROPIC_API_KEY=sk-ant-...
+    python3 -m skill_evals.eval_runner 
tools/skill-evals/evals/security-issue-sync/step-2c-next-step/fixtures/
+    python3 -m skill_evals.eval_runner tools/skill-evals/evals/          # all 
suites
+
+Options:
+    --model MODEL   Model to use (default: claude-haiku-4-5-20251001)
+    --filter GLOB   Only run cases whose path matches GLOB
+    --fail-fast     Stop on first failure
+"""
+
+from __future__ import annotations
+
+import argparse
+import json
+import re
+import sys
+import time
+from pathlib import Path
+
+import anthropic
+
+from skill_evals.runner import find_cases, load_case, load_step_config, 
build_corpus_text, build_roster_text
+
+# ---------------------------------------------------------------------------
+# JSON extraction
+# ---------------------------------------------------------------------------
+
+def extract_json(text: str) -> dict:
+    """Extract the first JSON object from a model response."""
+    # Strip markdown code fences
+    text = re.sub(r"^```(?:json)?\s*", "", text.strip(), flags=re.MULTILINE)
+    text = re.sub(r"```\s*$", "", text.strip(), flags=re.MULTILINE)
+    text = text.strip()
+
+    # Try parsing the whole thing first
+    try:
+        return json.loads(text)
+    except json.JSONDecodeError:
+        pass
+
+    # Find first {...} block
+    start = text.find("{")
+    if start == -1:
+        raise ValueError(f"No JSON object found in response:\n{text[:300]}")
+    depth = 0
+    for i, ch in enumerate(text[start:], start):
+        if ch == "{":
+            depth += 1
+        elif ch == "}":
+            depth -= 1
+            if depth == 0:
+                return json.loads(text[start : i + 1])
+    raise ValueError(f"Unclosed JSON object in response:\n{text[:300]}")
+
+
+# ---------------------------------------------------------------------------
+# Comparison
+# ---------------------------------------------------------------------------
+
+def compare(expected: dict, actual: dict) -> tuple[bool, list[str]]:
+    """
+    Compare expected vs actual JSON.
+
+    For boolean fields: assert equal.
+    For integer/string fields: assert equal.
+    For list fields: assert equal (exact match).
+    Returns (passed, list_of_failure_messages).
+    """
+    failures = []
+    for key, exp_val in expected.items():
+        if key not in actual:
+            failures.append(f"  missing key '{key}' (expected {exp_val!r})")
+            continue
+        act_val = actual[key]
+        if exp_val != act_val:
+            failures.append(f"  '{key}': expected {exp_val!r}, got 
{act_val!r}")
+    # Flag unexpected keys that are booleans and wrong (lenient on extras)

Review Comment:
   The README explains that composition steps use structural flags 
(`has_security_model_quote`, `has_bare_issue_numbers`, `mention_handles`) to 
avoid brittle prose match. But `compare` just does `exp_val != act_val` per key 
— meaning the *model* has to compute and emit those derived booleans about its 
own prose, not the runner. That's (a) not what the README implies and (b) a 
weaker test than the runner deriving the booleans from the response. Either 
clarify the README or move the derivation into the comparator.
   
   Also: the dead comment on this line (`Flag unexpected keys that are booleans 
and wrong (lenient on extras)`) describes behavior that isn't implemented — 
remove or implement.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to