Re: [PR] An idea for testing fixture-based eval harness for skill steps [DRAFT] [airflow-steward]

via GitHub Fri, 15 May 2026 20:14:19 -0700


justinmclean commented on PR #158:
URL: https://github.com/apache/airflow-steward/pull/158#issuecomment-4465343784


   I updated and expanded on this for the cover two skills.
   
   runner.py now assembles the system prompt by extracting the relevant section 
directly from SKILL.md at run time (via a step-config.json pointer in each 
fixtures directory), rather than relying on a static copy. A change to the 
skill rules is immediately visible in the prompt — if it would cause the model 
to produce different output, the test fails.
   
   External tool calls (GitHub CLI, Gmail MCP, canned-response scan) are never 
executed during evals. Their outputs are pre-rendered as structured text in 
each case's report.md and injected into the user turn as mock data. This keeps 
inputs fully deterministic and requires no network access or API credentials to 
run.
   
   Removed the hardcoded SYSTEM_PROMPT fallback in runner.py. 
   
   65/65 cases pass (32 import, 33 triage).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] An idea for testing fixture-based eval harness for skill steps [DRAFT] [airflow-steward]

Reply via email to