Re: [DISCUSS] AIP-102: Reproducible Benchmarks and Conformance for AI Capabilities

Dev-iL Sun, 22 Feb 2026 20:45:38 -0800

Hi Alex,
Thank you for the interesting suggestion!

I have several questions about practical aspects of these evaluations:
1. Who is supposed to run them and at what stage? It sounds to me that for
maximum benefit this should run as part of CI, at least when LLM-facing
features are modified. If so, where are we going to get API credits to run
this?
2. As you mentioned, this is a rapidly developing space with new models
popping up on a regular basis. What is the benefit of knowing that evals
passed at a given point in time with a given model? Not all users have
access to all LLMs, and results obtained on one model don't tell us how
another model would behave. Say an AIP was drafted and evaluated on ChatGPT 5.3,
and by the time it reaches the user, the latest version might be 6.1. Do we
expect users to use an older model just for interacting with Airflow? Do we
expect users to submit certs if they tried the code on new models?


Would appreciate your clarifications!
Dev-iL

P.S.
The spec link is broken.

On Mon, 23 Feb 2026, 4:47 Alex, <[email protected]> wrote:

> Hi all,
>
> I'd like to propose an AIP [1] to establish a shared benchmark and
> conformance standard for AI capabilities in Airflow. Sharing this to gather
> feedback and rough consensus.
>
> The core idea is to give AI capabilities an exam. Define what "qualified"
> looks like for a given role - DAG Operator, DAG Code Generator, DAG Fixer,
> Migration Agent - and let anyone reproduce that result. Once the exam
> exists, the conversation about whether a feature is ready becomes concrete.
> Each AI-related AIP can declare which roles it targets and ship a
> corresponding exam suite, without depending on another AIP's roadmap. The
> same goes for providers or anyone experimenting in this fast-moving space.
> It provides a common language for AI evaluation (often referred to as AI
> Evals).
>
> *A useful example is already running against a real Airflow localization
> skill,* with a viewable cert here [2]. A simpler non-Airflow example is
> also available [3]. The exam showcases the pattern that allows us to
> produce machine-readable and human-comparable outputs for easy
> collaboration, regardless of the aspects of the black box (Agent, Model,
> Skill, MCP).
>
> I gave a lightning talk on this topic at Airflow Summit 2025 [4] and have
> been building toward it since: evals-playground [5] holds working examples
> against Airflow AI capabilities, ai-evals.io [6] explains the pattern to
> people from different backgrounds, eval-ception [7] demonstrates it
> hands-on, and the exam spec [8] is taking shape at ai-evals.io/spec.
>
> Let me know if you have any questions.
>
> - [1] Draft proposal -
> https://docs.google.com/document/d/1KvEX9zdq9-NMfSnl_qvET5SgSeujF-Zz/
> - [2] Cert viewer - Airflow localizer exam:
>
> https://ai-evals.io/certify-your-agent/viewer.html?cert=https://raw.githubusercontent.com/Alexhans/eval-ception/main/certs/airflow-localizer-es/airflow-es-localizer-exam-pydantic-agent.cert.json
> - [3]
>
> https://ai-evals.io/certify-your-agent/viewer.html?cert=https://raw.githubusercontent.com/Alexhans/eval-ception/main/certs/simple-exam/simple-exam.ollama-agent.cert.json
> - [4] Toward a Shared Vision for LLM Evaluation in the Airflow Ecosystem -
> Airflow Summit 2025 - Lightning Talk (5 min) -
>
> https://alexhans.github.io/posts/talk.toward-a-shared-vision-of-llm-evals-in-airflow-ecosystem.html
> - [5] https://github.com/Alexhans/evals-playground
> - [6] https://ai-evals.io/
> - [7] https://github.com/Alexhans/eval-ception
> - [8] https://ai-evals.io/spec/cert/v0.1.0/schema.json
>
> Alex Guglielmone Nemi
>
