Thanks Dev-iL, Ephraim, and Jarek for the thoughtful questions. A few
clarifications.

To start with the practical concerns around CI and cost: this proposal does
not require running paid LLM calls in Airflow core CI. Exams can run
locally with open models (the examples I shared used llama.cpp-like servers
with qwen-coder:8b, gpt-oss:20b, and deepseek-r1:14b), manually while
someone is iterating on a feature, or in a provider's own infrastructure.
The AIP defines a reproducible format; where and how people choose to run
it is up to them.
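
To make "run it locally" concrete, here is roughly what a single exam task
looks like against a local server. This is a sketch only, not the actual
runner from evals-playground; it assumes an OpenAI-compatible endpoint
(llama.cpp's server and Ollama both expose one) and whatever model tag you
happen to have pulled locally:

from openai import OpenAI

# Sketch, not the real runner: any OpenAI-compatible local endpoint works
# (llama.cpp server, Ollama, ...); no paid API calls are involved.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")

def answer(prompt: str, model: str = "qwen-coder:8b") -> str:
    # One exam task = one prompt; the grader only ever sees this string.
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content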

On the value of results over time and across models: a cert is not meant to
be a permanent stamp of approval. It records that a specific setup passed
specific user stories at a specific point in time. If the model underneath
changes and something breaks, that's exactly what you'd want to catch. The
goal is to make that kind of change visible rather than something you
discover later. If we ever ship specific tools ourselves, we should hold
them to the same standard in system tests. To be clear, if we were at the
stage where we shipped an embedded model (not saying we are), then the
expectation should absolutely be stability. How to get there involves real
design decisions: what's the blast radius if the agent gets it wrong, what
does the exam cover, what are the internal retry and checking mechanisms,
and how much variance testing (multiple runs, Monte Carlo sampling of
outputs) is appropriate for that specific case. Those conversations are
probably easier to have once an exam exists and there are reference
"agents" people want to run against it.
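
The variance question in particular is easy to make mechanical: re-run the
same task several times and report a pass rate instead of a single boolean.
A minimal sketch, with hypothetical names:

def pass_rate(task, run_once, n: int = 5) -> float:
    # run_once(task) -> bool is whatever graded check the exam defines;
    # n repeated runs turn a flaky pass/fail into a measurable rate.
    passes = sum(1 for _ in range(n) if run_once(task))
    return passes / n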

We can look at the concrete example of the agent skill that translates
Airflow UI strings to Spanish. Contributor A wants to improve the skill
definition to be more token-efficient. Contributor B sees a 20% gap of
untranslated strings and decides to run the skill with --add-missing, but
doesn't have Claude like other contributors; they only have local models.
Are those good enough? Do they really need to pay for Claude or Codex, or
can they use their own agent backed by a local 27B model? Without a shared
set of expectations, they have to test
it manually and might miss something. But if there's already a small exam
that captures the key behaviors (does it preserve template variables, does
it handle plurals correctly, does it get the tone right), they can run it
themselves and know where they stand. It's not about ranking models or
proving anything to anyone else; it's about giving yourself and your
collaborators a quick way to verify that things still work after a change.
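
For the template-variable behavior, the check itself can stay tiny. A
sketch, assuming placeholders look like {{ name }} (the actual syntax in
Airflow's i18n files may differ):

import re

PLACEHOLDER = re.compile(r"\{\{\s*[\w.]+\s*\}\}")

def preserves_placeholders(source: str, translated: str) -> bool:
    # Pass only if every placeholder in the English string survives,
    # unchanged, in the Spanish one.
    return set(PLACEHOLDER.findall(source)) <= set(
        PLACEHOLDER.findall(translated)
    )

assert preserves_placeholders(
    "Found {{ count }} failed tasks in {{ dag_id }}",
    "Se encontraron {{ count }} tareas fallidas en {{ dag_id }}",
)

Plurals and tone would need a different grader (an LLM judge or a human
spot check), but they can live in the same exam file.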

If the agent skill example seems too simple, the same pattern applies to
code generation, API calls from natural language, or migrations. The
hardest part in all of these is describing the expected behavior in a way
that's testable without the tests becoming a maintenance burden themselves.

I think the part that matters most here isn't really about agents or LLMs
specifically. It's about codifying user stories in a testable form. As more
interactions with Airflow happen through agents and AI-powered tools,
having those expectations written down and runnable becomes a way to test
changes to the platform itself. If an API changes, does the experience
still hold? If a prompt strategy changes, does it reduce cost without
breaking behavior? Without testable user stories, those questions stay
anecdotal. With them, you can modify the right parts of the system and
measure the impact, regardless of what agent or model someone happens to be
using.
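
Concretely, a testable user story doesn't need to be more than a small
record that ties the story, its checks, and the passing bar together. The
shape below is illustrative only, not the spec's actual format:

USER_STORY = {
    "id": "dag-fixer/import-error",
    "story": "As an operator, I paste an import error and get a working fix.",
    "checks": ["patched DAG file parses", "no unrelated lines changed"],
    "pass_bar": "3 of 3 runs",
}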

On gaming: I don't think it's a real risk here, because the benchmark is
for the person running it, not for someone else to trust blindly. You're
testing your own setup. If someone designs an exam poorly, the answer is to
improve the exam, the same way you'd improve a test suite that doesn't
catch real bugs. The accountability is always on the user unless we're
actually shipping an agent ourselves.

Jarek, the ecosystem path makes a lot of sense and I'm happy to keep
building and exploring there. I do think this discussion is really valuable
regardless of where the work lives, because the questions you're all
raising (cost, stability, the meaning of words like "conformance" in this
space) are exactly the ones the community needs to work through as AI
capabilities land in Airflow. I'd rather we have this conversation early,
in case it helps align the authors of different AI-related AIPs and
ecosystem tools around shared intent and reproducibility.

P.S. Dev-iL, thanks for flagging the spec link; I had botched it in the
message. Here it is: https://ai-evals.io/spec/cert/v0.1.0/schema.json
(I'll iterate on it; I've been looking at comparable specs/standards).

Alex Guglielmone Nemi

On Mon, Feb 23, 2026 at 8:32 AM Jarek Potiuk <[email protected]> wrote:

> While I think it's a good idea, I also think it would be great to have
> a third-party run such evaluations on your site. You could ask people
> to contribute there - rather than having an official AIP and something
> "in-airflow" - simply because we do not know what value it can bring
> to the community, what kind of burden it will introduce - and the
> questions of Dev-IL regarding how to run it, how to block CI and the
> value and "official status" of having "certificates" officially by the
> Airflow community are probably the most important questions that need
> answering.
>
> I think - since this is pretty orthogonal to Airflow's regular work
> and features of Airflow, I would rather (in your case) focus on
> running such evaluations and demonstrating their value outside of the
> core Airflow efforts. For years, we have been preaching "what we can
> remove from Airflow run by the community here rather than what we can
> add." This immediately looks like you can run outside for a while -
> and demonstrate how it adds value - both to Airflow and your
> `ai-evals.io` site and schema.  There is no clear "standard" in
> evaluation frameworks. Many exist, and perhaps more established
> standards will arise eventually. That would be a great moment for us
> to adopt one, but until then, I think it's more of a "let's wait and
> see" from the community standpoint, and you could start working on
> getting the standards - and demonstrate how it might be good for
> Airflow, then your solution might be a good candidate to build the
> standard around.
>
> I think that can easily be part of our ecosystem
> https://airflow.apache.org/ecosystem/  and you are absolutely welcome
> to share any progress here and ask people to participate, but I think
> it's quite a bit too early for the community to "adopt" it.
>
> J.
>
> On Mon, Feb 23, 2026 at 9:09 AM Ephraim Anierobi
> <[email protected]> wrote:
> >
> > Hi Alex,
> >
> > I share the same concerns as Dev-iL regarding running it in the CI(which
> is
> > the best place to run this kind of test/exam) and the cost.
> >
> > For the AIP, the two other big risks I see:
> >
> > 1. You mentioned that pass/fail won’t be stable as the downsides to the
> AIP
> > and that's true because LLM answers can change from run to run, and also
> > change when the model or version changes. So a “cert/conformance” label
> > could be misleading and likely cause arguments in AIPs. Debates could
> shift
> > from implementation to exam semantics.
> >
> > 2. It seems like the system can be gamed. Scores can shift because the
> > grader changes, and people may end up optimizing for the grader instead
> of
> > quality. What do you think?
> >
> > Best,
> > Ephraim
> >
> >
> >
> > On Mon, 23 Feb 2026 at 05:47, Dev-iL <[email protected]> wrote:
> >
> > > Hi Alex,
> > > Thank you for the interesting suggestion!
> > >
> > > I have several questions about practical aspects of these evaluations:
> > > 1. Who is supposed to run them and at what stage? It sounds to me that
> for
> > > maximum benefit this should run as part of CI, at least when LLM-facing
> > > features are modified. If so, where are we going to get API credits to
> run
> > > this?
> > > 2. As you mentioned, this is a rapidly developing space with new models
> > > popping up on a regular basis. What is the benefit of knowing that
> evals
> > > passed at a given point in time with a given model? Not all users have
> > > access to all LLMs, and results obtained on one model don't tell us
> another
> > > model would behave. Say an AIP was drafted and evaluated on ChatGPT
> 5.3,
> > > and by the time it reaches the user, the latest version might be 6.1.
> Do we
> > > expect users to use an older model just for interacting with Airflow?
> Do we
> > > expect users to submit certs if they tried the code on new models?
> > >
> > > Would appreciate your clarifications!
> > > Dev-iL
> > >
> > > P.S.
> > > The spec link is broken.
> > >
> > > On Mon, 23 Feb 2026, 4:47 Alex, <[email protected]> wrote:
> > >
> > > > Hi all,
> > > >
> > > > I'd like to propose an AIP [1] to establish a shared benchmark and
> > > > conformance standard for AI capabilities in Airflow. Sharing this to
> > > gather
> > > > feedback and rough consensus.
> > > >
> > > > The core idea is to give AI capabilities an exam. Define what
> "qualified"
> > > > looks like for a given role - DAG Operator, DAG Code Generator, DAG
> > > Fixer,
> > > > Migration Agent - and let anyone reproduce that result. Once the exam
> > > > exists, the conversation about whether a feature is ready becomes
> > > concrete.
> > > > Each AI-related AIP can declare which roles it targets and ship a
> > > > corresponding exam suite, without depending on another AIP's roadmap.
> > > The
> > > > same goes for providers or anyone experimenting in this fast moving
> > > space.
> > > > It provides a common language for AI evaluation (often referred to
> as AI
> > > > Evals).
> > > >
> > > > *A useful example is already running against a real Airflow
> localization
> > > > skill,* with a viewable cert here [2]. A simpler non-Airflow example
> is
> > > > also available [3]. The exam showcases the pattern that allows us to
> > > > produce machine readable and human comparable outputs for easy
> > > > collaboration, regardless of the aspects of the black box (Agent,
> Model,
> > > > Skill, MCP).
> > > >
> > > > I gave a lightning talk on this topic at Airflow Summit 2025 [4] and
> have
> > > > been building toward it since: evals-playground [5] holds working
> > > examples
> > > > against Airflow AI capabilities, ai-evals.io [6] explains the
> pattern to
> > > > people from different backgrounds, eval-ception [7] demonstrates it
> > > > hands-on, and the exam spec [8] is taking shape at ai-evals.io/spec.
> > > >
> > > > Let me know if you have any questions.
> > > >
> > > > - [1] Draft proposal -
> > > >
> https://docs.google.com/document/d/1KvEX9zdq9-NMfSnl_qvET5SgSeujF-Zz/
> > > > - [2] Cert viewer - Airflow localizer exam:
> > > >
> > > >
> > >
> https://ai-evals.io/certify-your-agent/viewer.html?cert=https://raw.githubusercontent.com/Alexhans/eval-ception/main/certs/airflow-localizer-es/airflow-es-localizer-exam-pydantic-agent.cert.json
> > > > - [3]
> > > >
> > > >
> > >
> https://ai-evals.io/certify-your-agent/viewer.html?cert=https://raw.githubusercontent.com/Alexhans/eval-ception/main/certs/simple-exam/simple-exam.ollama-agent.cert.json
> > > > - [4] Toward a Shared Vision for LLM Evaluation in the Airflow
> Ecosystem
> > > -
> > > > Airflow Summit 2025 - Lightning Talk (5 min) -
> > > >
> > > >
> > >
> https://alexhans.github.io/posts/talk.toward-a-shared-vision-of-llm-evals-in-airflow-ecosystem.html
> > > > - [5] https://github.com/Alexhans/evals-playground
> > > > - [6] https://ai-evals.io/
> > > > - [7] https://github.com/Alexhans/eval-ception
> > > > - [8] https://ai-evals.io/spec/cert/v0.1.0/schema.json
> > > >
> > > > Alex Guglielmone Nemi
> > > >
> > >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
>
