Hi Alex, Thank you for raising the new AIP on evaluating AI capabilities in Airflow.
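As a purely illustrative aside on why the "can the agent distinguish host vs container context" criterion discussed in this thread is factual rather than a matter of opinion, a pass/fail grader for it could look roughly like the sketch below. All names here are hypothetical and not taken from any existing Airflow or ai-evals.io code:

```python
# Hypothetical grader sketch for the "distinguish host vs container
# context" criterion. Function names are illustrative only.
from pathlib import Path


def running_in_container() -> bool:
    """Heuristic ground truth: Docker creates /.dockerenv, and
    /proc/1/cgroup usually mentions the container runtime."""
    if Path("/.dockerenv").exists():
        return True
    try:
        cgroup = Path("/proc/1/cgroup").read_text()
    except OSError:
        return False
    return any(m in cgroup for m in ("docker", "containerd", "kubepods"))


def grade_environment_answer(agent_answer: str) -> bool:
    """Factual pass/fail: does the agent's claim match ground truth?"""
    claims_container = "container" in agent_answer.lower()
    return claims_container == running_in_container()
```

The point is only that such a check compares the agent's claim against observable ground truth, so two graders (human or machine) cannot reasonably disagree on the verdict the way they can about translation quality.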
I saw your reply in the GSoC thread, and I would like to continue the discussion here so we don’t mix it with the GSoC proposal thread.

> Have you already defined pass/fail criteria for those scenarios you mention,
> or is it more of a manual “looks right” check today?

I think the criteria from the GSoC thread (“can the agent correctly distinguish host vs container context, run the right commands in the right environment, and verify outcomes?”) would be sufficient, and they are clearer than the translation skill scenario. The boundary between pass and fail is also sharper, since “being able to distinguish the current environment and run different sets of commands” is factual, whereas translation can sometimes be a matter of opinion (for both humans and different AI models).

Additionally, comparing the agent skills we plan to contribute with the recently refactored AGENTS.md [1] would be a good candidate scenario for evaluation -- for example, to see whether the refactored AGENTS.md is already accurate enough to describe the expected behavior above, or whether we still need dedicated agent skills to achieve it.

Along with all the discussion above, I’m also a bit concerned about the long-term maintenance effort if we really want to integrate the evaluation system into the official Airflow repo. Alternatively, if we do decide to go in the official-repo direction, a small PoC with only one scenario could help demonstrate what the maintenance scope would look like.

Thanks!

[1] https://lists.apache.org/thread/szygtjs2t7mb6c2kk4zjygslxf0d4od8

Best,
Jason

On Tue, Feb 24, 2026 at 9:28 AM Alex <[email protected]> wrote:

> Thanks Dev-iL, Ephraim, and Jarek for the thoughtful questions. A few
> clarifications.
>
> To start with the practical concerns around CI and cost: this proposal does
> not require running paid LLM calls in Airflow core CI.
> Exams can run
> locally with open models (the examples I shared used llama.cpp-like servers
> and qwen-coder:8b, gpt-oss:20b and deepseek-r1:14b), manually when someone
> is iterating on a feature, or in a provider's own infrastructure. The AIP
> defines a reproducible format. Where and how people choose to run it is up
> to them.
>
> On the value of results over time and across models: a cert is not meant to
> be a permanent stamp of approval. It records that a specific setup passed
> specific user stories at a specific point in time. If the model underneath
> changes and something breaks, that's exactly what you'd want to catch. The
> goal is to make that kind of change visible rather than something you
> discover later. If we ever ship specific tools ourselves, we should hold
> them to the same standard in system tests. To be clear, if we were at the
> stage where we shipped an embedded model (not saying we are), then the
> expectation should absolutely be stability. How to get there involves real
> design decisions: what's the blast radius if the agent gets it wrong, what
> does the exam cover, what are the internal retry and checking mechanisms,
> and how much variance testing (multiple runs, Monte Carlo on outputs) is
> appropriate for that specific case. Those conversations are probably easier
> to have once there is an exam and there are reference "agents" that people
> want to try it with.
>
> We can look at the concrete example of the agent skill that translates
> Airflow UI strings to Spanish. Contributor A wants to improve the skill
> definition to be more token efficient. Contributor B sees there's a 20% gap
> of untranslated strings and decides to run it with --add-missing, but
> doesn't have Claude like other contributors. They only have local models.
> Are they good enough? Do they really need to pay for Claude or Codex, or
> can they just use their agent, which uses a local 27B model as one of the
> underlying models?
> Without a shared set of expectations, they have to test
> it manually and might miss something. But if there's already a small exam
> that captures the key behaviors (does it preserve template variables, does
> it handle plurals correctly, does it get the tone right), they can run it
> themselves and know where they stand. It's not about ranking models or
> proving anything to anyone else, but about giving yourself and your
> collaborators a quick way to verify that things still work after a change.
>
> If the agent skill example seems too simple, the same pattern applies to
> code generation, API calls from natural language, or migrations. The
> hardest part in all of these is describing the expected behavior in a way
> that's testable without the tests becoming a maintenance burden themselves.
>
> I think the part that matters most here isn't really about agents or LLMs
> specifically. It's about codifying user stories in a testable form. As more
> interactions with Airflow happen through agents and AI-powered tools,
> having those expectations written down and runnable becomes a way to test
> changes to the platform itself. If an API changes, does the experience
> still hold? If a prompt strategy changes, does it reduce cost without
> breaking behavior? Without testable user stories, those questions stay
> anecdotal. With them, you can modify the right parts of the system and
> measure the impact, regardless of what agent or model someone happens to be
> using.
>
> On gaming: I don't think it's a real risk here, because the benchmark is
> for the person running it, not for someone else to trust blindly. You're
> testing your own setup. If someone designs an exam poorly, the answer is to
> improve the exam, the same way you'd improve a test suite that doesn't
> catch real bugs. The accountability is always on the user unless we're
> actually shipping an agent ourselves.
>
> Jarek, the ecosystem path makes a lot of sense and I'm happy to keep
> building and exploring there. I do think this discussion is really valuable
> regardless of where the work lives, because the questions you're all
> raising (cost, stability, the meaning of words like "conformance" in this
> space) are exactly the ones the community needs to work through as AI
> capabilities land in Airflow. I'd rather we have this conversation early in
> case it helps align different AI AIP builders and ecosystem tools around
> shared intent and reproducibility.
>
> P.S. Dev-iL, thanks for flagging the spec link. I had botched the link in
> the message; here it is: https://ai-evals.io/spec/cert/v0.1.0/schema.json
> (I'll iterate on it; I've been looking at comparable specs/standards).
>
> Alex Guglielmone Nemi
>
> On Mon, Feb 23, 2026 at 8:32 AM Jarek Potiuk <[email protected]> wrote:
>
> > While I think it's a good idea, I also think it would be great to have
> > a third party run such evaluations on your site. You could ask people
> > to contribute there - rather than having an official AIP and something
> > "in-airflow" - simply because we do not know what value it can bring
> > to the community, what kind of burden it will introduce - and
> > Dev-iL's questions regarding how to run it, how to block CI, and the
> > value and "official status" of having "certificates" issued officially
> > by the Airflow community are probably the most important questions that
> > need answering.
> >
> > I think - since this is pretty orthogonal to Airflow's regular work
> > and features - I would rather (in your case) focus on
> > running such evaluations and demonstrating their value outside of the
> > core Airflow efforts. For years, we have been preaching "what we can
> > remove from Airflow and run by the community rather than what we can
> > add."
> > This immediately looks like something you can run outside for a while -
> > and demonstrate how it adds value - both to Airflow and to your
> > `ai-evals.io` site and schema. There is no clear "standard" among
> > evaluation frameworks. Many exist, and perhaps more established
> > standards will arise eventually. That would be a great moment for us
> > to adopt one, but until then, I think it's more of a "let's wait and
> > see" from the community standpoint. You could start working on
> > getting the standards right - and demonstrating how this might be good
> > for Airflow; then your solution might be a good candidate to build the
> > standard around.
> >
> > I think that can easily be part of our ecosystem page
> > https://airflow.apache.org/ecosystem/ and you are absolutely welcome
> > to share any progress here and ask people to participate, but I think
> > it's quite a bit too early for the community to "adopt" it.
> >
> > J.
> >
> > On Mon, Feb 23, 2026 at 9:09 AM Ephraim Anierobi
> > <[email protected]> wrote:
> > >
> > > Hi Alex,
> > >
> > > I share the same concerns as Dev-iL regarding running it in the CI
> > > (which is the best place to run this kind of test/exam) and the cost.
> > >
> > > For the AIP, these are the two other big risks I see:
> > >
> > > 1. You mentioned pass/fail instability as one of the downsides of the
> > > AIP, and that's true, because LLM answers can change from run to run,
> > > and also change when the model or version changes. So a
> > > “cert/conformance” label could be misleading and likely cause arguments
> > > in AIPs. Debates could shift from implementation to exam semantics.
> > >
> > > 2. It seems like the system can be gamed. Scores can shift because the
> > > grader changes, and people may end up optimizing for the grader instead
> > > of quality. What do you think?
> > >
> > > Best,
> > > Ephraim
> > >
> > >
> > > On Mon, 23 Feb 2026 at 05:47, Dev-iL <[email protected]> wrote:
> > >
> > > > Hi Alex,
> > > > Thank you for the interesting suggestion!
> > > >
> > > > I have several questions about practical aspects of these evaluations:
> > > > 1. Who is supposed to run them, and at what stage? It sounds to me
> > > > that for maximum benefit this should run as part of CI, at least when
> > > > LLM-facing features are modified. If so, where are we going to get
> > > > API credits to run this?
> > > > 2. As you mentioned, this is a rapidly developing space with new
> > > > models popping up on a regular basis. What is the benefit of knowing
> > > > that evals passed at a given point in time with a given model? Not
> > > > all users have access to all LLMs, and results obtained on one model
> > > > don't tell us how another model would behave. Say an AIP was drafted
> > > > and evaluated on ChatGPT 5.3, and by the time it reaches the user,
> > > > the latest version might be 6.1. Do we expect users to use an older
> > > > model just for interacting with Airflow? Do we expect users to submit
> > > > certs if they tried the code on new models?
> > > >
> > > > Would appreciate your clarifications!
> > > > Dev-iL
> > > >
> > > > P.S.
> > > > The spec link is broken.
> > > >
> > > > On Mon, 23 Feb 2026, 4:47 Alex, <[email protected]> wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > I'd like to propose an AIP [1] to establish a shared benchmark and
> > > > > conformance standard for AI capabilities in Airflow. Sharing this
> > > > > to gather feedback and rough consensus.
> > > > >
> > > > > The core idea is to give AI capabilities an exam. Define what
> > > > > "qualified" looks like for a given role - DAG Operator, DAG Code
> > > > > Generator, DAG Fixer, Migration Agent - and let anyone reproduce
> > > > > that result.
> > > > > Once the exam
> > > > > exists, the conversation about whether a feature is ready becomes
> > > > > concrete. Each AI-related AIP can declare which roles it targets
> > > > > and ship a corresponding exam suite, without depending on another
> > > > > AIP's roadmap. The same goes for providers or anyone experimenting
> > > > > in this fast-moving space. It provides a common language for AI
> > > > > evaluation (often referred to as AI Evals).
> > > > >
> > > > > *A useful example is already running against a real Airflow
> > > > > localization skill,* with a viewable cert here [2]. A simpler
> > > > > non-Airflow example is also available [3]. The exam showcases the
> > > > > pattern that allows us to produce machine-readable and
> > > > > human-comparable outputs for easy collaboration, regardless of the
> > > > > internals of the black box (Agent, Model, Skill, MCP).
> > > > >
> > > > > I gave a lightning talk on this topic at Airflow Summit 2025 [4]
> > > > > and have been building toward it since: evals-playground [5] holds
> > > > > working examples against Airflow AI capabilities, ai-evals.io [6]
> > > > > explains the pattern to people from different backgrounds,
> > > > > eval-ception [7] demonstrates it hands-on, and the exam spec [8] is
> > > > > taking shape at ai-evals.io/spec.
> > > > >
> > > > > Let me know if you have any questions.
> > > > > > > > > > - [1] Draft proposal - > > > > > > > https://docs.google.com/document/d/1KvEX9zdq9-NMfSnl_qvET5SgSeujF-Zz/ > > > > > - [2] Cert viewer - Airflow localizer exam: > > > > > > > > > > > > > > > > > https://ai-evals.io/certify-your-agent/viewer.html?cert=https://raw.githubusercontent.com/Alexhans/eval-ception/main/certs/airflow-localizer-es/airflow-es-localizer-exam-pydantic-agent.cert.json > > > > > - [3] > > > > > > > > > > > > > > > > > https://ai-evals.io/certify-your-agent/viewer.html?cert=https://raw.githubusercontent.com/Alexhans/eval-ception/main/certs/simple-exam/simple-exam.ollama-agent.cert.json > > > > > - [4] Toward a Shared Vision for LLM Evaluation in the Airflow > > Ecosystem > > > > - > > > > > Airflow Summit 2025 - Lightning Talk (5 min) - > > > > > > > > > > > > > > > > > https://alexhans.github.io/posts/talk.toward-a-shared-vision-of-llm-evals-in-airflow-ecosystem.html > > > > > - [5] https://github.com/Alexhans/evals-playground > > > > > - [6] https://ai-evals.io/ > > > > > - [7] https://github.com/Alexhans/eval-ception > > > > > - [8] https://ai-evals.io/spec/cert/v0.1.0/schema.json > > > > > > > > > > Alex Guglielmone Nemi > > > > > > > > > > > > > --------------------------------------------------------------------- > > To unsubscribe, e-mail: [email protected] > > For additional commands, e-mail: [email protected] > > > > >
