I like the idea and also assume that we can adjust and improve rules and expectations over time.

I just fear that if AI costs are (soon) raised to realistic price levels, we will need to check whether contributors still get free AI bot access; otherwise the idea melts away fast. (Low risk though; if this happens, we just need to change the approach... or look for funding.)

On 04.03.26 08:13, Jarek Potiuk wrote:
  Another manual step (and bottleneck) in triaging PRs is that maintainers
will still need to approve CI runs on GitHub.

Great point ... and ... it's already handled :)  - look at my PR.

When, during triage, the triager sees that workflow approval is needed,
my nice little tool prints the diff of the incoming PR in the terminal
and asks the triager to confirm that there is nothing suspicious; after
they answer "y", the workflow run is approved.
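The confirm-then-approve step could be sketched roughly like this. This is
a hypothetical illustration, not the actual breeze code: the approve
endpoint is GitHub's real "approve a workflow run for a fork pull request"
REST API, but the function names and the way the diff and the confirmation
are obtained are assumptions.

```python
# Hypothetical sketch of the "show diff, ask, approve" step.
# Only the REST endpoint path is real GitHub API; everything else is
# illustrative and not the actual `breeze pr auto-triage` implementation.
import subprocess


def triager_confirms(diff: str, ask=input) -> bool:
    """Show the incoming PR diff and ask the triager to confirm it is safe."""
    print(diff)
    answer = ask("Nothing suspicious - approve the workflow run? [y/N] ")
    return answer.strip().lower() == "y"


def approve_workflow_run(repo: str, run_id: int, post=None) -> str:
    """Approve a pending workflow run via GitHub's REST API."""
    url = f"repos/{repo}/actions/runs/{run_id}/approve"
    if post is None:
        # Delegate the authenticated POST to the gh CLI.
        subprocess.run(["gh", "api", "-X", "POST", url], check=True)
    else:
        # Injected callable, useful for testing without network access.
        post(url)
    return url
```

With a fake `ask` callback and an injected `post`, both steps are easy to
unit-test; in a real tool the gh CLI (or an authenticated HTTP client)
would perform the POST.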

J.


On Wed, Mar 4, 2026 at 3:35 AM Zhe-You Liu <[email protected]> wrote:

Hi all,

Thanks Jarek for bringing up the auto-triage idea!
Big +1 from me on the “let’s try” decision.

I really like this feature; it can help avoid copy‑pasting or repeatedly
writing similar instructions for contributors to fix baseline test
failures.

I had the same thoughts as Wei regarding flaky tests. Having deterministic
checks or automated comments should be enough to handle flaky test issues,
and contributors can still reach out on Slack to get their PRs reviewed, so
this should not be a problem.

Another manual step (and bottleneck) in triaging PRs is that maintainers
will still need to approve CI runs on GitHub. It doesn’t seem safe to fully
automate CI approval, as there could still be rare cases where an attacker
creates a malicious PR that logs environment variables during tests. Even
though we could use an LLM to check for these kinds of vulnerabilities
before approving a CI run, it is still not as safe as a manual review in
most cases (e.g. prompt injection attack). I’m not sure whether anyone has
a good idea for fully automated PR triaging -- for example, automatically
approving CI, periodically checking test baselines for quality (via
`breeze pr auto-triage`), re-approving CI as needed, and continuing this
loop until all CI checks are green.
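That fully automated loop could be sketched roughly like this. Purely a
hypothetical sketch: `approve_ci`, `run_auto_triage`, and `checks_green`
are placeholder callables standing in for real GitHub/breeze operations,
not actual APIs.

```python
# Hypothetical sketch of the fully automated triage loop described above.
# approve_ci, run_auto_triage, and checks_green are placeholder callables,
# not real breeze or GitHub APIs.
def auto_triage_loop(pr, approve_ci, run_auto_triage, checks_green,
                     max_rounds: int = 5) -> bool:
    """Approve CI, triage, and repeat until all checks are green.

    Returns True if the PR went green within max_rounds, False otherwise.
    A bounded loop (rather than looping forever) keeps a malicious or
    hopeless PR from consuming CI resources indefinitely.
    """
    for _ in range(max_rounds):
        approve_ci(pr)       # the step that is unsafe to fully automate
        run_auto_triage(pr)  # e.g. post baseline-failure comments
        if checks_green(pr):
            return True
    return False
```

Bounding the rounds is one way to address the safety concern above: even if
CI approval were automated, an attacker's PR could only trigger a limited
number of runs before a human has to step in.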

Best regards,
Jason

On Tue, Mar 3, 2026 at 10:48 PM Vincent Beck <[email protected]> wrote:

I like the overall strategy; for sure the tool will need continuous
iteration to handle all the different scenarios. But this is definitely
needed - the number of open PRs has skyrocketed in the last few months,
and it is very hard, if not impossible, to keep track of everything.

On 2026/03/03 14:39:41 Jarek Potiuk wrote:

Thanks for bringing this up! Overall, I like this idea, but it's
worth
testing it for a bit before we enforce it, especially the LLM-verify
part.
Oh absolutely. My plan to introduce it is (after the community
hopefully
makes an overall "let's try" decision):

* The human triager is always in the loop, quickly reviewing comments
just
before they are posted to the user (until we achieve high confidence)
* I plan to run it myself as the sole triager for some time to perfect
it
and to pay much more attention initially. I will start with smaller
groups/areas of code and expand as we go - possibly adding more
maintainers
willing to participate in triaging and testing/improving the tool
* See how quickly we can do it on a regular basis - whether we need
several
triagers or perhaps one rotational triager handling all PRs from all
areas
at a time.
* Possibly further automate it. My assessment is that 90% of failures will
be deterministic "fails" - those we can easily automate without hesitation
once the process and expectations are in place. The LLM part is a bit more
nuanced, and we can decide after we try.

* The author ensures the PR passes ALL the checks and tests (i.e. green).
It might sometimes mean we have to react even more quickly to `main`
breakages, and probably provide some "status" info and exceptions when we
know main is broken.
Probably, we should exempt some checks that might be flaky?

Yeah - this part is a bit problematic - but we can likely also add an
easy, automated, deterministic check to see whether the failure is
happening for others.
Sending an automated comment like, "Please rebase now, the issue is
fixed,"
to the authors would be super useful when they see unrelated failures.
This
is something we **should** figure out during testing. There will be
plenty
of opportunities :D
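Such a deterministic "is this failure happening for others?" check might
look roughly like this. Purely illustrative: the shape of `main_runs` (a
list of recent main-branch CI runs, each with a set of failed check names)
and the function names are assumptions, not the real tool.

```python
# Hypothetical sketch of a deterministic "failure happening for others"
# check. The data shape of `main_runs` is an assumption for illustration.
def failure_is_preexisting(check_name: str, main_runs: list,
                           min_hits: int = 2) -> bool:
    """Return True if the same check failed in at least `min_hits` of the
    recent main-branch runs - i.e. the author should rebase, not fix."""
    hits = sum(1 for run in main_runs if check_name in run["failed_checks"])
    return hits >= min_hits


def triage_comment(check_name: str, main_runs: list) -> str:
    """Produce the automated comment for the PR author."""
    if failure_is_preexisting(check_name, main_runs):
        return (f"`{check_name}` is also failing on main - "
                "please rebase once it is fixed.")
    return f"`{check_name}` looks specific to this PR - please take a look."
```

Requiring at least two recent main-branch failures (rather than one) is a
small guard against a single flaky run on main triggering misleading
"please rebase" comments.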


* All PRs that do not meet this requirement will be converted to
Drafts
with automated suggestions (reviewed quickly and efficiently by a
triager) provided to the author on the next steps.
This will be super helpful! I also do it manually from time to time.

Yes. I believe converting to Draft is an extremely strong (but fair)
signal
to the author: "Hey, you have work to do."

Also when this is accompanied by an actionable comment like, "Here is
what
you should do and here is the link describing it," it immediately
filters
out people who submit PRs without much work.

Surely - they might feed the comment into their agent anyway (or it can
read it automatically and act). But if our tool is faster, cheaper, and
more accurate (because of the smart human in the driver's seat) than their
tools, we gain the upper hand. And it should be faster - because only
checking expectations is much quicker than figuring out what to do.

Then in the worst case we will have continuous ping-pong (Draft ->
Undraft
-> Draft), but we will control how fast this loop runs. Generally, our
goal
should be to slow it down rather than respond immediately; for example,
running it daily or every two days is a good idea.

Effectively, if the PR is in the "ready for maintainer review" state, the
maintainer should be quite certain that the code quality, tests, etc., are
all good. Only then should they take a look (and they can immediately say,
"No, this is not what we want") - and this is absolutely fine as well. We
should not optimize for saving contributors from spending time on work we
might not accept; avoiding that waste is deliberately not a goal for me.
This will automatically mean that new contributors who want to contribute
significant changes will mostly waste a lot of time and have their PRs
rejected.

This is largely what we are already doing, mostly because those PRs do not
follow our "tribal knowledge," which the agent cannot easily derive.
Naturally, new contributors should start with small, easy-to-complete tasks
that can be easily discarded if reviewers reject them. This is what we have
always asked people to start with. So this approach, with the triage tool,
also largely supports it: someone new rewriting the proverbial scheduler
will have to spend significant time ensuring "auto-triage" passes, only to
have the idea completely rejected by the reviewer or be asked for a
complete rewrite. And this is perfectly fine. We always encouraged
newcomers to start with small tasks, learn the basics, and "grow" until
they were ready to propose bigger changes or split them into much smaller
chunks. With "auto-triage" this will be natural and expected, requiring
authors to invest more time and effort to reach the "ready for review"
status.

And I think it's absolutely fair and restores the balance we so badly need
now.



Best,
Wei

On Mar 3, 2026, at 9:34 PM, Jarek Potiuk <[email protected]> wrote:

*TL;DR; I propose a stricter (automation-assisted) approach for the
"ready
for review" state and clearer expectations for contributors
regarding
when
maintainers review PRs of non-collaborators.*

Following the
https://lists.apache.org/thread/8tzwwwd7jmtmfo4j9pzg27704g10vpr4
where I
showcased a tool that I claude-coded, I would like to have a
(possibly
short) discussion on this subject and reach a stage where I can
attempt
to
try the tool out.

*Why? *

Because we maintainers are overwhelmed and burning out, we no
longer
see
how our time invested in Airflow can bring significant returns to
us
(personally) and the community.

While some of us spend a lot of time reviewing, commenting on, and
merging
code, with the current rate of AI-generated PRs and other things we
do,
this is not sustainable. There is also a mismatch - or lack of clarity -
regarding the quality expectations for the PRs we want to review.

*Social Contract Issue*

We are a good (I think) open source project with a thriving community and
a great group of maintainers who are also friends and like working with
each other, but are also very open to bringing new community members in. As
other but also are very open to bringing new community members in.
As
maintainers, we are willing to help new contributors grow and
generally
willing to spend some of our time doing so. This is the social
contract
we
signed up for as OSS maintainers and as committers for the Apache
Software
Foundation PMC. Community Over Code.

However, this social contract - this community-building aspect - is
currently heavily imbalanced, because AI-generated content takes away
time, focus, and energy from the maintainers. Instead of having meaningful
discussions in PRs about whether changes are needed and communicating with
people, we start losing time talking to - effectively - AI agents about
hundreds of smaller and bigger things that should not be there in the
first place.
Currently - collaboration and community building suffer. Even if
real
people submit code generated by agents (which is becoming really
good,
fast
and cheap to produce), we simply lack the time as maintainers to
have
meaningful conversations with the people behind those agents.

Sometimes we lose time talking to agents. Sometimes we lose time talking
to people who have zero understanding of what they are doing and submit
continuous crap, and we should not be having that conversation at
all. Sometimes, we just look at the number of PRs opened in a given
day
in
despair, dreading even trying to bring order to them.

And many of us also have some "work" to do or a "feature" to work on top
of that.

I think we need to reclaim the maintainers' collective time to focus on
what matters: delegating more responsibility to authors so they meet our
expected quality bar (and efficiently verifying it with tools without
losing time and focus).

*What do we have now?*

We have already done a lot to help with it - AGENTS.md. The PR guidelines,
overhauled by Kaxil and updated by others, will certainly help clarify
expectations for agents in the future. I know Kaxil is also exploring a way
exploring a
way
to enable automated copilot code reviews in a manner that will not
be too
"dehumanizing" and will work well. This is all good. The better the
agents
people use and the more closely they follow those instructions, the
higher
the quality of incoming PRs will be. But we also need to help
maintainers
easily identify what to focus on - distinguishing work-in-progress and
unfinished PRs that need work from those truly "Ready for (human) review."

*How?*

My proposal has two parts:

* Define and communicate expectations for PRs that maintainers can
manage.
* Relentlessly automate this to ensure expectations are met and that
maintainers can easily focus on those PRs that are "Ready for review."

My tool (which needs a bit more fine-tuning and refinement):
https://github.com/apache/airflow/pull/62682 `*breeze pr auto-triage*` is
designed to do exactly this: automate those expectations by auto-triaging
the PRs. It not only converts them to Draft when they are not yet "Ready
For Review," but also provides actionable, automated (deterministic + LLM)
comments to the authors. A concrete maintainer (the current triager) is
using the tool very efficiently.

*Proposed expectations (for non-collaborators):*

These are not "new" expectations. Really, I'm proposing we completely
delegate the responsibility for fulfilling them to the author (with
helpful, automated comments - reviewed and confirmed by a human triager
for now) - and simply be very clear that generally no maintainer will look
at a PR until:

* The author ensures the PR passes ALL the checks and tests (i.e. green).
It might sometimes mean we have to react even more quickly to `main`
breakages, and probably provide some "status" info and exceptions when we
know main is broken.

* The author follows all PR guidelines (LLM-verified) regarding
description, content, quality, and presence of tests.

* All PRs that do not meet this requirement will be converted to
Drafts
with automated suggestions (reviewed quickly and efficiently by a
triager) provided to the author on the next steps.

* Drafts with no activity will be more aggressively pruned by our stalebot.
The triager is there mostly to quickly assess and generate comments - with
tool/AI assistance. The triager won't be the one who actually reviews
those PRs when they are "ready for review."

* Only after that do we mark the PR as "*ready for maintainer
review*"
(label)

* Only such PRs should be reviewed and it is entirely up to the
author to
make them ready.

Note: This approach is only for non-collaborators. For
collaborators: we
might have just one expectation - mark your PR with "ready for
maintainer
review" when you think it's ready.
We accept people as committers and collaborators because we already know
they generally know and follow the rules; automating this step isn't
necessary.

This is nothing new; we've already been doing this with humans handling
all the heavy lifting, without much strictness or organization, but this
is no longer sustainable.

I propose we make the expectations explicit, communicate them
clearly,
and
relentlessly automate their execution.

I would love to hear what y'all think.

J.

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

