Re: Merging AI-generated program code

Flaming Hakama by Elaine Mon, 01 Jun 2026 17:32:16 -0700

Current Lilypond Pipeline Issue

I'd like to return to the actual concern:  is our workflow of automatically
integrating every contribution unless flagged by reviewers broken?


I wonder if we can be more clear about it since there are two implications
in this concern.


The first concern is: does it imply that someone who has access to our
pipeline cannot be trusted?

I would suspect not.  And if it did, then the remedy would be to remove
that user's permissions.  So I don't think this is the concern we really
need to solve for.

(If I am wrong and this is a problem, then we do have to solve this
problem.  Because if we were planning on granting access to people we don't
trust, then there will be no available solutions to any of our other
concerns.)



So, the main concern is:  we see a future where we have too few reviewers
and therefore some problematic PRs will not get flagged.

In my estimation, so long as the permissions/trust of contributors is under
control, I don't think this is an imminent concern.

However, I do suspect that it is inevitable that we will see more activity
and interest from AI-savvy developers.  Even an increase in contributions
of unproblematic code could easily overwhelm the capacity of our reviewers.

Since our reviewers tend to focus on their areas of expertise, it is not
like any single developer feels the responsibility to make sure every PR
gets reviewed, nor should they.  So we can imagine that this will be a
problem if PR size and frequency increase.

If we accept that this is a concern, and want to prevent it from becoming a
problem, then we will want to adjust our pipeline, or our management of it.


Options for Solving the Pipeline Issue

The nuclear approach is to do away with the automatic merging of the PR and
require all PRs to receive approvals.  But that would add friction to the
existing pipeline, and we may want to avoid that.

(please forgive any inaccurate use of terms like PR/approve/flag as I am
not intimately familiar with how our pipeline is implemented)


That leaves what I think is the easiest way to address this concern under
the current pipeline: appoint a reviewer whose sole job is to reject any PR
that makes it to the penultimate step without any comments from any other
reviewer.  Crucially, however, they would only reject PRs from contributors
who are not legacy trusted contributors.  Contributions from trusted
contributors would be allowed to go through silently, as is customary.
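
To make the rule concrete, here is a minimal sketch in Python.  It is not
tied to how our pipeline is actually implemented; the PullRequest fields
and the function name are hypothetical, invented only for illustration.

```python
from dataclasses import dataclass


@dataclass
class PullRequest:
    # All fields are hypothetical stand-ins, not real pipeline data.
    author: str
    reviewer_comments: int      # comments from reviewers other than the czar
    at_penultimate_step: bool   # about to be merged automatically


def czar_should_reject(pr: PullRequest, trusted: set[str]) -> bool:
    """Return True if the appointed reviewer should reject this PR."""
    if not pr.at_penultimate_step:
        return False            # too early; other reviewers may still comment
    if pr.author in trusted:
        return False            # legacy trusted contributors pass silently
    return pr.reviewer_comments == 0   # nobody else has looked at it
```

The only input that needs human judgment is the set of trusted names; the
rest is mechanical, which is why the role needs no deep technical
background.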

This person could be a new volunteer, whose only technical qualification
is familiarity with github, to avoid adding any burden to trusted
reviewers.

It is worth being clear about what kind of gatekeeping we are doing and why.

The current pipeline is designed for a small and well-trusted group of
developers.  They are valuable and we don't want to waste their time.

If we want to enforce more guardrails for new developers then we will need
to create some kind of distinction between trusted and untrusted developers.

We can start out being loose about this, because we kind of all know who
that is.  We could head towards establishing some kind of criteria around
metrics like commit history, etc. to have a coherent way of establishing
what "trusted" means.  But for now, let's just suppose that we have a
loyalty czar that will only flag unfamiliar contributors' PRs.
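
If we later want something less informal, a criterion based on commit
history might look roughly like the sketch below.  The thresholds and
field names are placeholders I made up for illustration, not a proposal
for actual numbers.

```python
from dataclasses import dataclass


@dataclass
class ContributorHistory:
    # Hypothetical metrics; we would have to decide what to measure.
    merged_commits: int
    years_active: float


def is_trusted(history: ContributorHistory,
               min_commits: int = 50,       # placeholder threshold
               min_years: float = 2.0) -> bool:  # placeholder threshold
    """Return True if the contributor meets both placeholder thresholds."""
    return (history.merged_commits >= min_commits
            and history.years_active >= min_years)
```

Whatever the numbers end up being, the point is only that "trusted" would
become something we can state and check rather than something we all just
know.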



The second topic is about how futile it is to attempt to gatekeep the
amount of AI influence over contributors' code.

In all cases, the responsibility for the code lies with the contributor.

Again, if we are letting new developers in and don't trust them, that is an
upstream problem with a different solution.

The problem is a reviewer bandwidth problem.  The contribution is not
digestible enough.  For its size, it does not provide enough context for
someone to understand what problem is being addressed, what the
architecture design of the solution is, why this was chosen compared to
other approaches, what other considerations went into the design, what the
compromises are, how the feature is tested, what the benchmark performance
is, and which docs and helpful snippets accompany it.
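
The list above could be treated as a checklist that a PR description
either satisfies or does not.  A rough sketch, assuming the description is
available as a mapping from section name to text (the section names are my
own shorthand for the items above, not an existing template):

```python
# Hypothetical section names, one per item in the list above.
REQUIRED_SECTIONS = (
    "problem",
    "architecture",
    "alternatives",
    "considerations",
    "compromises",
    "testing",
    "benchmarks",
    "docs",
)


def missing_sections(description: dict[str, str]) -> list[str]:
    """Return the required sections that are absent or blank."""
    return [name for name in REQUIRED_SECTIONS
            if not description.get(name, "").strip()]
```

A PR whose description comes back with a non-empty list is one that a
reviewer should not have to spend time on yet.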


If the loyalty czar wants to facilitate a PR, they may have to figure out
who is most appropriate to review it, and bring their attention to it.

The reviewer should be able to articulate any concerns, including any
beyond those above, without spending much time delving into the code.  They
can just raise potential concerns based on the scope of work being done,
edge cases, antipatterns, desired regtests, comparison of approach to
similar features, efficiency, backward compatibility, etc.: whatever they
would like to see that would help them more easily determine whether the
problem being solved is worthwhile and is solved in a coherent way.

If that conversation happens outside github, then the loyalty czar will
update the PR with the reviewer's concerns.

The contributor can then address the concerns.

Note that if the contributor used AI to generate the code, they will likely
use AI to generate the responses.  And if the contributor is using a
coherent setup, then the answers provided will make sense.  If the answers
make sense, then the code is also likely good enough.  If the reviewer has
literally any concerns that cannot be addressed to their satisfaction, then
they can leave it unmerged and ask for more improvements to code
consumability.

Crucially, we want to adopt a stance that the reviewers' time is most
valuable, and they can keep asking questions, and until they are satisfied,
the PR will not move.  It is up to the contributors to prepare their
contribution so that it is sufficiently consumable.  Contributors will need
to document enough due diligence to be commensurate with the scope of their
changes, and the whims of the reviewers.

If the code seems to work according to regtests, but the reviewer can't
comprehend it because of style issues, naming conventions, or lack of
comments, it is valid to reject it and ask for the code to be refactored
according to their standards.  Again, if the contributor is using a
productive contemporary development environment, then they will easily be
able to provide any requested documentation, or trivially make the
requested legibility changes.


The Future of Lilypond and AI

Rather than being suspicious of contributors' use of AI, we should
understand what an AI-oriented development approach should be.

We might want to consider having a parallel pipeline for contemporary work.


I will hereby stop trying to distinguish how much AI is used to write
code.  It is a somewhat arbitrary label at this point, because it can refer
to anything from auto-complete in an IDE, to asking an LLM for code
suggestions that you cut & paste, to having an agent write the code...to
having agent orchestration that writes design docs, code, compares code to
design docs, writes unit tests, validates them, writes documentation,
checks for vulnerabilities, optimizes, and measures performance...

So much so that I would argue that not using AI in any capacity is becoming
a charming historical practice much like doing long division on paper.

So, I think that we want to have a legacy pipeline and a contemporary
pipeline.

The legacy pipeline is our current trusted low-friction approach.  Publish
silently.

The contemporary one would have more structure and context.  PRs would need
to meet a high degree of standards and documentation to merge.



In terms of facilitating contributions, we would want to develop sets of
skills that represent aspects of lilypond development best practices.

Skills are an AI term for what is basically a set of instructions for
repeatable tasks.  Skills are used to populate the context window of LLM
requests and allow it to do the intended work in a consistent fashion.

Being able to provide this kind of clarity and consistency is what will
make lilypond survive vibe coding.


Elaine
