Hi,

On Mon, Feb 23, 2026 at 11:12 AM Sebastian Berg <[email protected]> wrote:
>
> On Mon, 2026-02-23 at 09:07 +0000, Matthew Brett via NumPy-Discussion
> wrote:
> > Hi,
> >
> > On Mon, Feb 23, 2026 at 8:53 AM Sebastian Berg
> > <[email protected]> wrote:
> > >
> > > On Sun, 2026-02-22 at 10:19 -0500, Marten van Kerkwijk via
> > > NumPy-Discussion wrote:
> > > > Ralf Gommers via NumPy-Discussion <[email protected]>
> > > > writes:
> > > >
> > > > [snip]
> > > >
> > > > > I do think a web of trust is a potentially valuable idea.
> > > > > However, the need right now isn't there yet (at least for
> > > > > NumPy) and it does have the potential to close the door pretty
> > > > > strongly to newcomers. On the other hand, we already don't run
> > > > > CI on PRs from first-time contributors - that was something
> > > > > that turned out to be necessary to limit wasting resources. A
> > > > > web of trust is something to keep in mind in my opinion, and
> > > > > consider adopting if and when it becomes a clear win for
> > > > > maintainer load.
> > > >
> > > > Thanks for the reminder that we do not run CI for first-time
> > > > contributors. That is nice in that there is already a mechanism
> > > > in place to recognize those. As an intermediate step towards
> > > > trust (but not yet a web of it!), would it make sense to have a
> > > > welcome message that asks the new contributor to introduce
> > > > themselves by editing their top comment? I.e., something like
> > > > this:
> > > >
> > >
> > > Thanks for the suggestions on what concrete steps we should do
> > > (and I agree we should do something).
> > > I would be fine with basically adopting either of these; adopting
> > > the SciPy/SymPy one seems pragmatic, it is nice to keep things
> > > similar in similar projects. SymPy/LLVM do have a pretty clear
> > > note on copyright (not all do, I think).
> > > [1]
> > > (I like many things about the LLVM one, it is nicely explicit
> > > about reasoning, etc., but I guess that also makes it longer.)
> > >
> > > To me they honestly all get the important points across. And
> > > honestly, I suspect many contributors won't read it anyway, so it
> > > may be more used to point to in the rare case where you close a PR
> > > or so.
> > >
> > > To achieve better transparency, I would suggest we add
> > > check-boxes. E.g. sklearn has this now:
> > >
> > > <!--
> > > If AI tools were involved in creating this PR, please check all
> > > boxes that apply below and make sure that you adhere to our
> > > Automated Contributions Policy:
> > >
> > > https://scikit-learn.org/dev/developers/contributing.html#automated-contributions-policy
> > > -->
> > > I used AI assistance for:
> > > - [ ] Code generation (e.g., when writing an implementation or
> > >       fixing a bug)
> > > - [ ] Test/benchmark generation
> > > - [ ] Documentation (including examples)
> > > - [ ] Research and understanding
> > >
> > > I am not sure how well it is used, but I think that is a good
> > > start to see where it goes. I could imagine trying to put in
> > > something about the scope of AI use, but I am not sure if it
> > > matters. It may be easier to just follow up for PRs where it is
> > > unclear.
> > >
> > > (FWIW, I like the comment asking for a bit of personal context; it
> > > feels both helpful and welcoming! But when it comes to AI
> > > specifically, I would start with the check-boxes, for pragmatism.)
> >
> > Unfortunately, as Oscar's example showed (and other slop PRs seem to
> > confirm), it looks as though the check-boxes will be entirely
> > useless, as the AI is perfectly capable of filling those out for
> > you, and won't worry (as far as we know) about choosing the result
> > most likely to get the PR merged.
> >
> > That in turn has some major costs for maintainer burn-out - as
> > Marten and Matt H are pointing out.
> >
> > I'm increasingly leaning towards: no AI-generated code at all,
> > unless a) from a well-trusted contributor, and b) justified by that
> > contributor.
> >
> > > [1] I would be happy with linking out to continuing discussion
> > > towards the note in the LLVM one: "Artificial intelligence systems
> > > raise many questions around copyright that have yet to be
> > > answered."
> > > But I think that is about as much as I want to focus on that point
> > > in something targeted at contributors.
> >
> > Just flagging - but if we aren't asking the contributor to address
> > copyright, we have two options:
> >
> > * The maintainer does it. I don't think there's any chance that
> >   will happen in practice.
> > * We effectively decide we aren't going to worry about AI copyright
> >   violations.
> >
> > I realize the second option is the de-facto preference of some here,
> > but if that's so, I think we have to say that out loud.
>
> We can link out to the policy, which would have a note on copyright.
> And that in turn could link out to further thoughts (heck, even
> things like Peter Wang's talks).
>
> Maybe we can/should add a one-sentence thing, but I want to be sure
> not to create undue uncertainty/fear for new contributors over an
> issue that, IMO, affects relatively few PRs, because most PRs are
> just tiny bug-fixes.
>
> So, I think it would be good to discuss a concrete wording here. [1]
>
> To me it seems like an exaggeration to say we aren't going to worry
> about copyright. The question is to what degree it is helpful to
> force that worry on typical first-time contributions.
> (And yes, there is also a question about how much we as a
> project/community should worry about it, but I think that is a
> separate discussion.)
>
> Asking for transparency can fail, but if the contributor lies about
> it, they will also ignore a ban, and it gives us another reason to
> just close a PR or ban them.
> And if we want to draw a hard line for contributors, I don't like
> that, because there is a vast range here:
> - used it for brainstorming the approach
> - used tab completion in Cursor
> - used it to start some tests, but maybe 20% of the lines remained
> - basically wrote the whole function and only tweaked it a bit
>
> And additionally, there is Peter Wang's scope argument: we'll care
> much less about it if it is a docs website vs. a core algorithm.
> The ask here can only be transparency; anything beyond that could be
> guidelines that we put somewhere, but they would be guidelines to dig
> up maybe once every few months.
>
> And I would rather get to a world where we know what comes in (and
> then can still close a PR) than try to draw a strict line.
I completely agree with your implication - that it's unlikely to be
useful to put up a lot of documentation where people can learn about
the copyright risks of AI. I guess few people will read it, and then
a) they won't know what to do about it and b) they have no obligation
to do anything about it, so there is very little chance they will act
on it.

The central point is: who takes responsibility for copyright in PRs
with AI-generated code? We know that can't be the maintainers - so it
can only be the contributors. This has always been the case, of
course, but now we're in a very different situation, where it's very
easy to end up with copyrighted code in the PR without realizing it.
And at the moment, by waving at some documentation, without further
instruction, we can surely predict from the data we have that the
contributor is not likely to take effective responsibility for
copyright. Therefore, there's a substantial risk of copyright leak,
roughly proportional to the lines of non-trivial, non-mechanical code
in the PRs.

Back to the options:

1) don't worry about it, because it's not important, or
2) put guards in place to make sure the contributor has carefully and
   personally reviewed the PR for copyright concerns, and review often
   to make sure these guards are effective, or
3) delay allowing large AI-generated blocks of code until we can see a
   way forward for copyright.

It's clear, I think, that some of us are in camp 1) - don't worry
about it. I'm absolutely not in that camp, but hey. All I'm saying is:
if we're in camp 1, we should make that clear.

Cheers,

Matthew
_______________________________________________
NumPy-Discussion mailing list -- [email protected]
To unsubscribe send an email to [email protected]
https://mail.python.org/mailman3//lists/numpy-discussion.python.org
Member address: [email protected]
