On Mon, 2026-02-23 at 09:07 +0000, Matthew Brett via NumPy-Discussion wrote:
> Hi,
>
> On Mon, Feb 23, 2026 at 8:53 AM Sebastian Berg
> <[email protected]> wrote:
> >
> > On Sun, 2026-02-22 at 10:19 -0500, Marten van Kerkwijk via NumPy-
> > Discussion wrote:
> > > Ralf Gommers via NumPy-Discussion <[email protected]>
> > > writes:
> > >
> > > [snip]
> > >
> > > > I do think a web of trust is a potentially valuable idea. However,
> > > > the need right now isn't there yet (at least for NumPy), and it
> > > > does have the potential to close the door pretty strongly to
> > > > newcomers. On the other hand, we already don't run CI on PRs from
> > > > first-time contributors - that was something that turned out to be
> > > > necessary to limit wasting resources. A web of trust is something
> > > > to keep in mind, in my opinion, and to consider adopting if and
> > > > when it becomes a clear win for maintainer load.
> > >
> > > Thanks for the reminder that we do not run CI for first-time
> > > contributors. That is nice in that there is already a mechanism in
> > > place to recognize those. As an intermediate step towards trust (but
> > > not yet a web of it!), would it make sense to have a welcome message
> > > that asks the new contributor to introduce themselves by editing
> > > their top comment? I.e., something like this:
> >
> > Thanks for the suggestions on what concrete steps we should take (and
> > I agree we should do something).
> > I would be fine with basically adopting either of these; adopting the
> > SciPy/SymPy one seems pragmatic, and it is nice to keep things similar
> > in similar projects. SymPy/LLVM do have a pretty clear note on
> > copyright (not all do, I think). [1]
> > (I like many things about the LLVM one; it is nicely explicit about
> > its reasoning, etc., but I guess that also makes it longer.)
> >
> > To me they honestly all get the important points across. And
> > honestly, I suspect many contributors won't read it anyway, so it may
> > be more useful as something to point to in the rare case where you
> > close a PR.
> >
> > To achieve better transparency, I would suggest we add check-boxes.
> > E.g., sklearn has this now:
> >
> > <!--
> > If AI tools were involved in creating this PR, please check all boxes
> > that apply below and make sure that you adhere to our Automated
> > Contributions Policy:
> >
> > https://scikit-learn.org/dev/developers/contributing.html#automated-contributions-policy
> > -->
> > I used AI assistance for:
> > - [ ] Code generation (e.g., when writing an implementation or fixing
> >       a bug)
> > - [ ] Test/benchmark generation
> > - [ ] Documentation (including examples)
> > - [ ] Research and understanding
> >
> > I am not sure how well it is used, but I think that is a good start
> > to see where it goes. I could imagine trying to put in something
> > about the scope of AI use, but I am not sure if it matters. It may be
> > easier to just follow up on PRs where it is unclear.
> >
> > (FWIW, I like the comment asking for a bit of personal context; it
> > feels both helpful and welcoming! But when it comes to AI
> > specifically, I would start with the check-boxes for pragmatism.)
>
> Unfortunately, as Oscar's example showed (and other slop PRs seem to
> confirm), it looks as though the check-boxes will be entirely useless,
> as the AI is perfectly capable of filling those out for you, and won't
> worry (as far as we know) about choosing the result most likely to get
> the PR merged.
>
> That in turn has some major costs for maintainer burn-out - as Maarten
> and Matt H are pointing out.
>
> I'm increasingly leaning towards: no AI-generated code at all, unless
> a) it comes from a well-trusted contributor, and b) it is justified by
> that contributor.
>
> > [1] I would be happy with linking out to continuing discussion from
> > the note in the LLVM one: "Artificial intelligence systems raise many
> > questions around copyright that have yet to be answered."
> > But I think that is about as much as I want to focus on that point in
> > something targeted at contributors.
>
> Just flagging - but if we aren't asking the contributor to address
> copyright, we have two options:
>
> * The maintainer does it. I don't think there's any chance that will
>   happen in practice.
> * We effectively decide we aren't going to worry about AI copyright
>   violations.
>
> I realize the second option is the de-facto preference of some here,
> but if that's so, I think we have to say it out loud.
We can link out to the policy, which would have a note on copyright.
And that in turn could link out to further thoughts (heck, even things
like Peter Wang's talks). Maybe we can/should add a one-sentence thing,
but I want to be sure not to create undue uncertainty/fear for new
contributors over an issue that, IMO, affects relatively few PRs,
because most PRs are just tiny bug-fixes. So I think it would be good
to discuss a concrete wording here. [1]

To me it seems like an exaggeration to say we aren't going to worry
about copyright. The question is to what degree it is helpful to force
that worry on typical first-time contributions. (And yes, there is also
a question about how much we as a project/community should worry about
it, but I think that is a separate discussion.)

Asking for transparency can fail, but if contributors lie about it,
they will also ignore a ban, and it gives us another reason to just
close a PR or ban them.

And if we want to draw a hard line for contributors, I don't like that,
because there is a vast range here:
- used it for brainstorming the approach
- used tab completion in Cursor
- used it to start some tests, but maybe 20% of the lines remained
- basically wrote the whole function and only tweaked it a bit

And additionally, there is Peter Wang's scope argument that we'll care
much less about it if it is a docs website vs. a core algorithm.

The ask here can only be transparency; anything beyond that could be
guidelines that we put somewhere, but they would be guidelines to dig
up maybe once every few months. And I would rather get to a world
where we know what comes in (and then can still close a PR) than try
to draw a strict line.

- Sebastian

[1] I don't know, maybe in the (rough!) direction of: "It is important
to us to know about AI use, both for community reasons and for
detailed review, which may include copyright concerns for large
contributions to some parts of NumPy."
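P.S. One small point in favor of the check-boxes, even if they can't be
trusted: a task list like the sklearn template quoted above is at least
machine-readable, so triage tooling could surface what a PR declares.
A minimal Python sketch (the helper name and sample body are purely
illustrative, not an existing tool; it only assumes the standard
markdown `- [ ]` / `- [x]` task-list syntax):

```python
import re

# Sketch: extract which AI-assistance boxes are ticked in a PR body
# that uses a markdown task list like the sklearn template above.
# (checked_items is a hypothetical helper, not part of any real bot.)
CHECKBOX = re.compile(r"^\s*-\s*\[(?P<mark>[ xX])\]\s*(?P<label>.+)$",
                      re.MULTILINE)

def checked_items(pr_body: str) -> list[str]:
    """Return the labels of all checked '- [x]' items, in order."""
    return [m.group("label").strip()
            for m in CHECKBOX.finditer(pr_body)
            if m.group("mark") != " "]

# Illustrative PR body with two boxes ticked:
body = """I used AI assistance for:
- [x] Code generation (e.g., when writing an implementation or fixing a bug)
- [ ] Test/benchmark generation
- [x] Documentation (including examples)
- [ ] Research and understanding"""
print(checked_items(body))
```

Whether that ever gets wired into a bot is another question, of course;
the point is just that the declaration is structured enough to query.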
> Cheers,
>
> Matthew
> _______________________________________________
> NumPy-Discussion mailing list -- [email protected]
> To unsubscribe send an email to [email protected]
> https://mail.python.org/mailman3//lists/numpy-discussion.python.org
> Member address: [email protected]
