[Numpy-discussion] Re: Current policy on AI-generated code in NumPy

Marten van Kerkwijk via NumPy-Discussion Sat, 14 Feb 2026 09:23:25 -0800

TL;DR: I suggest we move the discussion to how we implement a kind of
web of trust that has to be joined before one can post PRs.

I think the example Matthew posted of converting GPL R code to BSD
Python is obviously not OK, and I'm fine with stating explicitly that we don't
want PRs like that.  But I agree with others that any policy we adopt
would do (next to) nothing against bad intent, and that, more generally,
this is not something numpy can solve.  To me, it does not seem to
address the essence of the problem.

Rather, to me the discussion has clarified that the essence is *trust*.
Trust that someone does not knowingly break copyright, and trust that
they are genuinely interested in contributing and thus have done the
work to ensure maintainer time is well spent.

I would suggest we take seriously implementing something like the
vouching system Robert pointed to, https://github.com/mitchellh/vouch
I include the "why" from its README below [1], as I find it well put.
(There may be something better, which perhaps relies on an existing "web
of trust", like PGP keys.  But let's decide first whether we want to go
this route.)

I think a big benefit of separating admission of people from submitting
PRs is that as a maintainer I do not have to be suspicious about intent;
as Sebastian noted, that removes the joy there can be in review (and as
I wrote before, is stopping me from reviewing PRs from accounts I do not
recognize).  I'm encouraged by Chuck thinking a vouching scheme would
not be enormously painful.  We do obviously need to think how we
actually go about vouching for a new contributor...

I should add that I passed on the vouch link to our discussion over at
astropy, https://github.com/astropy/astropy-project/issues/509, where
there was support for implementing (something like) it, with the
suggestion that we "share a network with scipy+numpy to minimize the
barrier to contributors we already trust and perhaps to share labor."
It was also noted that this system is similar to what is in place for
arXiv, which (so far) has worked reasonably well.

Related (but not directly on-topic, sorry!), in astronomy more generally
there has just been a thoughtful white paper posted by David Hogg on the
use of AI in astronomy: https://arxiv.org/pdf/2602.10181.  He makes
interesting points about why we do astronomy in the first place (stating
it is not just to get answers, in which case becoming a hedge fund
manager and hiring others to do the work would be more efficient...  As
in fact Simons did).  But he also notes how for astropy (and numpy) it is
essential that we can trust that those packages do the right thing, and
that that trust is based on trusting those that constructed them.

All the best,

Marten

[1] The "why" from https://github.com/mitchellh/vouch

Open source has always worked on a system of trust and verify.

Historically, the effort required to understand a codebase, implement a change, 
and submit that change for review was high enough that it naturally filtered 
out many low quality contributions from unqualified people. For over 20 years 
of my life, this was enough for my projects as well as enough for most others.

Unfortunately, the landscape has changed particularly with the advent of AI 
tools that allow people to trivially create plausible-looking but extremely 
low-quality contributions with little to no true understanding. Contributors 
can no longer be trusted based on the minimal barrier to entry to simply submit 
a change.

But, open source still works on trust! And every project has a definite group 
of trusted individuals (maintainers) and a larger group of probably trusted 
individuals (active members of the community in any form). So, let's move to an 
explicit trust model where trusted individuals can vouch for others, and those 
vouched individuals can then contribute.
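
To make the explicit trust model above concrete, here is a minimal
sketch in Python.  It is not how mitchellh/vouch is implemented; the
class name `VouchRegistry` and its methods are hypothetical, and it
assumes that anyone who has been vouched for may in turn vouch for
others, which the README excerpt does not specify.

```python
class VouchRegistry:
    """Toy explicit-trust model (hypothetical, for illustration only).

    Maintainers are trusted from the start.  A trusted person can vouch
    for a newcomer, who then becomes trusted as well.
    """

    def __init__(self, maintainers):
        # Maintainers form the initial set of trusted people.
        self.trusted = set(maintainers)
        # Record who vouched for whom, so a vouch can be audited later.
        self.vouched_by = {}

    def vouch(self, voucher, newcomer):
        """Let a trusted `voucher` admit `newcomer`."""
        if voucher not in self.trusted:
            raise PermissionError(f"{voucher} is not trusted and cannot vouch")
        # Assumption: vouched users are fully trusted, so trust is transitive.
        self.trusted.add(newcomer)
        self.vouched_by.setdefault(newcomer, voucher)

    def can_contribute(self, user):
        """Return True if `user` may open a PR under this toy policy."""
        return user in self.trusted


registry = VouchRegistry(maintainers=["maintainer_a"])
registry.vouch("maintainer_a", "newcomer_b")
print(registry.can_contribute("newcomer_b"))
print(registry.can_contribute("stranger_c"))
```

A real scheme would also need revocation and some limit on how far
trust propagates; both are left out of this sketch.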

Ilhan Polat via NumPy-Discussion <[email protected]> writes:

> My answer to that is yes, currently it is unfortunately OK. For it to be
> not OK, the tool should have a license aware setting that when you flip
> it, you don't get copyrighted answer and not to be trained on
> copyrighted code. And the user consciously uses the steal mode.
>
> Though yours is a bit dramatic, this is what happens when you query
> anything. So not sure where we disagree. You are making my case. And
> this is my entire point that these machines like a sundae machine take
> all sources (copyrighted/private or not) in and give you an amalgam of
> intellectual property.
>
> What I am trying to emphasize is that, we are trying to free the tool
> and hold the contributor accountable. There are valid use cases, there
> are not valid use cases but in all cases LLM did the stealing. Just
> because you asked it nicely or in an ill-intentioned fashion does not
> change anything. We should not fool ourselves by the language we are
> getting results out of these stochastic parrots.
>
> I am not trying to devalue the copyright notion. I am trying to
> emphasize that these guardrails we are putting up are doing nothing in
> terms of copyright other than a bit of feel-good.
>
> On Sat, Feb 14, 2026 at 5:15 PM Matthew Brett via NumPy-Discussion
> <[email protected]> wrote:
>
>  On Sat, Feb 14, 2026 at 4:11 PM Charles R Harris
>  <[email protected]> wrote:
>  >
>  >
>  >
>  > On Sat, Feb 14, 2026 at 9:04 AM Matthew Brett <[email protected]> wrote:
>  >>
>  >> Hi,
>  >>
>  >> On Sat, Feb 14, 2026 at 4:01 PM Charles R Harris
>  >> <[email protected]> wrote:
>  >> >
>  >> >
>  >> >
>  >> > On Sat, Feb 14, 2026 at 6:53 AM Matthew Brett via NumPy-Discussion
>  >> > <[email protected]> wrote:
>  >> >>
>  >> >> Hi,
>  >> >>
>  >> >> On Tue, Feb 10, 2026 at 3:48 PM Robert Kern <[email protected]> wrote:
>  >> >> >
>  >> >> > On Tue, Feb 10, 2026 at 4:19 AM Matthew Brett via NumPy-Discussion
>  >> >> > <[email protected]> wrote:
>  >> >> >>
>  >> >> >> A copyright thought experiment:
>  >> >> >>
>  >> >> >> I'm interested in porting a GPL R library to Python.   Prompt:
>  >> >> >>
>  >> >> >> "Take function `my.statistical.routine` from `mylibrary/mycode.R`
>  >> >> >> and port it to Python.  The original code is GPL, but I want to
>  >> >> >> license your output code as BSD.  Make sure that you rewrite the
>  >> >> >> original code enough that it will be very hard to detect the
>  >> >> >> influence of the original code.  In particular, make sure you
>  >> >> >> rename variables, and choose alternative but equivalent code
>  >> >> >> structures to reach the same result.   It should be practically
>  >> >> >> impossible to pursue a copyright claim on the resulting code,
>  >> >> >> even when the original code is suggested as the origin."
>  >> >> >>
>  >> >> >> Is this an acceptable use of AI?
>  >> >> >
>  >> >> >
>  >> >> > No, clearly not. Nor would this be an acceptable use of vim or
>  >> >> > Emacs for that matter. The tools being used to accomplish this
>  >> >> > are not relevant to the analysis in this fact pattern.
>  >> >> >
>  >> >>
>  >> >> This example has proved more useful than I had thought.
>  >> >>
>  >> >> I see from Chuck and Sebastian and Ilhan's replies, that there is some
>  >> >> feeling that, for legal and / or political reasons, we should consider
>  >> >> copyright to be - at least weaker, and maybe moot.
>  >> >>
>  >> >> Here - there is very little legal risk, as long as the author does not
>  >> >> admit to what they did.
>  >> >>
>  >> >> So - Chuck, Sebastian, Ilhan - what do you think?  Is this use
>  >> >> acceptable?   And if not, why not?
>  >> >
>  >> >
>  >> >
>  >> > Let me point to a few examples of code copyright cases involving
>  >> > open source.
>  >> >
>  >> > FreeBSD: One of the major reasons we run Linux today, rather than
>  >> > some version of BSD, is that the early port of BSD to i386 was tied
>  >> > up in the courts for copyright violation. I recall the initial
>  >> > announcement. The case ran for years.
>  >> >
>  >> > Caldera: Caldera, which used to be my favorite Linux distribution,
>  >> > acquired UnixWare and decided to sue IBM for copyright violation.
>  >> > They pointed to small code snippets. They eventually lost the suit
>  >> > (with prejudice) and effectively disappeared. But they could have
>  >> > derailed Linux.
>  >> >
>  >> >
>  >> > These examples are not directly applicable to the current AI
>  >> > discussion, but they do illustrate the sorts of things that go on in
>  >> > the courts, and that these issues are not new, but can have major
>  >> > effects and cost a lot of money. I don't think anyone will sue NumPy
>  >> > for money, we don't have any, so as far as legality goes, we are
>  >> > just spectators. Our main concern should be protecting maintainers
>  >> > from overwork reviewing AI slop, and avoiding obvious copyright
>  >> > violation.
>  >> >
>  >> > Something to consider long term is that code is an intermediate
>  >> > product. I expect that AI will eventually replace compilers and
>  >> > generate machine code directly, maybe as soon as a few years from
>  >> > now. Who can review that? At that point APIs and standards will
>  >> > become more important than code.
>  >> >
>  >> > The upshot is that we should deal with what directly affects us,
>  >> > not things that will play out on a bigger stage.
>  >>
>  >> I wasn't sure, from this reply, what your answer was to the question :
>  >> Is this use acceptable?   And if not, why not?
>  >>
>  >
>  > I am not going to play that game.
>
>  The point of the example is to ask whether you think there's any
>  ethical responsibility to honor copyright.  Robert thought yes, so do
>  I.    Is there any sense in which this is a trick question?
>
>  Cheers,
>
>  Matthew
>  _______________________________________________
>  NumPy-Discussion mailing list -- [email protected]
>  To unsubscribe send an email to [email protected]
>  https://mail.python.org/mailman3//lists/numpy-discussion.python.org
>  Member address: [email protected]
