Hi,

On Thu, Feb 12, 2026 at 7:02 AM Sebastian Berg <[email protected]> wrote:
>
> On Wed, 2026-02-11 at 23:22 +0000, Matthew Brett via NumPy-Discussion
> wrote:
> > Hi,
> >
> > On Wed, Feb 11, 2026 at 11:02 PM Lucas Colley via NumPy-Discussion
> > <[email protected]> wrote:
> >
> > > Hi Matthew,
> > >
> > > That all sounds reasonable to me so far, but what are the next
> > > steps?*
> > >
> > > > put a heavy requirement on contributors to either a) write the
> > > > code themselves, perhaps having asked for preliminary analysis
> > > > (but not substantial code drafts) from AI
> > >
> > > Is this enforceable to a significant extent? If not, in what
> > > sense could it pose a genuinely 'heavy requirement'?
> > >
> > > > or b) write the code with AI, but demonstrate that they have
> > > > done the research to establish the generated code does not
> > > > breach copyright.
> > >
> > > Perhaps this is more enforceable? But to be honest it is still
> > > quite unclear to me how I would establish with certainty that
> > > code I've had generated does not breach copyright, much less code
> > > that is being presented to me by a contributor. Do you see how to
> > > realise a 'heavy requirement' here?
> > >
> > > I agree with the spirit of the thought that the burden (if it is
> > > to exist) needs to be shifted away from maintainers, but it's
> > > unclear to me how we can actually shift it elsewhere.
> > >
> > > As we discussed last year, I think we have a start at a decent
> > > argument towards including a checkbox in PR templates which
> > > contributors must tick to state that they recognise the risk of
> > > copyright violation via LLM-generated code and take
> > > responsibility for the code they are submitting:
> > > https://github.com/matthew-brett/sp-ai-post/issues/2#issuecomment-2935428854
> > >
> > > Even there though, there are still multiple debatable premises.
> > > Of course, we can hardly aim for some sort of logical proof of
> > > the right way forward, but I think we need more focused attention
> > > and argument towards a specific and understandable goal if we are
> > > to come to consensus on some concrete steps forward. It is to
> > > this thread's merit that the discussion has been so varied and
> > > touched on many topics, but it is also demonstrative of the
> > > problem that broad and vague back-and-forths don't really help
> > > settle on anything concrete.
> >
> > Just to clarify - in case it wasn't clear, what I'm floating as a
> > proposal would be something like this, as a message to PR authors:
> >
> > Please specify one of these:
> >
> > 1) I wrote this code myself, without looking at significant
> > AI-generated code, OR
> > 2) The code contains AI-generated content, but the AI-generated
> > code is sufficiently trivial that it cannot reasonably be subject
> > to copyright, OR
> > 3) There is non-trivial AI-generated code in this PR, and I have
> > documented my searches to confirm that no parts of the code are
> > subject to existing copyright.
>
> While I am not particularly enthusiastic about focusing on copyright,
> I would be happy with adding such a checkbox on a PR.
> (If it was focused on copyright, then it seems to me we would need to
> ask more things, from "I used a source, but it had no code" to "I
> used a source with code, but I checked its license". If we want this,
> I would prefer a single fuzzy sentence that links out to elsewhere
> that can also discuss pitfalls around copyright+AI.)
>
> I am not sure that asking for a checkbox there will be honored, but I
> like the thought.
> First, it will increase the chance of getting the information (which
> I want as a reviewer).
> Second, my unfortunate feeling is that we'll get more aggressive/less
> friendly about closing PRs, and that is a shame; having the checkbox
> makes that part a bit easier on us, and maybe also more transparent
> to the user that we are struggling with this (the worry, of course,
> is closing a genuine human PR by accident).
>
> I think I largely understand the concerns around copyright, and maybe
> I am not careful/understanding enough in not being overly worried...
> But to my very personal feeling, the product of how much I feel we
> should worry and how much I feel that stressing these issues will
> help us as a project/open source just doesn't make me enthusiastic
> about aggressively pointing out these possible issues.
>
> There are many things to discuss around this. What does eroding
> copyright here mean for us as a project, for open source (GPL?), for
> open but not free code, for code that is leaked but sold?
> How will enforcement of actual copyright issues play out in practice?
> I just don't think this is the venue for settling these questions [1]
> and I would need a lot more clarity to even form a strong opinion
> that I would be willing to announce to the world with the weight of
> NumPy behind it.
I do understand that this is not the kind of issue that many of us enjoy discussing, but it seems to me that it is: a) of central importance to the future of open source, b) very urgent, and c) fairly straightforward.

To focus the discussion: the only thing of interest to us here is the acceptability, or otherwise, of large chunks of code generated by AI. I doubt that anyone has strong objections to AI for code review or code analysis.

For the central importance, imagine a world where copyright has become irrelevant. There are ways we could approach this issue where that is a likely outcome. We might have different views on whether that is acceptable, but at the very least it would be a very major change, with unpredictable consequences. We are used to open-source copyright as it exists. If we don't consciously address this now, or very soon, we'll find ourselves in another world, with consequences that are difficult to predict.

Of course, some of us don't care all that much about our own copyright, but bear in mind that by choosing not to defend it, we take away the ability of others to defend theirs. Specifically, if we choose to accept large AI-generated PRs, the copyright that will be violated is not ours, but that of others. Do we claim that right, to void the copyright of our fellow authors?

Returning to the central question, of large AI-generated PRs: it seems to me this is not a week of work to analyze. I don't think there's any controversy that making no effort to control copyright will, over the medium term, make copyright very difficult to honor. As I said before, the legal issues of enforcement are difficult, but not relevant to us, because we are considering our own ethics in observing copyright, and that will be a superset of the legal constraints.
It would be an error to defer to legal arguments on an ethical question, if only because the legal arguments are sufficiently complicated that we'd likely have lost the ability to enforce copyright before they are resolved. And, as I say, I think the legal arguments on enforcement are more or less irrelevant to our ethical decisions on copyright.

So, accepting large AI-generated PRs would be a significant threat to copyright. What do we get in return?

Ralf pointed out one benefit: that we are not seen to disapprove of the chosen workflows of our fellow developers. I think this is a weak argument. It seems to me perfectly reasonable to point out that contributing to the code base has some constraints, that copyright is one of them, and that AI-generated code runs the risk of violating copyright.

The second potential benefit is that, by accepting large AI-generated PRs, we would gain greatly in code coverage and quality, and that this benefit is great enough to be worth the price in terms of copyright. First, we have been prepared to pay a high price for observing copyright in the past: there are many GPL algorithms that we could have copied, to our benefit, but did not. Second, it seems to me we can wait on this. It is not yet clear that we would gain significantly, compared to our traditional requirement that people write their own code. While the gains are still unclear, the cost in terms of voiding copyright is too high.

Lastly, I was proposing a compromise: that we (Scientific Python projects) do not forbid AI-generated PRs, but place an extra burden on contributors to research any possible copyright violations. That seems like a reasonable compromise to me. What do you think?

Cheers,

Matthew
_______________________________________________
NumPy-Discussion mailing list -- [email protected]
To unsubscribe send an email to [email protected]
https://mail.python.org/mailman3//lists/numpy-discussion.python.org
Member address: [email protected]
