Hi,

On Mon, Feb 23, 2026 at 1:52 PM Sebastian Berg
<[email protected]> wrote:
>
> On Mon, 2026-02-23 at 11:59 +0000, Matthew Brett via NumPy-Discussion
> wrote:
> > Hi,
> >
>
> <snip>
>
> >
> > The central point is - who takes responsibility for copyright, in PRs
> > with AI-generated code?   We know that can't be the maintainers - so
> > it can only be the contributors.  This has always been the case, of
> > course, but now we're in a very different situation, where it's very
> > easy to end up with copyright code in the PR without realizing it.
> > And at the moment, by waving at some documentation without further
> > instruction, we can predict from the data we have that the
> > contributor is unlikely to take effective responsibility for
> > copyright.  Therefore, there's a substantial risk of copyright leak,
> > roughly proportional to the lines of non-trivial, non-mechanical code
> > in the PRs.
> >
> > Back to the options - 1) don't worry about it because it's not
> > important, or 2) put guards in place to make sure the contributor has
> > carefully and personally reviewed the PR for copyright concerns, and
> > review often to make sure these are effective or 3) delay allowing
> > large AI-generated blocks of code until we can see a way forward for
> > copyright.
> >
> > It's clear I think that some of us are in the 1) case - don't worry
> > about it.  I'm absolutely not in that camp, but hey.  All I'm saying
> > is - if we're in camp 1, we should make that clear.
>
>
> To me this seems very exaggerated.  Just because we don't put one issue
> (of multiple we have in practice) into the dead center of such a policy
> text (or PR template) doesn't equate to ignoring it?
> Both policies Ralf brought up include a statement about copyright. We
> could still decide to link out from there for curious readers, or
> add guidelines somewhere about when and where we should ask more
> questions.
>
> I believe all I said was that I don't want to overshoot and worry most
> contributors (because yeah, I truly think for the majority of PRs there
> is just not much concern).

Yes - I'm sure you're right that most PRs won't have large,
non-trivial AI code fragments, and for those, it's not a problem.

So for those PRs, I imagine the author would not be deterred by
stronger statements about copyright, as it should be obvious to them
that copyright is unlikely to apply.

However, for PRs that do contain large non-trivial AI code fragments,
it is a problem - and that's what I was referring to specifically.

I'm sure you'll agree that, for those PRs, general statements that
people should be careful about copyright, or take ownership of it,
probably won't be effective in preventing copyright leaks - not
without more specific instruction on how to detect them.

> Maybe we can word-smith something that strikes a good balance there.
> And yeah, we probably disagree where that balance lies.
> But if adding a brief note on copyright concerns in the policy is the
> same as saying "it's not important", I have no idea where to go?!
>
> This discussion tends to feel like we have to start the discussion at
> the end of "it is a gigantic issue" and then maybe be allowed to edge
> towards: OK, but this is the minimal thing that won't scare off all
> contributors.  When yeah, probably most here think it isn't the
> biggest issue, so we should edge towards: Yeah, maybe it is a bit
> bigger of an issue than it seems, so how about we add this sentence
> and link?
>
> But... to me transparency is the thing that is most directly helpful
> (also for social dynamics concerns). So that has to be central.
> Maybe it will be ignored, but what can you reasonably put up that won't
> be? My guess would be that light-weight is better because it decreases
> the chance of lying.

Well - yes - if the author isn't using AI to submit the PR or to help
author the commit message.  If they are, we shouldn't use terms like
lying for the output - it's just whatever the AI came up with.

And transparency doesn't help much with copyright for large,
non-trivial PRs, unless either a) the author has done some work - that
we can't yet specify - to look for copyright violations, or b) the
maintainer notes the AI use and does that copyright work themselves.
At the moment, we don't have much reason to think either of those will
happen.

Cheers,

Matthew
_______________________________________________
NumPy-Discussion mailing list -- [email protected]
To unsubscribe send an email to [email protected]
https://mail.python.org/mailman3//lists/numpy-discussion.python.org
Member address: [email protected]