[Numpy-discussion] Re: Current policy on AI-generated code in NumPy

Robert Kern via NumPy-Discussion Wed, 18 Feb 2026 14:33:48 -0800

On Wed, Feb 18, 2026 at 9:16 AM Matthew Brett <[email protected]>
wrote:


> Hi,
>
> On Sat, Feb 14, 2026 at 5:38 PM Robert Kern <[email protected]> wrote:
> >
> > On Sat, Feb 14, 2026 at 12:17 PM Matthew Brett <[email protected]>
> wrote:
> >>
> >> Hi,
> >>
> >> On Fri, Feb 13, 2026 at 9:45 PM Robert Kern <[email protected]>
> wrote:
> >> >
> >> > On Wed, Feb 11, 2026 at 6:26 PM Matthew Brett via NumPy-Discussion <
> [email protected]> wrote:
> >> >>
> >> >>
> >> >> Just to clarify - in case it wasn't clear, what I'm floating as a
> proposal, would be something like this, as a message to PR authors:
> >> >>
> >> >> Please specify one of these:
> >> >>
> >> >> 1) I wrote this code myself, without looking at significant
> AI-generated code OR
> >> >> 2) The code contains AI-generated content, but the AI-generated code
> is sufficiently trivial that it cannot reasonably be subject to copyright OR
> >> >> 3) There is non-trivial AI-generated code in this PR, and I have
> documented my searches to confirm that no parts of the code are subject to
> existing copyright.
> >> >>
> >> >> So - the burden for the reviewer is just to confirm, in case 3, that
> the author has documented their searches.   We take the word of the
> contributor for the option they have chosen.   Obviously, the documentation
> requirement of case 3 is somewhat of a burden for the contributor, and may
> therefore encourage them to write the code themselves, to avoid that
> burden.  That might not be a bad thing, long term, for the project, and it
> seems reasonable to me as some defence against copyright violation, and a
> message that the project cares about such violation.
> >> >
> >> >
> >> > For Case 3, I would love to see an example of the search that you
> would accept. If you could take a recent PR (human or AI, doesn't really
> matter for this purpose), and show the search that would satisfy you, that
> would go a long way towards clarifying what you are asking for here. We'd
> need a worked example or two before adopting this policy because if I don't
> know what you are asking for, no new contributor will, either.
> >>
> >> Yes, that's a reasonable request.   But how do you think I should
> >> proceed?   Make an issue on Numpy, and start drafting?   Start another
> >> email thread?  Or a Discourse / Scientific Python thread?
> >
> >
> > Just here should be fine. Take an existing PR that has copyrightable
> content (e.g. an entire new function or three, each more than ~10 lines,
> not just many one-line updates scattered around; the most interesting ones
> would be those that implement a known algorithm). Do the code search that
> would satisfy you. Write out here what you would want a PR author to
> provide.
>
> I'd suggested (off-list) that this might be better done in another
> thread - but perhaps it can be done here.
>
> Reflecting, and experimenting - there are many caveats, but I think it
> is reasonable to give the contributor some responsibility for formal
> care about copyright.
>
> One way of doing that - is to ask some AI (if possible, an AI other
> than the one generating the code) to review for copyright.  I've
> experimented with that over at
> https://github.com/numpy/numpy/pull/30828#issuecomment-3920553882 .
> But the idea would be that we ask a contributor who has generated code
> by AI, to do this as part of the PR sign-off.   They should be in a
> much better position to do this than the maintainers, as they should
> have been exploring the problem themselves, and therefore should be
> able to write better queries to guide the AI review.   And with the
> prompts as a start, it's not particularly time-consuming.
>

I think none of the arguments it produced are grounded in the principles
of copyright law. Unfortunately, I think this is one of the areas where
LLMs just generate plausible nonsense rather than sound legal analysis.
Each thing that it noted was a one-liner or a general idea, nothing
copyrightable. It essentially writes like a median StackOverflow
programmer with a dim understanding of copyright law (no slight intended to
anyone; I am one). I've looked at the two files it suggested, and I see no
similarity to the PR.

I do kind of suspect that LLMs could be used, with care, to help facilitate
the abstraction-filtration-comparison test
<https://en.wikipedia.org/wiki/Abstraction-Filtration-Comparison_test> and
maybe to find candidates to do that test on, but a general instruction to
give arguments for copyright violation apparently yields more chaff to wade
through.

-- 
Robert Kern

_______________________________________________
NumPy-Discussion mailing list -- [email protected]
To unsubscribe send an email to [email protected]
https://mail.python.org/mailman3//lists/numpy-discussion.python.org
Member address: [email protected]
