[Numpy-discussion] Re: Current policy on AI-generated code in NumPy

Robert Kern via NumPy-Discussion Wed, 18 Feb 2026 17:04:24 -0800

On Wed, Feb 18, 2026 at 7:03 PM Matthew Brett <[email protected]>
wrote:


> Hi,
>
> On Wed, Feb 18, 2026 at 10:33 PM Robert Kern via NumPy-Discussion
> <[email protected]> wrote:
> >
> > On Wed, Feb 18, 2026 at 9:16 AM Matthew Brett <[email protected]> wrote:
> >>
> >> One way of doing that - is to ask some AI (if possible, an AI other
> >> than the one generating the code) to review for copyright.  I've
> >> experimented with that over at
> >> https://github.com/numpy/numpy/pull/30828#issuecomment-3920553882 .
> >> But the idea would be that we ask a contributor who has generated code
> >> by AI, to do this as part of the PR sign-off.   They should be in a
> >> much better position to do this than the maintainers, as they should
> >> have been exploring the problem themselves, and therefore should be
> >> able to write better queries to guide the AI review.   And with the
> >> prompts as a start, it's not particularly time-consuming.
> >
> > I think all of the arguments it produced are not grounded in the
> > principles of copyright law. Unfortunately, I think this is one of
> > the areas where LLMs just generate plausible nonsense rather than
> > sound legal analysis. Each thing that it noted was a one-liner or a
> > general idea, nothing copyrightable. It essentially writes like a
> > median StackOverflow programmer with a dim understanding of
> > copyright law (no slight intended to anyone; I am one). I've looked
> > at the two files it suggested, and I see no similarity to the PR.
> >
> > I do kind of suspect that LLMs could be used, with care, to help
> > facilitate the abstraction-filtration-comparison test and maybe to
> > find candidates to do that test on, but a general instruction to
> > give arguments for copyright violation apparently yields more chaff
> > to wade through.
>
> Yes, sure - and you can see me trying to negotiate with Gemini on
> related points in an earlier session here:
>
> https://gist.github.com/matthew-brett/fac33f1b41d98e51b842f8bb84e8c66b
>
> My point was not that AI is doing a good job here - it isn't - but to
> offer it as a starting point for further research for the PR author,
> and reflection for those of us thinking about copyright and AI, on
> what a better process might look like.
>

IMO, it's definitely not a good starting point for the PR author. It
doesn't matter where it places you as a starting point if it points you in
the wrong direction. You are asking the PR author to defend against
incorrect statements of fact and law.

I think *some* kind of code search or plagiarism detection service might be
helpful in identifying possible original sources to compare with the
generated output. It's not at all clear that asking the LLM as an oracle
actually enacts such a search. It plainly did not here, but it presented
its work as such.

I don't think it's a good policy to construct an ad hoc plagiarism
detection service without validating how it actually performs. I really
strongly suggest that you retract your PR comment. It would be one thing to
try it out and post here about what you found, but to interact with a
contributor that way as an experiment is... ill-advised.

-- 
Robert Kern

