Hi,

On Thu, Feb 19, 2026 at 2:48 AM Charles R Harris via NumPy-Discussion
<[email protected]> wrote:
>
>
>
> On Wed, Feb 18, 2026 at 6:04 PM Robert Kern via NumPy-Discussion 
> <[email protected]> wrote:
>>
>> On Wed, Feb 18, 2026 at 7:03 PM Matthew Brett <[email protected]> 
>> wrote:
>>>
>>> Hi,
>>>
>>> On Wed, Feb 18, 2026 at 10:33 PM Robert Kern via NumPy-Discussion
>>> <[email protected]> wrote:
>>> >
>>> > On Wed, Feb 18, 2026 at 9:16 AM Matthew Brett <[email protected]> 
>>> > wrote:
>>> >>
>>> >> One way of doing that is to ask some AI (if possible, an AI other
>>> >> than the one generating the code) to review for copyright.  I've
>>> >> experimented with that over at
>>> >> https://github.com/numpy/numpy/pull/30828#issuecomment-3920553882 .
>>> >> But the idea would be that we ask a contributor who has generated code
>>> >> by AI to do this as part of the PR sign-off.   They should be in a
>>> >> much better position to do this than the maintainers, as they should
>>> >> have been exploring the problem themselves, and therefore should be
>>> >> able to write better queries to guide the AI review.   And with the
>>> >> prompts as a start, it's not particularly time-consuming.
>>> >
>>> > I think all of the arguments it produced are not grounded in the 
>>> > principles of copyright law. Unfortunately, I think this is one of the 
>>> > areas where LLMs just generate plausible nonsense rather than sound legal 
>>> > analysis. Each thing that it noted was a one-liner or a general idea, 
>>> > nothing copyrightable. It essentially writes like a median 
>>> > StackOverflow programmer with a dim understanding of copyright law (no 
>>> > slight intended to anyone; I am one). I've looked at the two files it 
>>> > suggested, and I see no similarity to the PR.
>>> >
>>> > I do kind of suspect that LLMs could be used, with care, to help 
>>> > facilitate the abstraction-filtration-comparison test and maybe finding 
>>> > candidates to do that test on, but a general instruction to give 
>>> > arguments for copyright violation apparently yields more chaff to wade 
>>> > through.
>>>
>>> Yes, sure - and you can see me trying to negotiate with Gemini on
>>> related points in an earlier session here:
>>>
>>> https://gist.github.com/matthew-brett/fac33f1b41d98e51b842f8bb84e8c66b
>>>
>>> My point was not that AI is doing a good job here - it isn't - but
>>> to offer it as a starting point: further research for the PR author,
>>> and, for those of us thinking about copyright and AI, reflection on
>>> what a better process might look like.
>>
>>
>> IMO, it's definitely not a good starting point for the PR author. It doesn't 
>> matter where it places you as a starting point if it points you in the wrong 
>> direction. You are asking the PR author to defend against incorrect 
>> statements of fact and law.
>>
>> I think *some* kind of code search or plagiarism detection service might be 
>> helpful in identifying possible original sources to compare with the 
>> generated output. It's not at all clear that asking the LLM as an oracle 
>> actually enacts such a search. It plainly did not here, but it presented its 
>> work as such.
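
[For concreteness: a deliberately naive sketch of the "compare" step
described above, using nothing but difflib on two files.  The function
name and the five-line threshold are illustrative choices of mine, and
a real service would also need to search an indexed corpus to find the
candidate sources in the first place.

    import difflib

    def long_matches(candidate: str, generated: str, min_lines: int = 5):
        """Yield runs of min_lines or more identical lines shared
        between a candidate source file and the generated code."""
        a, b = candidate.splitlines(), generated.splitlines()
        matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
        for m in matcher.get_matching_blocks():
            if m.size >= min_lines:
                yield a[m.a:m.a + m.size]

Something like this would only catch near-verbatim copying - it says
nothing about the abstraction-filtration-comparison question.]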
>>
>> I don't think it's a good policy to construct an ad hoc plagiarism detection 
>> service without validating how it actually performs. I really strongly 
>> suggest that you retract your PR comment. It would be one thing to try it 
>> out and post here about what you found, but to interact with a contributor 
>> that way as an experiment is... ill-advised.
>>
>
> +1. The interaction on that PR as a whole struck me as harsh, verging on rude.

You surely don't mean that it is harsh or rude to post the AI summary,
along with:  "Obviously - as designed - this is deliberately Red Team.
But @mdrdope - no pressure, and feel free not to answer - do you have
any response to the Gemini comments?"

That's one of the advantages of asking the contributor themselves
to do the review - it makes it less likely that they will take
offense to the output of the AI.  Anyone using AI will know that it
will frequently be wrong, and it will be more obvious to them that the
AI output is not a judgment, but may serve as a starting point for
reflection and investigation.   For example, it may draw the author,
and the maintainers, into a more thoughtful and informed discussion of
copyright.

But perhaps you mean that the AI, and some of the other comments,
implied that the PR was largely AI-generated, and that this was rude?
And - I think this is what you are saying - that you don't think it
matters whether the PR was or was not AI-generated, and that trying to
establish the extent of AI use is therefore harsh or rude?

Cheers,

Matthew