[Numpy-discussion] Re: Current policy on AI-generated code in NumPy

Matthew Brett via NumPy-Discussion Wed, 18 Feb 2026 16:07:51 -0800

Hi,

On Wed, Feb 18, 2026 at 10:33 PM Robert Kern via NumPy-Discussion
<[email protected]> wrote:
>
> On Wed, Feb 18, 2026 at 9:16 AM Matthew Brett <[email protected]> wrote:
>>
>> Hi,
>>
>> On Sat, Feb 14, 2026 at 5:38 PM Robert Kern <[email protected]> wrote:
>> >
>> > On Sat, Feb 14, 2026 at 12:17 PM Matthew Brett <[email protected]> 
>> > wrote:
>> >>
>> >> Hi,
>> >>
>> >> On Fri, Feb 13, 2026 at 9:45 PM Robert Kern <[email protected]> wrote:
>> >> >
>> >> > On Wed, Feb 11, 2026 at 6:26 PM Matthew Brett via NumPy-Discussion 
>> >> > <[email protected]> wrote:
>> >> >>
>> >> >>
>> >> >> Just to clarify - in case it wasn't clear, what I'm floating as a 
>> >> >> proposal, would be something like this, as a message to PR authors:
>> >> >>
>> >> >> Please specify one of these:
>> >> >>
>> >> >> 1) I wrote this code myself, without looking at significant 
>> >> >> AI-generated code OR
>> >> >> 2) The code contains AI-generated content, but the AI-generated code 
>> >> >> is sufficiently trivial that it cannot reasonably be subject to 
>> >> >> copyright OR
>> >> >> 3) There is non-trivial AI-generated code in this PR, and I have 
>> >> >> documented my searches to confirm that no parts of the code are 
>> >> >> subject to existing copyright.
>> >> >>
>> >> >> So - the burden for the reviewer is just to confirm, in case 3, that 
>> >> >> the author has documented their searches.   We take the word of the 
>> >> >> contributor for the option they have chosen.   Obviously, the 
>> >> >> documentation requirement of case 3 is somewhat of a burden for the 
>> >> >> contributor, and may therefore encourage them to write the code 
>> >> >> themselves, to avoid that burden.  That might not be a bad thing, long 
>> >> >> term, for the project, and it seems reasonable to me as some defence 
>> >> >> against copyright violation, and a message that the project cares 
>> >> >> about such violation.
>> >> >
>> >> >
>> >> > For Case 3, I would love to see an example of the search that you would 
>> >> > accept. If you could take a recent PR (human or AI, doesn't really 
>> >> > matter for this purpose), and show the search that would satisfy you, 
>> >> > that would go a long way towards clarifying what you are asking for 
>> >> > here. We'd need a worked example or two before adopting this policy 
>> >> > because if I don't know what you are asking for, no new contributor 
>> >> > will, either.
>> >>
>> >> Yes, that's a reasonable request.   But how do you think I should
>> >> proceed?   Make an issue on Numpy, and start drafting?   Start another
>> >> email thread?  Or a Discourse / Scientific Python thread?
>> >
>> >
>> > Just here should be fine. Take an existing PR that has copyrightable 
>> > content (e.g. an entire new function or three, each more than ~10 lines, 
>> > not just many one-line updates scattered around; the most interesting ones 
>> > would be those that implement a known algorithm). Do the code search that 
>> > would satisfy you. Write out here what you would want a PR author to 
>> > provide.
>>
>> I'd suggested (off-list) that this might be better done in another
>> thread - but perhaps it can be done here.
>>
>> Reflecting, and experimenting - there are many caveats, but I think it
>> is reasonable to give the contributor some responsibility for formal
>> care about copyright.
>>
>> One way of doing that - is to ask some AI (if possible, an AI other
>> than the one generating the code) to review for copyright.  I've
>> experimented with that over at
>> https://github.com/numpy/numpy/pull/30828#issuecomment-3920553882 .
>> But the idea would be that we ask a contributor who has generated code
>> by AI, to do this as part of the PR sign-off.   They should be in a
>> much better position to do this than the maintainers, as they should
>> have been exploring the problem themselves, and therefore should be
>> able to write better queries to guide the AI review.   And with the
>> prompts as a start, it's not particularly time-consuming.
>
>
> I think all of the arguments it produced are not grounded in the principles 
> of copyright law. Unfortunately, I think this is one of the areas where LLMs 
> just generate plausible nonsense rather than sound legal analysis. Each thing 
> that it noted was a one-liner or a general idea, nothing copyrightable. It 
> essentially writes like a median StackOverflow programmer with a dim 
> understanding of copyright law (no slight intended to anyone; I am one). I've 
> looked at the two files it suggested, and I see no similarity to the PR.
>
> I do kind of suspect that LLMs could be used, with care, to help facilitate 
> the abstraction-filtration-comparison test and maybe finding candidates to do 
> that test on, but a general instruction to give arguments for copyright 
> violation apparently yields more chaff to wade through.


Yes, sure - and you can see me trying to negotiate with Gemini on
related points in an earlier session here:

https://gist.github.com/matthew-brett/fac33f1b41d98e51b842f8bb84e8c66b

My point was not that AI is doing a good job here - it isn't - but to
offer it as a starting point for further research for the PR author,
and reflection for those of us thinking about copyright and AI, on
what a better process might look like.

Cheers,

Matthew