Hi,

On Fri, Feb 13, 2026 at 5:03 PM Charles R Harris via NumPy-Discussion
<[email protected]> wrote:
>
>
>
> On Fri, Feb 13, 2026 at 7:08 AM Ilhan Polat via NumPy-Discussion 
> <[email protected]> wrote:
>>
>> Also I'd like to be on record with the unpleasant part out loud. I have been 
>> in many discussions, both at work and in OSS circles, so I have quite a bit of 
>> debate ammo accumulated from both sides. Let me jump into it without the 
>> fluff, to save space:
>>
>> Currently, LLMs are getting really good at what they are tasked to do. If 
>> you put in the work (just like you would when you are the one writing the 
>> code), the output is quite acceptable, and I feel like I'm reviewing somebody 
>> else's pull request: fix "this" part, change "that" part, and done. If folks 
>> can't use these tools, it's a "they" problem. I just used one to translate 
>> the entire LAPACK to C11 (why? mostly for the lolz, don't ask, it's a disease), 
>> ported all the tests, which are passing, and am now polishing it up. I mean, 
>> look at this silly thing
>>
>> [screenshot of the translated code omitted in this archive]
>>
>> No way in hell would I type this much code myself. And it is a 1-to-1 
>> mechanical translation, no creativity involved, except hacking into the PyData 
>> theme because I always wanted to tweak it. Now who owns the copyright: 
>> Dennis Ritchie, the LAPACK folks, the entire C codebase of the world 
>> that trained this machine to write this mechanical code, or me, who 
>> paid for it and worked with it? The source of the algorithm is BSD-3; 
>> would you be using this if it were available as BSD-3? (I mean, it obviously 
>> will be, very soon.)
>>
>> As a comparison, the entire SciPy Fortran codebase, ~85,000 SLOC, took me 2 
>> years and 7 months to translate manually. The entire LAPACK codebase, 300,000 
>> SLOC (just the functions), including the testing, documentation, etc., took 
>> me exactly 1 month and 19 days (some Claude Pro MAX-level subscription 
>> at ~200€ per month, from my own pocket). The agent still fails 
>> spectacularly if you let it run free, but I do put in the work to do a 
>> proper code review, tweak rules, then force it to read the rules 
>> periodically (and, most importantly, I know what I am looking at), so this 
>> went fairly well. It still took an insane amount of time to bring the agent 
>> back on track: force explicit testing, don't use C++ practices in C code, and 
>> so on.
>>
>> At this point, I can confirm that "agents can do this much but they cannot 
>> do that much" is rapidly becoming a "God of the Gaps" argument: with every 
>> new version release, LLMs chase a receding horizon, not towards 
>> intelligence, but towards precision at parsing and following orders.
>>
>> However, in my opinion, our dilemma is not whether their output is 
>> potentially GPL'd/copyrighted code or not. Every bit of output of these 
>> tools is stolen, by virtue of being trained on copyrighted data. For the folks 
>> who did not see it, there is a screenshot of VS Code offering me a comment at 
>> the beginning of a file, attributed to a company that apparently has no 
>> public repositories: 
>> https://discuss.scientific-python.org/t/a-policy-on-generative-ai-assisted-contributions/1702/5
>>
>> Therefore, we are, in fact, trying to guess whether it looks like 
>> copyrighted code after the fact, ignoring where the code is pulled from. 
>> These companies pretty much stole everything: music, science articles, code 
>> (not just GPL'd code, but private repositories), this, that, everything. 
>> Their practices were, and are, seriously unethical. That is not a political 
>> statement but fact. However, it seems like they are getting away with it, 
>> incredibly, even after they admitted it multiple times, all the way up to the 
>> CEO level (in particular, recently, the SUNO CEO has been pretty bullish, even 
>> defending why this stealing is fair use, while individuals are rapidly being 
>> prosecuted for the same actions, not to mention Sci-Hub). And some of us are 
>> working for these companies, or in their secondary circles.
>>
>> Funnily enough, we are now handed the mordant task of trying to come up 
>> with a stance on LLM usage. I claim that we should not be spending too much 
>> time on the epistemological aspects of LLM usage; I can't see any way other 
>> than being utilitarian about it, because PRs keep coming and maintainers are 
>> also using these tools. So, stuck between a rock and a hardware, I think we 
>> should admit this properly and then choose a path knowingly, fully 
>> aware that we might be making a mistake. Being open about the fact that we 
>> are going into this blind probably makes more sense than some 
>> serious-sounding, untested, unvetted legal text and checkboxes. Because 
>> really, nobody knows when we will correct course, if ever.
>>
>> So we can:
>>
>> 1- "Stallman" it, with a "no AI allowed" stance, while having absolutely no 
>> way of knowing how the code was generated. So it is a stance based on 
>> principles. I don't have a problem with it, and can accept it. It is a 
>> viable and respectable choice. The downside is that we will be forcing people 
>> to lie, because they will use these tools and we will not notice until it is 
>> very late.
>> 2- or find a sentence that is pragmatic enough; something like
>>     "Even if you used LLMs, you should be able to explain the changes 
>> yourself. LLM-based PRs are held to heightened levels of scrutiny and lower 
>> levels of patience" or something else offered in this thread.
>>     I can also accept this; it is also a viable option. The downside is that 
>> it will make us more hostile, as Sebastian mentioned, and paranoid. 
>> Occasionally, it will make us accuse innocent folks of using LLMs.
>>
>> Once we choose, then we can add agent markdowns, boilerplate 
>> responses, and other details. But it seems like we got stuck at this choice 
>> in our last attempts at policy alignment. I would be much happier 
>> if we could be a bit more explicit and forthcoming about what we are doing, 
>> and not make it an in vitro Open Source problem. We don't need to use strong 
>> words like "stealing", obviously, since there is no legal basis for it. But 
>> we all know what happened, so there are much softer ways of saying the 
>> same thing. I just did not spend the time to make these proper, à la Pascal, 
>> and it's my lack of manners leaking out, though I strongly believe that they 
>> stole everything.
>>
>> I am fully aware that this might not be everyone's take (or anyone's, for 
>> that matter), so please read it as a rather brazen take, though I hope the 
>> message gets across.
>>
>> Very weird times indeed.
>>
>>
>> ilhan
>
>
> I suspect there will be changes in the understanding/use of "copyright." What 
> they will be, I don't know, but copyright itself is fairly recent. It is also 
> the case that thirty years ago you could buy cheap, unlicensed versions of 
> most software in Hong Kong, and copyrighted texts have been produced in cheap 
> versions in some parts of the world, so these sorts of problems are not a 
> completely new experience.
>
> Back in the late 1800s to early 1900s, there were patent fights in the 
> Federal Courts involving electric lights, telephones, and aviation. But 
> wartime need prevailed: "The disputes contributed to a 1917 
> government-brokered patent pool during WWI to end litigation and support 
> aircraft production." Copyright was also suspended for German texts in WWII, 
> I have some republished works on my shelves.
>
> The use of AI will soon become a national interest, if it isn't already. We 
> are small players in a much bigger event.

Yes, that's right.  The way I've heard it discussed, by David Sacks,
Trump's "AI and crypto czar"
(https://en.wikipedia.org/wiki/David_Sacks), is roughly that if we (the
USA) don't make it possible for AI models to digest and possibly
reproduce copyrighted material, the Chinese will, and then the USA will
lose the "AI race", which would be bad.

So it might well be that the current administration tries to undermine
copyright for that reason, and I suppose they will do that by making
copyright hard to enforce legally.  But that doesn't require us to void
copyright; as I keep saying, it's an ethical issue more than a legal
one.  We can still choose to respect the wishes of the author, even if
(for example) the USA has made it impossible to enforce those wishes
legally.

Cheers,

Matthew
_______________________________________________
NumPy-Discussion mailing list -- [email protected]
To unsubscribe send an email to [email protected]
https://mail.python.org/mailman3//lists/numpy-discussion.python.org
Member address: [email protected]
