Hi Ilhan,

On Fri, Feb 13, 2026 at 2:15 PM Ilhan Polat via NumPy-Discussion
<[email protected]> wrote:
>
> Also I'd like to be on record with the unpleasant part out loud. I have been 
> in many discussions also at work and in OSS circles so I have quite a bit of 
> debate ammo accumulated from both sides. Let me jump into it without the 
> fluff to save space;
>
> Currently, LLMs are getting really good at what they are tasked to do. If you 
> put in the work (just like you would when you are the one writing the code), 
> the output is quite acceptable and I feel like I'm reviewing somebody else's 
> Pull request. Fix "this" part, change "that" part and done. If folks can't 
> use these tools, it's a "they" problem.

I'm not sure what you are saying there - could you clarify?

> I just used it to translate the entire LAPACK to C11 (why? mostly for the lolz, 
> don't ask, it's a disease), ported all the tests (and they pass), and am now 
> polishing it up. I mean, look at this silly thing
>
> [screenshot not reproduced in the plain-text archive]
>
> No way in hell I'd type this much code myself. And it is a 1-to-1 mechanical 
> translation, no creativity involved except hacking into the PyData theme because 
> I always wanted to tweak it. Now who owns the copyright: Dennis Ritchie, the 
> LAPACK folks, the entire C codebase of the world that trained this machine to 
> write this mechanical code, or me, who paid for it and worked with it? The 
> source of the algorithm is BSD-3; would you be using this if it were available 
> under BSD-3 (it obviously will be, very soon)?
>
> As a comparison, the entire SciPy Fortran codebase, ~85,000 SLOC, took me 2 
> years and 7 months to translate manually. The entire LAPACK codebase, 300,000 
> SLOC (just the functions), including the testing, documentation etc., took me 
> exactly 1 month and 19 days (Claude Pro something-MAX-level subscription at 
> ~200€ per month from my own pocket). The agent still fails spectacularly if 
> you let it run free, but I do put in the work to do a proper code review, 
> tweak rules, then force it to re-read the rules periodically (and, most 
> importantly, I know what I am looking at), so this went fairly well. It still 
> took an insane amount of time to bring the agent back on track: force explicit 
> testing, not use C++ practices on C code, and so on.
>
> At this point, I can confirm that "Agents can do this much, but they cannot do 
> that much" is rapidly becoming a "God of the Gaps" argument: with every new 
> version release, LLMs are chasing a receding horizon, not towards intelligence, 
> but towards precision at parsing and following orders.

I just wanted to clarify that I don't think the argument is about what
agents can and cannot do.   I think everyone believes they can be very
useful.

I should also say that your experience is very useful for the
discussion - but it is somewhat specialized.   I can well see that the
AI agent could be a huge boon for this sort of semi-mechanical task,
but there aren't many such tasks in the code that I'm working on.
And - to return to my suggestion - I would argue here that your task,
as the PR author, is to say "I went through the ported code very
carefully, comparing to the original, and I am confident that the
translation is a faithful language to language translation, from the
original BSD code, and there is no significant injection of other code
that may be subject to copyright.   The closest example I could find
was X, but a quick search for terms Y and Z found no plausible
copyrighted origins."

> However, in my opinion, our dilemma is not whether their output is 
> potentially GPL'd/copyrighted code or not. Every bit of output of these tools 
> is stolen by being trained on copyrighted data. For the folks who did not see 
> it, here is a screenshot of VS Code offering me a comment at the beginning 
> of a file from a company that apparently does not have any public 
> repositories: 
> https://discuss.scientific-python.org/t/a-policy-on-generative-ai-assisted-contributions/1702/5
>
> Therefore, we are, in fact, trying to guess whether it looks like 
> copyrighted code after the fact, ignoring where the code was pulled from. 
> These companies pretty much stole everything: music, science articles, code 
> (not just GPL'd code, but private repositories), this, that, everything. Their 
> practices were/are seriously unethical. That is not a political statement but 
> fact. However, it seems like they are getting away with it, incredibly, even 
> after they admitted it multiple times all the way up at the CEO level (in 
> particular, recently, the SUNO CEO is pretty bullish, even defending why this 
> stealing is fair use, while individuals are rapidly being prosecuted for the 
> same actions, not to mention Sci-Hub). And some of us are working for these 
> companies, or in their secondary circles.

Right - and one conclusion we could draw is - OK, if (some idea of)
everyone is doing it, we should be doing it too.   But I'm sure you'd
agree that's not a very convincing argument.

> Funnily enough, we are tasked with the mordant job of trying to come up 
> with a stance on LLM usage. I claim that we should not be spending too much 
> time on the epistemological aspects of LLM usage. I can't see any way other 
> than being utilitarian about it, because PRs keep coming and maintainers are 
> also using it. So, stuck between a rock and a hardware, I think we should 
> admit these things properly and then choose a path, fully aware that 
> we might be making a mistake. Being open about the fact that we are going 
> into this blind probably makes more sense than some serious-sounding, 
> untested, unvetted legal text and checkboxes. Because really nobody knows when 
> we will correct course, if ever.

There may be such a legal text - but I don't think that's what I was
proposing.  Again, this isn't about enforcement - it's about ethics -
as it always has been.   We stated that we didn't accept GPL code, or
code derived from GPL, and we took our contributors' word that they had
taken our request seriously.

It really doesn't seem sensible to choose a policy that is obviously
dangerous for copyright, and wait until it becomes obvious that we
have damaged copyright.   Rather it seems more sensible to choose an
option that is less dangerous for copyright, and wait to see, as the
tools develop, whether we need to re-evaluate.  It really doesn't seem
likely to me that the policy would stay in place long after it was
causing the project harm.

> So we can
>
> 1- "Stallman" it, with a "no AI allowed" stance, while having absolutely no way 
> of knowing how the code was generated. So it is a stance based on principles. 
> I don't have a problem with it and can accept it. It is a viable and 
> respectable choice. The downside is that we will be forcing people to lie, 
> because they will use it and we will not notice until it is very late.

I just don't think this is true - I strongly suspect that people who
are attracted to open-source, and the open-source community, will not
generally lie about how they made their contributions - any more than
we have seen attempts to put GPL code into our BSD codebases.

Bear in mind that the "until it is very late" problem is the one that
will happen much more quickly with a more permissive policy.

> 2- or find a sentence that is pragmatic enough; something like
>     "Even if you used LLMs, you should be able to explain the changes 
> yourself. LLM-based PRs are held to heightened levels of scrutiny and lower 
> levels of patience", or something else offered in this thread.
>     I can also accept this; it is also a viable option. The downside is that 
> it will make us more hostile, as Sebastian mentioned, and paranoid. 
> Occasionally, it will make us accuse innocent folks of using LLMs.

Could you comment on the option that I was proposing - which is that
anyone generating code with AI should justify the copyright risk, with
relevant research as necessary?

> Once we choose this, then we can add agent markdowns, boilerplate 
> responses and other details. But it seems like we got stuck at this choice 
> level in our last attempts at policy alignment. I would be much happier if 
> we could be a bit more explicit and forthcoming about what we are doing and not 
> make it an in vitro Open Source problem. We obviously don't need to use strong 
> words like "stealing" etc., since there is no legal basis for it.

I didn't use those words - but in any case - as I've said several
times, in several places, the legal argument is more or less
irrelevant to us - our question is whether we are honoring the spirit
of the copyrights put on other people's code, not whether they could,
with sufficient resources, successfully sue us for infringement.

Cheers,

Matthew
_______________________________________________
NumPy-Discussion mailing list -- [email protected]
To unsubscribe send an email to [email protected]
https://mail.python.org/mailman3//lists/numpy-discussion.python.org
Member address: [email protected]
