On Fri, Feb 13, 2026 at 11:16 AM Matthew Brett via NumPy-Discussion <[email protected]> wrote:
> Hi,
>
> On Fri, Feb 13, 2026 at 5:03 PM Charles R Harris via NumPy-Discussion <[email protected]> wrote:
> >
> > On Fri, Feb 13, 2026 at 7:08 AM Ilhan Polat via NumPy-Discussion <[email protected]> wrote:
> >>
> >> Also, I'd like to be on record with the unpleasant part out loud. I have been in many discussions, at work and in OSS circles, so I have quite a bit of debate ammo accumulated from both sides. Let me jump into it without the fluff, to save space.
> >>
> >> Currently, LLMs are getting really good at what they are tasked to do. If you put in the work (just like you would when you are the one writing the code), the output is quite acceptable, and I feel like I'm reviewing somebody else's pull request. Fix "this" part, change "that" part, and done. If folks can't use these tools, it's a "they" problem. I just used one to translate the entire LAPACK codebase to C11 (why? mostly for the lolz, don't ask, it's a disease), ported all the tests, and they pass; now I'm polishing it up. I mean, look at this silly thing.
> >>
> >> No way in hell would I type this much code myself. And it is a 1-to-1 mechanical translation, no creativity involved, except hacking into the PyData theme because I always wanted to tweak it. Now who owns the copyright: Dennis Ritchie, the LAPACK folks, the entire C codebase of the world that trained this machine to write this mechanical code, or me, who paid for it and worked with it? The source of the algorithm is BSD-3. Would you use this if it were available under BSD-3 (which it obviously will be, very soon)?
> >>
> >> As a comparison, the entire SciPy Fortran codebase, ~85,000 SLOC, took me 2 years and 7 months to translate manually. The entire LAPACK codebase, 300,000 SLOC (just the functions), including the testing, documentation, etc., took me exactly 1 month and 19 days (a Claude Pro-something-MAX-level subscription at ~200€ per month from my own pocket). The agent still fails spectacularly if you let it run free, but I do put in the work to do a proper code review, tweak rules, and force it to re-read the rules periodically (and, most importantly, I know what I am looking at), so this went fairly well. It still took an insane amount of time to bring the agent back on track: forcing explicit testing, not using C++ practices in C code, and so on.
> >>
> >> At this point, I can confirm that "agents can do this much, but they cannot do that much" is rapidly becoming a "God of the Gaps" argument: with every new version release, LLMs chase a receding horizon, not towards intelligence, but towards precision in parsing and following orders.
> >>
> >> However, in my opinion, our dilemma is not whether their output is potentially GPL'd/copyrighted code. Every bit of output of these tools is stolen, by virtue of being trained on copyrighted data. For the folks who did not see it, there is a screenshot of VS Code offering me a comment at the beginning of a file from a company that apparently does not have any public repositories:
> >> https://discuss.scientific-python.org/t/a-policy-on-generative-ai-assisted-contributions/1702/5
> >>
> >> Therefore, we are, in fact, trying to guess whether something looks like copyrighted code after the fact, ignoring where the code was pulled from. These companies pretty much stole everything: music, science articles, code (not just GPL'd code, but private repositories), this, that, everything. Their practices were/are seriously unethical.
> >> It is not a political statement but fact. However, it seems like they are getting away with it, incredibly, even after they admitted it multiple times, all the way up to the CEO level (in particular, the SUNO CEO has recently been pretty bullish, even defending why this stealing is fair use, while individuals are rapidly prosecuted for the same actions, not to mention Sci-Hub). And some of us are working for these companies, or in their secondary circles.
> >>
> >> Funnily enough, we are left with the mordant task of trying to come up with a stance on LLM usage. I claim that we should not be spending too much time on the epistemological aspects of LLM usage. I can't see any way other than being utilitarian about it, because PRs keep coming and maintainers are also using it. So, stuck between a rock and a hard place, I think we should admit these things properly and then choose a path knowingly, fully aware that we might be making a mistake. Being open about the fact that we are going into this blind probably makes more sense than some serious-sounding, untested, unvetted legal text and checkboxes. Because really, nobody knows when we will correct course, if ever.
> >>
> >> So we can:
> >>
> >> 1- "Stallman" it, with a "no AI allowed" stance, while having absolutely no way of knowing how the code was generated. This is a stance based on principles. I don't have a problem with it and can accept it; it is a viable and respectable choice. The downside is that we will be forcing people to lie, because they will use it and we will not notice until it is very late.
> >> 2- Or find a sentence that is pragmatic enough, something like "Even if you used LLMs, you should be able to explain the changes yourself. LLM-based PRs are held to heightened levels of scrutiny and lower levels of patience", or something else offered in this thread. I can also accept this; it is also a viable option. The downside is that it will make us more hostile, as Sebastian mentioned, and paranoid. Occasionally, it will make us accuse innocent folks of using LLMs.
> >>
> >> Once we make this choice, we can add agent markdowns, boilerplate responses, and other details. But it seems like we got stuck at this choice level in our last attempts at policy alignment. I would be much happier if we could be a bit more explicit and forthcoming about what we are doing, and not make it an in vitro Open Source problem. We don't need to use strong words like "stealing", obviously, since there is no legal basis for it. But we all know what happened, so there are much softer versions of saying the same thing. I just did not spend the time to make these proper, à la Pascal, and it's my lack of manners leaking out, though I strongly believe that they stole everything.
> >>
> >> I am fully aware that this might not be everyone's take (or anyone's, for that matter), so please take it as a rather brazen take, though I hope the message gets across.
> >>
> >> Very weird times indeed.
> >>
> >> ilhan
> >
> > I suspect there will be changes in the understanding/use of "copyright." What they will be, I don't know, but copyright itself is fairly recent. It is also the case that thirty years ago you could buy cheap, unlicensed versions of most software in Hong Kong, and copyrighted texts have been produced in cheap editions in some parts of the world, so these sorts of problems are not a completely new experience.
> >
> > Back in the late 1800s to early 1900s, there were patent fights in the Federal Courts involving electric lights, telephones, and aviation. But wartime need prevailed: "The disputes contributed to a 1917 government-brokered patent pool during WWI to end litigation and support aircraft production." Copyright was also suspended for German texts in WWII; I have some republished works on my shelves.
> >
> > The use of AI will soon become a national interest, if it isn't already. We are small players in a much bigger event.
>
> Yes, that's right. The way I've heard it discussed, by David Sacks, Trump's "AI and crypto tsar" (https://en.wikipedia.org/wiki/David_Sacks), is roughly that if we (the USA) don't make it possible for AI models to digest and possibly reproduce copyrighted material, the Chinese will, and then the USA will lose the "AI race", which would be bad.
>
> So it might well be that the current administration tries to undermine copyright for that reason. And I suppose they will do that by making copyright hard to enforce legally. But that doesn't require us to void copyright - as I keep saying, it's an ethical issue more than a legal one. We can still choose to respect the wishes of the author, even if (for example) the USA has made it impossible to enforce those wishes legally.

Copyright has been adjusted many times, most recently for such things as photocopiers and home recording. My guess is that a combination of pooling and fair use will be the solution, possibly with an opt-out option. The current situation in the US is that AI regulation has been moved from the states to the federal government.

Code should be free! OK, it's free. Wait, what! No, not like that.

Chuck
_______________________________________________
NumPy-Discussion mailing list -- [email protected]
To unsubscribe send an email to [email protected]
https://mail.python.org/mailman3//lists/numpy-discussion.python.org
Member address: [email protected]
