I don't have any objections. Like I said, not everybody, or perhaps anybody, will
agree with me on this. But a few clarifications; please feel free to ignore
them. I really don't want a debate on a mailing list, but these points are
important for my argument above, in case I did not make it clear.

> And - to return to my suggestion - I would argue here that your task, as
the PR author, is to say "I went through the ported code [...] origins."

Take my own work; I would be lying. I can't even convince myself that I looked
at every line of code. Besides, that kind of homework is impossible for me to
sign off on. How would I even know where to look if it is partial code? We
are talking about pull requests that typically change one or two
hypothetical tiles of the entire floor. I asked an LLM and it gave me some
result. Meh, good enough, git push. If you ask me "where did you get the
idea for this function decorator" or "what made you do this double-pointer
trick", you will get an answer that is a white lie. I think you are
underestimating how much work that sentence carries. The easier option is
clicking "I did not use AI".

> Right - and one conclusion we could draw is - OK, if (some idea of)
everyone is doing it, we should be doing it too.   But I'm sure you'd agree
that's not a very convincing argument.

That is exactly my argument, and much to my regret I think it is the only
honest answer. None of us proposed to use LLMs; they barged in on us.
Shifting the potential blame to the contributor, while the code is coming
entirely from a copyright-infringing tool, is not convincing either.

> Again, this isn't about enforcement - it's about ethics -as it always has
been.   We stated that we didn't accept GPL code, or code derived from GPL,
and we took our contributors word that they had taken our request seriously.


I don't think this is as ethical as you make it sound. You are removing from
the picture the genuinely hostile, anti-copyright tool that is the main
perpetrator. Then we are hoping that users will hold it right, stealing
maybe only a little, but not so much as to be disrespectful. In
my opinion, there is no ethical escape hatch we can use without
participating in the act. The ethical thing here is to be transparent about
our desperation, admit that we don't know what we are taking in, and tread
carefully.

---
On the "being too late" part:

I can in fact bring you a bunch of code and we can do a blindfolded experiment:
you look at the code and tell me which parts are GPL or from Numerical
Recipes. If you can't detect it, and I tick all the checkboxes, then
congrats, you have license-laundered copyrighted code by pushing it into
NumPy/SciPy/scikit-learn... Then the authors of the original code somehow
recognize a unique trick, come back, point at the code, and say, "Yo, this
is theft". That is what I mean by "it is already too late". I take your point
that if we don't do anything it is even worse, since people will bring in
anything, which is true. But your belief in folks' ability to distinguish
different licenses, when the code is coming from an LLM, is higher than mine.

> I didn't use those words - but in any case - as I've said several times,
in several places, the legal argument is more or less irrelevant to us -
our question is whether we are honoring the spirit of the copyrights put on
other people's code, not whether they could, with sufficient resources,
successfully sue us for infringement.

And I have endless respect for you and everyone who strives for this. I want
to believe that I am in your camp, or at least trying to be. My point is: if
this is done in spirit, let's either make it obvious that we are powerless in
case a mistake is made, OR allow no LLMs at all. Both are fine with me.

I do agree with your text, checkbox, and the other procedures you proposed,
and I would be willing to use them. I just can't see their practical function,
even ethically, in case a conflict arises or a fame-seeking contributor starts
pulling in copyrighted code. Because unlike before, the act of stealing is
now done within the LLM, which has, for now, infinite impunity.

On Fri, Feb 13, 2026 at 5:22 PM Matthew Brett <[email protected]>
wrote:

> Hi Ilhan,
>
> On Fri, Feb 13, 2026 at 2:15 PM Ilhan Polat via NumPy-Discussion
> <[email protected]> wrote:
> >
> > Also I'd like to be on record with the unpleasant part out loud. I have
> been in many discussions also at work and in OSS circles so I have quite a
> bit of debate ammo accumulated from both sides. Let me jump into it without
> the fluff to save space;
> >
> > Currently, LLMs are getting really good at what they are tasked to do.
> If you put in the work (just like you would when you are the one writing
> the code), the output is quite acceptable and I feel like I'm reviewing
> somebody else's Pull request. Fix "this" part, change "that" part and done.
> If folks can't use these tools, it's a "they" problem.
>
> I'm not sure what you are saying there - could you clarify?
>
> > I just used it to translate entire LAPACK to C11 (why, mostly for the
> lolz, don't ask, it's a disease), ported all the tests and passing, now
> polishing it up. I mean look at this silly thing
> >
> >
> >
> >
> > No way in hell, I'd type this much code myself. And it is a 1-to-1
> mechanical translation, no creativity involved except hacking into PyData
> theme because I always wanted to tweak it. Now who owns the copyright;
> Dennis Ritchie or LAPACK folks, or is it the entire C codebase of the world
> that trained this machine to write this mechanical code, or is it me who
> paid for it and worked with it etc.? The source of the algorithm is BSD3,
> would you be using this if this was available in BSD3 (I mean it will be
> obviously very soon).
> >
> > As a comparison, the entire SciPy Fortran codebase, ~85,000 SLOC, took
> me 2 years and 7 months to translate manually. Entire LAPACK codebase
> 300,000 SLOC (just the functions) and including the testing, documentation
> etc. took me exactly 1 month and 19 days (Claude Pro something MAX level
> subscription with ~200€ per month from my own pocket). The agent still
> fails spectacularly if you let it run free, but I do put in the work to do
> a proper code review, tweak rules, then force it to read the rules
> periodically, (and most importantly, I know what I am looking at) so this
> went fairly well. It still took insane amount of time to bring the agent
> back on track. force explicit testing, Not to use C++ practices on C code
> so on.
> >
> > At this point, I can confirm that "Agents can do this much but they
> cannot do that much" is rapidly becoming a "God of the Gaps" argument with
> every new version release LLMs chasing a receding horizon, not towards
> intelligence, but precision at parsing and following orders.
>
> I just wanted to clarify that I don't think the argument is about what
> agents can and cannot do.   I think everyone believes they can be very
> useful.
>
> I should also say that your experience is very useful for the
> discussion - but it is somewhat specialized.   I can well see that the
> AI agent could be a huge boon for this sort of semi-mechanical task,
> but there aren't many such tasks in the code that I'm working on.
> And - to return to my suggestion - I would argue here that your task,
> as the PR author, is to say "I went through the ported code very
> carefully, comparing to the original, and I am confident that the
> translation is a faithful language to language translation, from the
> original BSD code, and there is no significant injection of other code
> that may be subject to copyright.   The closest example I could find
> was X, but a quick search for terms Y and Z found no plausible
> copyrighted origins."
>
> > However, in my opinion, our dilemma is not a whether their output is
> potentially GPL'd/copyrighted code or not. Every bit of output of these
> tools is stolen by being trained on copyrighted data. For the folks who did
> not see it, there is a screenshot of VS Code offering me a comment at the
> beginning of the file from a company that does not apparently have any
> public repositories
> https://discuss.scientific-python.org/t/a-policy-on-generative-ai-assisted-contributions/1702/5
> >
> > Therefore, we are, in fact, trying to guess, whether it looks like a
> copyrighted code after the fact, ignoring where the code is pulled from.
> These companies pretty much stole everything; music, science articles, code
> (not just GPLd code, but private repositories), this, that, everything.
> Their practices were/are seriously unethical. It is not a political
> statement but facts. However, it seems like they are getting away with it,
> incredibly, even after they admitted it multiple times all the way at the
> CEO level (in particular, recently, SUNO CEO is pretty bullish, even
> defending why this stealing is fair use while individuals are rapidly being
> prosecuted for the same actions, not to mention Sci-Hub). And some of us
> are working for these companies or working for in the secondary circles.
>
> Right - and one conclusion we could draw is - OK, if (some idea of)
> everyone is doing it, we should be doing it too.   But I'm sure you'd
> agree that's not a very convincing argument.
>
> > Funnily enough, we are tasked with this mordant task of trying to come
> up with a stance on LLM usage. I claim that we should not be spending too
> much time on the epistemological aspects of LLM usage. I can't see any way
> other than being utilitarian about it. Because PRs keep coming and
> maintainers are also using it. So when stuck between a rock and a hardware,
> I think we should be admitting these properly and then choose a path
> knowingly fully aware that we might be making a mistake. Being open about
> the fact that we are going blind into this is probably make more sense
> instead of some serious sounding untested-unvetted legal text and
> checkboxes. Because really nobody knows when we will correct course, if
> ever.
>
> There may be such a legal text - but I don't think that's what I was
> proposing.  Again, this isn't about enforcement - it's about ethics -
> as it always has been.   We stated that we didn't accept GPL code, or
> code derived from GPL, and we took our contributors word that they had
> taken our request seriously.
>
> It really doesn't seem sensible to choose a policy that is obviously
> dangerous for copyright, and wait until it becomes obvious that we
> have damaged copyright.   Rather it seems more sensible to choose an
> option that is less dangerous for copyright, and wait to see, as the
> tools develop, whether we need to re-evaluate.  It really doesn't seem
> likely to me that the policy would stay in place long after it was
> causing the project harm.
>
> > So we can
> >
> > 1- "Stallman" it, with "no AI allowed" stance, while having absolutely
> no way of knowing how the code is generated. So it is a stance based on
> principles. I don't have a problem with it, and can accept it. It is a
> viable and respectable choice. The downside is we will be forcing people to
> lie. Because they will use it and we will not notice it until it is very
> late.
>
> I just don't think this is true - I strongly suspect that people who
> are attracted to open-source, and the open-source community, will not
> generally lie about how they made their contributions - any more than
> we have seen attempts to put GPL code into our BSD codebases.
>
> Bear in mind that the "until it is very late" problem is the one that
> will happen much more quickly with a more permissive policy.
>
> > 2- or find a sentence that is pragmatic enough; something like
> >     "Even if you used LLMs, you should be able to explain the changes
> yourself. LLM based PRs are held to heightened levels of scrutiny and lower
> levels of patience" or something offered in this thread.
> >     I can also accept this, it is also a viable option. The downside is
> that it will make us more hostile, as Sebastian mentioned, and paranoid.
> Occasionally, it will make us accuse innocent folks for using LLMs.
>
> Could you comment on the option that I was proposing - which is that
> anyone generating code with AI should justify the copyright risk, with
> relevant research as necessary?
>
> > Once we can choose this, then we can add agent markdowns, boilerplate
> responses and other details. But it seems like we got stuck at this choice
> level in our last attempts for a policy alignment. I would be much happier
> if we can be a bit more explicit and forthcoming about what we are doing
> and not make it an in vitro Open Source problem. We don't need to use
> strong words like stealing etc. obviously since there is no legal basis for
> it.
>
> I didn't use those words - but in any case - as I've said several
> times, in several places, the legal argument is more or less
> irrelevant to us - our question is whether we are honoring the spirit
> of the copyrights put on other people's code, not whether they could,
> with sufficient resources, successfully sue us for infringement.
>
> Cheers,
>
> Matthew
>
_______________________________________________
NumPy-Discussion mailing list -- [email protected]
To unsubscribe send an email to [email protected]
https://mail.python.org/mailman3//lists/numpy-discussion.python.org
Member address: [email protected]
