I think we will not convince each other on this subject. My position
remains the same:

Ignoring the actual stealing done by these companies while holding each
other accountable for copyright is a no-go for me; using those same tools
to resolve copyright issues, and thus washing them clean, is doubly so. I
will not entertain that option, and that is my personal position; I don't
expect others to share it. Defending copyright protection while ignoring
the largest copyrighted elephant in the room does not seem a sound
argument to me. Moreover, arguing over the tool as if it were oblivious
and neutral is just factually incorrect: there are companies behind these
tools. The technology is undoubtedly impressive and very useful, but its
current arrangement is built on a legally unsound basis with very shady
practices. Therefore I refuse to philosophize over it as a free agent
with shortcomings. In my opinion, the best we can do is at least be
defensive and, if we feel like it, take advantage of LLMs to automate
mundane tasks and save the time that FOSS maintainers definitely lack,
which is what I have been doing with the LAPACK translation I mentioned
above. I still go over the results line by line, which takes an insane
amount of time, though much less than writing them myself.

So "No AI contributions are allowed" is a valid take for me, if that were
the policy. So is "we will use common sense and make opinionated
decisions: for trivial and otherwise laborious tasks we don't care, but
for involved bits we won't touch it."

However, since we are not making any progress on this particular aspect
of the discussion, let's conclude it as inconclusive and go back to the
policy discussion.


On Sat, Feb 21, 2026 at 12:37 PM Matthew Brett <[email protected]>
wrote:

> Hi,
>
> On Sat, Feb 21, 2026 at 9:34 AM Matthew Brett <[email protected]>
> wrote:
> >
> > Hi,
> >
> > On Thu, Feb 19, 2026 at 11:46 AM Ilhan Polat <[email protected]>
> wrote:
> > >
> > [...]
> > >
> > > Using LLM to find copyright violations is, with all respect, one of
> the most ferrous irony I have seen lately. Did you check whether JAX and
> PyArrow claim by the LLM is correct before you accuse the PR author, is
> there an actual code resemblance confirmed by a human? (not blaming you
> obviously but I am sure you see the recursion you are creating here)
> >
> > Yes, as you can imagine, I thought about that problem - of using AI to
> > detect copyright violations in AI.   To me, that is only an irony if
> > we are thinking of a binary - AI-good, AI-bad.   If I think AI-bad,
> > then I think that AI-generated contributions are bad, and therefore I
> > must also think that using AI as a jumping off point for copyright
> > assessment is bad.
> >
> > However, AI-bad is not what I think.   I do think (is this
> > controversial?) that AI is unreliable, that, in typical use, without
> > careful discipline, it will tend to reduce learning and understanding
> > compared to doing the same task without AI, and that it can be useful,
> > if we take those things into account.
> >
> > Then I was thinking about the question that Evgeni (on the Scientific
> > Python Discourse forum) and Robert had asked - which is - fine,
> > copyright is an issue, but how can we reasonably ask the contributor
> > to assess that?
> >
> > That's a serious and difficult question.   One option is to throw up
> > one's hands and say - OK - copyright is dead - let's ignore it, or at
> > least, deemphasise it.   I don't think that's the right answer, which
> > leaves me with the urgent problem of how to proceed.
> >
> > Because this question is difficult, and it is very new (in the sense
> > it has now become very easy for good-faith submissions to violate
> > copyright) - it seems to me we will have to iterate.
> >
> > Then I asked myself - if I had to start somewhere - how would I
> > approach that problem?   The way I tend to use AI, is as a jumping off
> > point - a starting point for a discussion with the AI.   Quite often,
> > as in this case, that jumping off point is misleading or flat-out
> > wrong - but if you know that (are there any experienced users of AI
> > who don't know that?) - then you can start to negotiate with the AI,
> > and you will often, if you are careful, negotiate to something that
> > you can verify from reliable sources.
> >
> > You may have seen me taking that (I assume standard) approach in my
> > negotiations with Gemini in a previous conversation about copyright,
> > that I linked to as a Gist.
> >
> > Now, this is a new world we're in.   I'm not saying that's a practical
> > approach for contributors to explore copyright.  I think that I could
> > use it that way, and that I'd get closer to a reliable answer than if
> > I had not used it (and got no answer).  I suspect, if we trust our
> > contributors, we will find we and they do develop good habits for that
> > use.  But it's a genuinely open question whether that is so.  As I
> > keep saying, my intention was only to raise the idea as a starting
> > point.   And given the nature of AI - I therefore had to run the risk
> > that the relevant quoted AI (from a simple prompt and response) would
> > be misleading or wrong.
>
> I should say that I'm aware that using AI for copyright assessment is
> very delicate.   There is evidence (Xu et al -
> https://arxiv.org/abs/2408.02487) that 2024/5 vintage AI models were
> systematically less likely to correctly identify Copyleft licenses.
> Xu et al  speculate that "some closed-source LLMs may have implemented
> post-processing steps to avoid acknowledging outputs derived from
> copyleft-licensed code snippets."
>
> Likewise, we know that OpenAI was reluctant to put AI-watermarks on
> ChatGPT output, with one suggested reason being surveys that predicted
> a large drop in use if the watermark was added :
>
> https://arstechnica.com/ai/2024/08/openai-has-the-tech-to-watermark-chatgpt-text-it-just-wont-release-it/
> .
>
> And of course we have no way of knowing how the standard commercial
> models have been configured.
>
> Now imagine that we (open-source developers) start using AI to detect
> copyright violation, and that in turn leads to a reduction in use of
> AI tools by the open-source or commercial developers.   It will be
> very difficult for us to know whether later versions of the models
> have been trained with the aim of making it less likely they will
> detect copyright violations, on the basis that less copyright
> violation detection leads to more use of AI.
>
> But perhaps that's a problem for a later time.   And perhaps we can
> already become part of the negotiation with AI code model providers,
> on detection of copyright violation.
>
> Cheers,
>
> Matthew
>
_______________________________________________
NumPy-Discussion mailing list -- [email protected]
To unsubscribe send an email to [email protected]
https://mail.python.org/mailman3//lists/numpy-discussion.python.org
Member address: [email protected]
