Hi,

On Thu, Feb 19, 2026 at 11:46 AM Ilhan Polat <[email protected]> wrote:
> [...]
>
> Using LLM to find copyright violations is, with all respect, one of the most
> ferrous irony I have seen lately. Did you check whether JAX and PyArrow claim
> by the LLM is correct before you accuse the PR author, is there an actual
> code resemblance confirmed by a human? (not blaming you obviously but I am
> sure you see the recursion you are creating here)
Yes, as you can imagine, I thought about that problem - of using AI to detect copyright violations in AI. To me, that is only an irony if we are thinking in a binary - AI-good, AI-bad. If I thought AI-bad, then I would think that AI-generated contributions are bad, and therefore I would also have to think that using AI as a jumping-off point for copyright assessment is bad.

However, AI-bad is not what I think. I do think (is this controversial?) that AI is unreliable; that, in typical use, without careful discipline, it will tend to reduce learning and understanding compared to doing the same task without AI; and that it can be useful, if we take those things into account.

Then I was thinking about the question that Evgeni (on the Scientific Python Discourse forum) and Robert had asked - which is: fine, copyright is an issue, but how can we reasonably ask the contributor to assess that? That's a serious and difficult question. One option is to throw up one's hands and say - OK, copyright is dead, let's ignore it, or at least deemphasise it. I don't think that's the right answer, which leaves me with the urgent problem of how to proceed.

Because this question is difficult, and very new (in the sense that it has now become very easy for good-faith submissions to violate copyright), it seems to me we will have to iterate. So I asked myself - if I had to start somewhere, how would I approach that problem?

The way I tend to use AI is as a jumping-off point - a starting point for a discussion with the AI. Quite often, as in this case, that jumping-off point is misleading or flat-out wrong - but if you know that (are there any experienced users of AI who don't know that?), then you can start to negotiate with the AI, and you will often, if you are careful, negotiate your way to something that you can verify from reliable sources.
You may have seen me taking that (I assume standard) approach in my negotiations with Gemini in a previous conversation about copyright, which I linked to as a Gist.

Now, this is a new world we're in. I'm not saying that's already a practical approach for contributors to explore copyright. I do think that I could use it that way, and that I'd get closer to a reliable answer than if I had not used it (and had no answer at all). I suspect that, if we trust our contributors, we will find that we and they develop good habits for that use. But it's a genuinely open question whether that is so.

As I keep saying, my intention was only to raise the idea as a starting point. And given the nature of AI, I therefore had to run the risk that the quoted AI output (from a simple prompt and response) would be misleading or wrong.

Cheers,

Matthew
_______________________________________________
NumPy-Discussion mailing list -- [email protected]
To unsubscribe send an email to [email protected]
https://mail.python.org/mailman3//lists/numpy-discussion.python.org
Member address: [email protected]
