Hi,

On Sat, Feb 21, 2026 at 9:34 AM Matthew Brett <[email protected]> wrote:
>
> Hi,
>
> On Thu, Feb 19, 2026 at 11:46 AM Ilhan Polat <[email protected]> wrote:
>
> > [...]
> >
> > Using an LLM to find copyright violations is, with all respect, one of
> > the most ferrous ironies I have seen lately. Did you check whether the
> > JAX and PyArrow claim by the LLM is correct before you accuse the PR
> > author; is there an actual code resemblance confirmed by a human? (Not
> > blaming you, obviously, but I am sure you see the recursion you are
> > creating here.)
>
> Yes, as you can imagine, I thought about that problem - of using AI to
> detect copyright violations in AI. To me, that is only an irony if we
> are thinking of a binary - AI-good, AI-bad. If I think AI-bad, then I
> think that AI-generated contributions are bad, and therefore I must
> also think that using AI as a jumping-off point for copyright
> assessment is bad.
>
> However, AI-bad is not what I think. I do think (is this
> controversial?) that AI is unreliable; that, in typical use, without
> careful discipline, it will tend to reduce learning and understanding
> compared to doing the same task without AI; and that it can be useful
> if we take those things into account.
>
> Then I was thinking about the question that Evgeni (on the Scientific
> Python Discourse forum) and Robert had asked - which is - fine,
> copyright is an issue, but how can we reasonably ask the contributor
> to assess that?
>
> That's a serious and difficult question. One option is to throw up
> one's hands and say - OK - copyright is dead - let's ignore it, or at
> least de-emphasise it. I don't think that's the right answer, which
> leaves me with the urgent problem of how to proceed.
>
> Because this question is difficult, and it is very new (in the sense
> that it has now become very easy for good-faith submissions to violate
> copyright) - it seems to me we will have to iterate.
>
> Then I asked myself - if I had to start somewhere, how would I
> approach that problem? The way I tend to use AI is as a jumping-off
> point - a starting point for a discussion with the AI. Quite often, as
> in this case, that jumping-off point is misleading or flat-out wrong -
> but if you know that (are there any experienced users of AI who don't
> know that?), then you can start to negotiate with the AI, and you will
> often, if you are careful, negotiate to something that you can verify
> from reliable sources.
>
> You may have seen me taking that (I assume standard) approach in my
> negotiations with Gemini in a previous conversation about copyright,
> which I linked to as a Gist.
>
> Now, this is a new world we're in. I'm not saying that's a practical
> approach for contributors to explore copyright. I think that I could
> use it that way, and that I'd get closer to a reliable answer than if
> I had not used it (and got no answer). I suspect, if we trust our
> contributors, we will find we and they do develop good habits for that
> use. But it's a genuinely open question whether that is so. As I keep
> saying, my intention was only to raise the idea as a starting point.
> And given the nature of AI, I therefore had to run the risk that the
> relevant quoted AI (from a simple prompt and response) would be
> misleading or wrong.
I should say that I'm aware that using AI for copyright assessment is
very delicate. There is evidence (Xu et al.,
https://arxiv.org/abs/2408.02487) that 2024/5-vintage AI models were
systematically less likely to correctly identify copyleft licenses. Xu
et al. speculate that "some closed-source LLMs may have implemented
post-processing steps to avoid acknowledging outputs derived from
copyleft-licensed code snippets."

Likewise, we know that OpenAI was reluctant to put AI watermarks on
ChatGPT output, with one suggested reason being surveys that predicted a
large drop in use if the watermark was added:
https://arstechnica.com/ai/2024/08/openai-has-the-tech-to-watermark-chatgpt-text-it-just-wont-release-it/
And of course we have no way of knowing how the standard commercial
models have been configured.

Now imagine that we (open-source developers) start using AI to detect
copyright violations, and that this in turn leads to a reduction in the
use of AI tools by open-source or commercial developers. It will then be
very difficult for us to know whether later versions of the models have
been trained with the aim of making it less likely that they will detect
copyright violations, on the basis that less copyright-violation
detection leads to more use of AI.

But perhaps that's a problem for a later time. And perhaps we can
already become part of the negotiation with AI code model providers on
detection of copyright violation.

Cheers,

Matthew
