On Wed, 2026-02-11 at 23:22 +0000, Matthew Brett via NumPy-Discussion wrote:
> Hi,
>
> On Wed, Feb 11, 2026 at 11:02 PM Lucas Colley via NumPy-Discussion
> <[email protected]> wrote:
>
> > Hi Matthew,
> >
> > That all sounds reasonable to me so far, but what are the next
> > steps?*
> >
> > > put a heavy requirement on contributors to either a) write the
> > > code themselves, perhaps having asked for preliminary analysis
> > > (but not substantial code drafts) from AI
> >
> > Is this enforceable to a significant extent? If not, in what sense
> > could it pose a genuinely ‘heavy requirement’?
> >
> > > or b) write the code with AI, but demonstrate that they have done
> > > the research to establish the generated code does not breach
> > > copyright.
> >
> > Perhaps this is more enforceable? But to be honest it is still quite
> > unclear to me how I would establish with certainty that code I’ve
> > had generated does not breach copyright, much less code that is
> > being presented to me by a contributor. Do you see how to realise a
> > ‘heavy requirement’ here?
> >
> > I agree with the spirit of the thought that the burden (if it is to
> > exist) needs to be shifted away from maintainers, but it’s unclear
> > to me how we can actually shift it elsewhere.
> >
> > As we discussed last year, I think we have a start at a decent
> > argument towards including a checkbox in PR templates which
> > contributors must tick to state that they recognise the risk of
> > copyright violation via LLM generated code and take responsibility
> > for the code they are submitting:
> > https://github.com/matthew-brett/sp-ai-post/issues/2#issuecomment-2935428854
> >
> > Even there though, there are still multiple debatable premises. Of
> > course, we can hardly aim for some sort of logical proof of the
> > right way forward, but I think we need more focused attention and
> > argument towards a specific and understandable goal if we are to be
> > able to come to consensus on some concrete steps forward. It is to
> > this thread's merit that the discussion has been so varied and
> > touched on many topics, but it is also demonstrative of the problem
> > that broad and vague back-and-forths don’t really help settle on
> > anything concrete.
>
> Just to clarify - in case it wasn't clear, what I'm floating as a
> proposal, would be something like this, as a message to PR authors:
>
> Please specify one of these:
>
> 1) I wrote this code myself, without looking at significant
>    AI-generated code OR
> 2) The code contains AI-generated content, but the AI-generated code
>    is sufficiently trivial that it cannot reasonably be subject to
>    copyright OR
> 3) There is non-trivial AI-generated code in this PR, and I have
>    documented my searches to confirm that no parts of the code are
>    subject to existing copyright.
>

While I am not particularly enthusiastic about focusing on copyright, I
would be happy with adding such a checkbox to PRs.

(If it were focused on copyright, then it seems to me we would need to
ask more things, ranging from "I used a source, but it had no code" to
"I used a source with code but I checked its license". If we want this,
I would prefer a single fuzzy sentence that links out to a separate page
that can also discuss pitfalls around copyright+AI.)

I am not sure that a request to tick a checkbox will be honored, but I
like the thought. First, it will increase the chance of getting the
information (which I want as a reviewer). Second, my unfortunate feeling
is that we'll get more aggressive/less friendly about closing PRs, and
that is a shame. Having the checkbox makes that path a bit easier on us
and maybe also makes it more transparent to the user that we are
struggling with this (the worry, of course, is closing a genuine human
PR by accident).

I think I largely understand the concerns around copyright, and maybe I
am not being careful/understanding enough by not being overly
worried?... But, to my very personal feeling, the product of how much I
feel we should worry and how much I feel that stressing these issues
will help us as a project/open source just doesn't make me enthusiastic
about aggressively pointing out these possible issues.

There are many things to discuss around this. What does eroding
copyright here mean for us as a project, for open source (GPL?), for
open but not free code, for code that is leaked but sold? How will
enforcement of actual copyright issues play out in practice?
I just don't think this is the venue for settling these questions [1],
and I would need a lot more clarity to even form a strong opinion that I
would be willing to announce to the world with the weight of NumPy
behind it.

- Sebastian

[1] And yeah, this is laziness, because I feel that to really settle it
even for myself, I may have to spend weeks reading up and thinking
about it.

> So - the burden for the reviewer is just to confirm, in case 3, that
> the author has documented their searches. We take the word of the
> contributor for the option they have chosen. Obviously, the
> documentation requirement of case 3 is somewhat of a burden for the
> contributor, and may therefore encourage them to write the code
> themselves, to avoid that burden. That might not be a bad thing, long
> term, for the project, and it seems reasonable to me as some defence
> against copyright violation, and a message that the project cares
> about such violation.
>
> Cheers,
>
> Matthew
> _______________________________________________
> NumPy-Discussion mailing list -- [email protected]
> To unsubscribe send an email to [email protected]
> https://mail.python.org/mailman3//lists/numpy-discussion.python.org
> Member address: [email protected]
