On Wed, 2026-02-11 at 23:22 +0000, Matthew Brett via NumPy-Discussion
wrote:
> Hi,
> 
> On Wed, Feb 11, 2026 at 11:02 PM Lucas Colley via NumPy-Discussion <
> [email protected]> wrote:
> 
> > Hi Matthew,
> > 
> > That all sounds reasonable to me so far, but what are the next
> > steps?
> > 
> > > put a heavy requirement on contributors to either a) write the
> > > code themselves, perhaps having asked for preliminary analysis
> > > (but not substantial code drafts) from AI
> > 
> > Is this enforceable to a significant extent? If not, in what sense
> > could
> > it pose a genuinely ‘heavy requirement’?
> > 
> > > or b) write the code with AI, but demonstrate that they have done
> > > the research to establish the generated code does not breach
> > > copyright.
> > 
> > Perhaps this is more enforceable? But to be honest it is still
> > quite
> > unclear to me how I would establish with certainty that code I’ve
> > had
> > generated does not breach copyright, much less code that is being
> > presented
> > to me by a contributor. Do you see how to realise a ‘heavy
> > requirement’
> > here?
> > 
> > I agree with the spirit of the thought that the burden (if it is to
> > exist)
> > needs to be shifted away from maintainers, but it’s unclear to me
> > how we
> > can actually shift it elsewhere.
> > 
> > As we discussed last year, I think we have a start at a decent
> > argument
> > towards including a checkbox in PR templates which contributors
> > must tick
> > to state that they recognise the risk of copyright violation via
> > LLM
> > generated code and take responsibility for the code they are
> > submitting:
> > https://github.com/matthew-brett/sp-ai-post/issues/2#issuecomment-2935428854
> > .
> > 
> > Even there though, there are still multiple debatable premises. Of
> > course,
> > we can hardly aim for some sort of logical proof of the right way
> > forward,
> > but I think we need more focused attention and argument towards a
> > specific
> > and understandable goal if we are to be able to come to consensus
> > on some
> > concrete steps forward. It is to this thread's merit that the
> > discussion
> > has been so varied and touched on many topics, but it is also
> > demonstrative
> > of the problem that broad and vague back-and-forths don’t really
> > help
> > settle on anything concrete.
> > 
> 
> Just to clarify - in case it wasn't clear, what I'm floating as a
> proposal,
> would be something like this, as a message to PR authors:
> 
> Please specify one of these:
> 
> 1) I wrote this code myself, without looking at significant
> AI-generated code OR
> 2) The code contains AI-generated content, but the AI-generated code
> is sufficiently trivial that it cannot reasonably be subject to
> copyright OR
> 3) There is non-trivial AI-generated code in this PR, and I have
> documented my searches to confirm that no parts of the code are
> subject to existing copyright.
> 

While I am not particularly enthusiastic about focusing on copyright,
I would be happy with adding such a checkbox to PRs.
(If it were focused on copyright, then it seems to me we would need to
ask more things, ranging from "I used a source, but it had no code" to
"I used a source with code, but I checked its license". If we want
this, I would prefer a single fuzzy sentence that links out to a page
that can also discuss pitfalls around copyright+AI.)

I am not sure that a request to tick a checkbox there will be honored,
but I like the thought.
First, it will increase the chance of getting the information (which I
want as a reviewer).
Second, my unfortunate feeling is that we'll get more aggressive/less
friendly about closing PRs, and that is a shame. Having the checkbox
makes that a bit easier on us, and maybe also more transparent to the
user that we are struggling with this (the worry of course is closing
a genuine human PR by accident).

I think I largely understand the concerns around copyright, and maybe
I am not being careful/understanding enough by not being overly
worried?...
But to my very personal feeling, the product of how much I feel we
should worry and how much I feel that stressing these issues will help
us as a project/open source just doesn't make me enthusiastic about
aggressively pointing out these possible issues.

There are many things to discuss around this. What does eroding
copyright here mean for us as a project, for open source (GPL?), for
open but not free code, for code that is leaked but sold?
How will enforcement of actual copyright issues play out in practice?
I just don't think this is the venue for settling these questions [1]
and I would need a lot more clarity to even form a strong opinion that
I would be willing to announce to the world with the weight of NumPy
behind it.

- Sebastian


[1] And yeah, this is laziness, because I feel that to really settle
it even for myself, I may have to spend weeks reading up and thinking
about it.



> So - the burden for the reviewer is just to confirm, in case 3, that
> the author has documented their searches. We take the word of the
> contributor for the option they have chosen. Obviously, the
> documentation requirement of case 3 is somewhat of a burden for the
> contributor, and may therefore encourage them to write the code
> themselves, to avoid that burden. That might not be a bad thing, long
> term, for the project, and it seems reasonable to me as some defence
> against copyright violation, and a message that the project cares
> about such violation.
> 
> Cheers,
> 
> Matthew
> _______________________________________________
> NumPy-Discussion mailing list -- [email protected]
> To unsubscribe send an email to [email protected]
> https://mail.python.org/mailman3//lists/numpy-discussion.python.org
> Member address: [email protected]