That’s a great point. I’d say we could use the Co-authored-by trailer in our commit messages to disclose the actual AI model that was used?
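For example (just a sketch; the exact name/address to use for a given model isn’t settled and the address below is only illustrative), the commit message could end with a standard Git trailer along the lines of:

    Co-authored-by: Claude <noreply@anthropic.com>

GitHub already picks up Co-authored-by trailers on commits, so the disclosure would be persisted in the commit itself rather than only on the mailing list and the GH page.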
> On Jul 23, 2025, at 10:57 AM, Yifan Cai <yc25c...@gmail.com> wrote:
>
> Curious, what are the good ways to disclose the information?
>
> > All of which comes back to: if people disclose whether they used AI, what
> > models, and whether they used the code or text the model wrote verbatim or
> > used it as scaffolding and then heavily modified everything, I think we’ll
> > be in a pretty good spot.
>
> David is disclosing it in the mailing list and the GH page. Should the
> disclosure be persisted in the commit?
>
> - Yifan
>
> On Wed, Jul 23, 2025 at 8:47 AM David Capwell <dcapw...@apple.com
> <mailto:dcapw...@apple.com>> wrote:
>> Sent out this patch that was written 100% by Claude:
>> https://github.com/apache/cassandra/pull/4266
>>
>> Claude’s license doesn’t have issues with the current ASF policy as far as I
>> can tell. If you look at the patch it’s very clear there isn’t any
>> copyrighted material (it’s gluing together C* classes).
>>
>> I could have written this myself, but I had to focus on code reviews and
>> also needed this patch out, so I asked Claude to write it for me so I could
>> focus on reviews. I have reviewed it myself and it’s basically the same
>> code I would have written (notice how small and focused the patch is; larger
>> stuff doesn’t normally pass my peer review).
>>
>>> On Jun 25, 2025, at 2:37 PM, David Capwell <dcapw...@apple.com
>>> <mailto:dcapw...@apple.com>> wrote:
>>>
>>> +1 to what Josh said
>>> Sent from my iPhone
>>>
>>>> On Jun 25, 2025, at 1:18 PM, Josh McKenzie <jmcken...@apache.org
>>>> <mailto:jmcken...@apache.org>> wrote:
>>>>
>>>>
>>>> Did some more digging. Apparently the way a lot of headline-grabbers have
>>>> been making models reproduce code verbatim is to prompt them with dozens
>>>> of verbatim tokens of copyrighted code as input, where the completion is
>>>> then very heavily weighted to regurgitate the initial implementation.
>>>> Which makes sense; if you copy/paste 100 lines of copyrighted code, the
>>>> statistically likely completion for that will be that initial
>>>> implementation.
>>>>
>>>> For local LLMs, verbatim reproduction is different but apparently
>>>> comparably unlikely because they have far fewer parameters (32B vs. 671B
>>>> for DeepSeek, for instance) relative to a pre-training corpus of trillions
>>>> of tokens (30T in the case of Qwen3-32B, for instance), so the individual
>>>> tokens from the copyrighted material are highly unlikely to actually be
>>>> stored in the model to be reproduced, and certainly not in sequence. They
>>>> don’t have the post-generation checks claimed by the SOTA models, but are
>>>> apparently considered in the "< 1 in 10,000 completions will generate
>>>> copyrighted code" territory.
>>>>
>>>> When given a human-language prompt, or a multi-agent pipelined "still
>>>> human language but from your architect agent" prompt, the likelihood of
>>>> producing a string of copyrighted code in that manner is statistically
>>>> very, very low. I think we’re at far more risk of contributors
>>>> copy/pasting Stack Overflow or code from other projects than we are from
>>>> modern genAI models producing blocks of copyrighted code.
>>>>
>>>> All of which comes back to: if people disclose whether they used AI, what
>>>> models, and whether they used the code or text the model wrote verbatim or
>>>> used it as scaffolding and then heavily modified everything, I think
>>>> we’ll be in a pretty good spot.
>>>>
>>>> On Wed, Jun 25, 2025, at 12:47 PM, David Capwell wrote:
>>>>>
>>>>>> 2. Models that do not do output filtering to restrict the reproduction
>>>>>> of training data unless the tool can ensure the output is license
>>>>>> compatible?
>>>>>>
>>>>>> 2 would basically prohibit locally run models.
>>>>>
>>>>>
>>>>> I am not for this, for the reasons listed above. There isn’t a difference
>>>>> between this and a contributor copying code and sending it our way. We
>>>>> still need to validate that the code can be accepted.
>>>>>
>>>>> We also have the issue of this being a broad stroke. If the user asked a
>>>>> model to write a test for code the human wrote, do we reject the
>>>>> contribution because they used a local model? That poses very little
>>>>> copyright risk, yet our policy would now reject it.
>>>>>
>>>>> Sent from my iPhone
>>>>>
>>>>>> On Jun 25, 2025, at 9:10 AM, Ariel Weisberg <ar...@weisberg.ws
>>>>>> <mailto:ar...@weisberg.ws>> wrote:
>>>>>> 2. Models that do not do output filtering to restrict the reproduction
>>>>>> of training data unless the tool can ensure the output is license
>>>>>> compatible?
>>>>>>
>>>>>> 2 would basically prohibit locally run models.
>>>>
>>