That’s a great point. I’d say we could use the Co-authored-by trailer in our commit messages to disclose the actual AI model that was used?
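For example (just a sketch; the exact name/address to use for a given model isn’t settled and the address below is only illustrative), the commit message could end with a standard Git trailer along the lines of:

    Co-authored-by: Claude <noreply@anthropic.com>

GitHub already picks up Co-authored-by trailers on commits, so the disclosure would be persisted in the commit itself rather than only on the mailing list and the GH page.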
> On Jul 23, 2025, at 10:57 AM, Yifan Cai <yc25c...@gmail.com> wrote:
>
> Curious, what are the good ways to disclose the information?
>
> > All of which comes back to: if people disclose whether they used AI, what
> > models, and whether they used the code or text the model wrote verbatim or
> > used it as scaffolding and then heavily modified everything, I think we’ll
> > be in a pretty good spot.
>
> David is disclosing it in the mailing list and the GH page. Should the
> disclosure be persisted in the commit?
>
> - Yifan
>
> On Wed, Jul 23, 2025 at 8:47 AM David Capwell <dcapw...@apple.com
> <mailto:dcapw...@apple.com>> wrote:
>> Sent out this patch that was written 100% by Claude:
>> https://github.com/apache/cassandra/pull/4266
>>
>> Claude’s license doesn’t have issues with the current ASF policy as far as I
>> can tell. If you look at the patch it’s very clear there isn’t any
>> copyrighted material (it’s gluing together C* classes).
>>
>> I could have written this myself, but I had to focus on code reviews and
>> also needed this patch out, so I asked Claude to write it for me so I could
>> focus on reviews. I have reviewed it myself and it’s basically the same
>> code I would have written (notice how small and focused the patch is; larger
>> stuff doesn’t normally pass my peer review).
>>
>>> On Jun 25, 2025, at 2:37 PM, David Capwell <dcapw...@apple.com
>>> <mailto:dcapw...@apple.com>> wrote:
>>>
>>> +1 to what Josh said
>>> Sent from my iPhone
>>>
>>>> On Jun 25, 2025, at 1:18 PM, Josh McKenzie <jmcken...@apache.org
>>>> <mailto:jmcken...@apache.org>> wrote:
>>>>
>>>>
>>>> Did some more digging. Apparently the way a lot of headline-grabbers have
>>>> been making models reproduce code verbatim is to prompt them with dozens
>>>> of verbatim tokens of copyrighted code as input, where the completion is
>>>> then very heavily weighted to regurgitate the initial implementation.
>>>> Which makes sense; if you copy/paste 100 lines of copyrighted code, the
>>>> statistically likely completion for that will be that initial
>>>> implementation.
>>>>
>>>> For local LLMs, verbatim reproduction is different but apparently
>>>> comparably unlikely because they have far fewer parameters (32B vs. 671B
>>>> for DeepSeek, for instance) relative to a pre-training corpus of trillions
>>>> of tokens (30T in the case of Qwen3-32B, for instance), so the individual
>>>> tokens from the copyrighted material are highly unlikely to actually be
>>>> stored in the model to be reproduced, and certainly not in sequence. They
>>>> don’t have the post-generation checks claimed by the SOTA models, but are
>>>> apparently considered in the "< 1 in 10,000 completions will generate
>>>> copyrighted code" territory.
>>>>
>>>> When given a human-language prompt, or a multi-agent pipelined "still
>>>> human language but from your architect agent" prompt, the likelihood of
>>>> producing a string of copyrighted code in that manner is statistically
>>>> very, very low. I think we’re at far more risk of contributors
>>>> copy/pasting Stack Overflow or code from other projects than we are from
>>>> modern genAI models producing blocks of copyrighted code.
>>>>
>>>> All of which comes back to: if people disclose whether they used AI, what
>>>> models, and whether they used the code or text the model wrote verbatim or
>>>> used it as scaffolding and then heavily modified everything, I think
>>>> we’ll be in a pretty good spot.
>>>>
>>>> On Wed, Jun 25, 2025, at 12:47 PM, David Capwell wrote:
>>>>>
>>>>>> 2. Models that do not do output filtering to restrict the reproduction
>>>>>> of training data unless the tool can ensure the output is license
>>>>>> compatible?
>>>>>>
>>>>>> 2 would basically prohibit locally run models.
>>>>>
>>>>>
>>>>> I am not for this, for the reasons listed above. There isn’t a difference
>>>>> between this and a contributor copying code and sending it our way. We
>>>>> still need to validate that the code can be accepted.
>>>>>
>>>>> We also have the issue of this being a broad stroke. If the user asked a
>>>>> model to write a test for code the human wrote, do we reject the
>>>>> contribution because they used a local model? That poses very little
>>>>> copyright risk, yet our policy would now reject it.
>>>>>
>>>>> Sent from my iPhone
>>>>>
>>>>>> On Jun 25, 2025, at 9:10 AM, Ariel Weisberg <ar...@weisberg.ws
>>>>>> <mailto:ar...@weisberg.ws>> wrote:
>>>>>> 2. Models that do not do output filtering to restrict the reproduction
>>>>>> of training data unless the tool can ensure the output is license
>>>>>> compatible?
>>>>>>
>>>>>> 2 would basically prohibit locally run models.
>>>>
>>