Curious: what are good ways to disclose this information?

> All of which comes back to: if people disclose if they used AI, what models, and whether they used the code or text the model wrote verbatim or used it as scaffolding and then heavily modified everything, I think we'll be in a pretty good spot.

David is disclosing it on the mailing list and the GitHub page. Should the disclosure also be persisted in the commit?

- Yifan
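One way to do that, as a sketch: record the disclosure as git commit message trailers next to the standard Co-authored-by line. The AI-specific trailer name, its wording, and the co-author address below are illustrative assumptions, not an existing git, GitHub, or ASF convention:

    CASSANDRA-XXXXX: <one-line summary of the change>

    <commit message body>

    AI-assisted: Claude (generated initial patch; reviewed and edited by committer)
    Co-authored-by: Claude <noreply@anthropic.com>

Trailers have the advantage of being machine-readable, so a disclosure recorded this way survives in git log and can later be grepped or audited, unlike a note that lives only on the mailing list or the pull request.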
On Wed, Jul 23, 2025 at 8:47 AM David Capwell <dcapw...@apple.com> wrote:

> Sent out this patch that was written 100% by Claude:
> https://github.com/apache/cassandra/pull/4266
>
> Claude's license doesn't have issues with the current ASF policy as far as I can tell. If you look at the patch, it's very clear there isn't any copyrighted material (it's gluing together C* classes).
>
> I could have written this myself, but I had to focus on code reviews and also needed this patch out, so I asked Claude to write it for me so I could focus on reviews. I have reviewed it myself, and it's basically the same code I would have written (notice how small and focused the patch is; larger stuff doesn't normally pass my peer review).
>
> On Jun 25, 2025, at 2:37 PM, David Capwell <dcapw...@apple.com> wrote:
>
>> +1 to what Josh said
>>
>> Sent from my iPhone
>>
>> On Jun 25, 2025, at 1:18 PM, Josh McKenzie <jmcken...@apache.org> wrote:
>>
>>> Did some more digging. Apparently the way a lot of headline-grabbers have been making models reproduce code verbatim is to prompt them with dozens of verbatim tokens of copyrighted code as input, where the completion is then very heavily weighted toward regurgitating the initial implementation. Which makes sense: if you copy/paste 100 lines of copyrighted code, the statistically likely completion is the rest of that initial implementation.
>>>
>>> For local LLMs, the likelihood of verbatim reproduction is different in mechanism but apparently comparably low. They have far fewer parameters (32B vs. 671B for DeepSeek, for instance) relative to a pre-training corpus of trillions of tokens (30T in the case of Qwen3-32B, for instance), so the individual tokens from copyrighted material are highly unlikely to actually be *stored* in the model to be reproduced, and certainly not in sequence. They don't have the post-generation checks claimed by the SOTA models, but are apparently considered in the "< 1 in 10,000 completions will generate copyrighted code" territory.
>>>
>>> When given a human-language prompt, or a multi-agent pipelined "still human language but from your architect agent" prompt, the likelihood of producing a string of copyrighted code in that manner is statistically very, very low. I think we're at far more risk of contributors copy/pasting from Stack Overflow or other projects than we are of modern genAI models producing blocks of copyrighted code.
>>>
>>> All of which comes back to: if people disclose if they used AI, what models, and whether they used the code or text the model wrote verbatim or used it as scaffolding and then heavily modified everything, I think we'll be in a pretty good spot.
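A rough back-of-the-envelope check on the storage argument above, using the figures Josh quotes (treating parameters as a memorization budget is an assumption of this aside, not something from the thread):

    30e12 training tokens / 32e9 parameters ≈ 940 tokens per parameter

In other words, a 32B-parameter model would have to losslessly compress roughly 940 tokens of training data into every single parameter to retain its 30T-token corpus verbatim, which is far beyond any plausible capacity. Wholesale memorization is arithmetically impossible; only sequences repeated many times during training are plausibly recoverable, which is consistent with the "< 1 in 10,000 completions" estimate.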
>>> On Wed, Jun 25, 2025, at 12:47 PM, David Capwell wrote:
>>>
>>>>> 2. Models that do not do output filtering to restrict the reproduction of training data unless the tool can ensure the output is license compatible?
>>>>>
>>>>> 2 would basically prohibit locally run models.
>>>>
>>>> I am not for this, for the reasons listed above. There isn't a difference between this and a contributor copying code and sending it our way. We still need to validate that the code can be accepted.
>>>>
>>>> We also have the issue of this being a broad stroke. If the user asked a model to write a test for code the human wrote, do we reject the contribution because they used a local model? This poses very little copyright risk, yet our policy would now reject it.
>>>>
>>>> Sent from my iPhone
>>>>
>>>> On Jun 25, 2025, at 9:10 AM, Ariel Weisberg <ar...@weisberg.ws> wrote:
>>>>
>>>>> 2. Models that do not do output filtering to restrict the reproduction of training data unless the tool can ensure the output is license compatible?
>>>>>
>>>>> 2 would basically prohibit locally run models.
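For concreteness, a minimal sketch of the kind of post-generation output filtering Ariel's option 2 refers to: index long word n-grams from a reference corpus, then flag any generated output that reproduces one verbatim. This is Python, and the 12-word window, function names, and overall design are illustrative assumptions, not taken from any existing tool or from the thread:

    # Flag model output that reproduces long verbatim runs from a known corpus.
    # A 12-word window is an arbitrary illustrative threshold.

    def ngrams(text, n=12):
        words = text.split()
        for i in range(len(words) - n + 1):
            yield " ".join(words[i:i + n])

    def build_index(corpus_texts):
        # Set of every n-word window across the reference corpus.
        index = set()
        for text in corpus_texts:
            index.update(ngrams(text))
        return index

    def verbatim_hits(generated, index):
        # Any hit means a full n-word run of the output appears in the corpus.
        return [g for g in ngrams(generated) if g in index]

A real filter would presumably normalize whitespace and identifiers and hash the windows to keep the index tractable, but the check itself is simple. The hard part is having the reference corpus to check against, which hosted vendors have and local-model users generally do not, and that is why option 2 would effectively exclude locally run models.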