Did some more digging. Apparently the way a lot of headline-grabbers have been 
making models reproduce code verbatim is to prompt them with dozens of verbatim 
tokens of copyrighted code as input, so that the completion is very heavily 
weighted toward regurgitating the rest of that implementation. Which makes 
sense: if you copy/paste 100 lines of copyrighted code, the statistically 
likely completion is the remainder of that same implementation.
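
For concreteness, here is a minimal sketch of that kind of probe using Hugging 
Face transformers. The model name, file path, and the similarity check are just 
illustrative assumptions on my part, not a description of any particular 
experiment:

    # Feed a long verbatim prefix to a local model and measure how much of the
    # real continuation comes back. Model name and file path are placeholders.
    from difflib import SequenceMatcher
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-32B")
    model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-32B")

    source = open("some_copyrighted_file.py").read()
    prefix, continuation = source[:4000], source[4000:6000]  # ~100 lines in

    inputs = tok(prefix, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=256, do_sample=False)  # greedy
    new_tokens = out[0][inputs["input_ids"].shape[1]:]
    completion = tok.decode(new_tokens, skip_special_tokens=True)

    # Similarity between what the model produced and the actual continuation.
    print(SequenceMatcher(None, completion, continuation[:len(completion)]).ratio())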

For local LLMs, verbatim reproduction is unlikely for a *different* but 
apparently comparable reason: they have far fewer parameters (32B vs. 671B for 
DeepSeek, for instance) relative to a pre-training corpus of trillions of 
tokens (30T in the case of Qwen3-32B), so the individual tokens from any given 
piece of copyrighted material are highly unlikely to be actually *stored* in 
the model to be reproduced, and certainly not in sequence. They don't have the 
post-generation checks claimed by the SOTA models, but are apparently 
considered to be in the "< 1 in 10,000 completions will generate copyrighted 
code" territory.
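
As a quick sanity check on that intuition, a back-of-envelope with the numbers 
above (the bytes-per-parameter and bytes-per-token figures are my assumptions, 
not something from this thread):

    # Rough ratio of pre-training tokens to parameters for Qwen3-32B,
    # using the figures cited above.
    params = 32e9            # parameters
    train_tokens = 30e12     # pre-training tokens
    print(train_tokens / params)   # ~940 training tokens per parameter
    # At ~2 bytes per parameter (bf16) the whole model is ~64 GB, versus roughly
    # 100+ TB of raw pre-training text, so verbatim storage of the corpus
    # simply cannot fit.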

When given a plain human-language prompt, or a multi-agent pipelined "still 
human language, but from your architect agent" prompt, the likelihood of 
producing a string of copyrighted code in that manner is statistically very, 
very low. I think we're at far more risk from contributors copy/pasting Stack 
Overflow or code from other projects than we are from modern genAI models 
producing blocks of copyrighted code.

All of which comes back to: if people disclose whether they used AI, which 
models, and whether they used the code or text the model wrote verbatim or 
used it as scaffolding and then heavily modified everything, I think we'll be 
in a pretty good spot.

On Wed, Jun 25, 2025, at 12:47 PM, David Capwell wrote:
>> 
>>> 2. Models that do not do output filtering to restrict the reproduction of 
>>> training data unless the tool can ensure the output is license compatible?
>>> 
>>> 2 would basically prohibit locally run models.
> 
> I am not for this for the reasons listed above. There isn’t a difference 
> between this and a contributor copying code and sending it our way. We still 
> need to validate that the code can be accepted.
> 
> We also have the issue of this being a broad stroke. If the user asked a 
> model to write a test for the code the human wrote, do we reject the 
> contribution because they used a local model? This poses very little 
> copyright risk, yet our policy would now reject it.
> 
> Sent from my iPhone
> 
>> On Jun 25, 2025, at 9:10 AM, Ariel Weisberg <ar...@weisberg.ws> wrote:
>> 2. Models that do not do output filtering to restrict the reproduction of 
>> training data unless the tool can ensure the output is license compatible?
>> 
>> 2 would basically prohibit locally run models.
