Did some more digging. Apparently the way a lot of headline-grabbers have been getting models to reproduce code verbatim is to prompt them with dozens of verbatim tokens of copyrighted code as input, so the completion is then very heavily weighted toward regurgitating the original implementation. Which makes sense; if you copy/paste 100 lines of copyrighted code, the statistically likely completion is the rest of that original implementation.
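To make that concrete, here is a rough sketch of how you could test this yourself against a local model. Everything here is hypothetical and just my own placeholder choices: the model name, the 256-token completion length, and the 8-gram overlap check are illustrative, not anything from the articles.

# Hypothetical sketch: prime a local model with a long verbatim prefix and
# measure how much of its completion overlaps the known continuation.
# Model name and thresholds below are placeholder assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-Coder-7B"  # stand-in for any locally run code model

def ngram_overlap(a: str, b: str, n: int = 8) -> float:
    """Fraction of n-token shingles of `a` that also appear in `b`."""
    ta, tb = a.split(), b.split()
    if len(ta) < n:
        return 0.0
    grams_b = {tuple(tb[i:i + n]) for i in range(len(tb) - n + 1)}
    grams_a = [tuple(ta[i:i + n]) for i in range(len(ta) - n + 1)]
    return sum(g in grams_b for g in grams_a) / len(grams_a)

def completion_overlap(prefix: str, known_continuation: str) -> float:
    """Feed the model a verbatim prefix and compare its greedy completion
    against the continuation it would be 'regurgitating'."""
    tok = AutoTokenizer.from_pretrained(MODEL)
    model = AutoModelForCausalLM.from_pretrained(MODEL, device_map="auto")
    inputs = tok(prefix, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
    completion = tok.decode(out[0][inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)
    return ngram_overlap(completion, known_continuation)

The expectation, if the articles are right, is that a long copyrighted prefix scores high overlap, while a plain natural-language prompt asking for the same functionality scores near zero.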
For local LLMs, the likelihood of verbatim reproduction is low for a *different* but apparently comparable reason: they have far fewer parameters (32B vs. 671B for DeepSeek, for instance) relative to a pre-training corpus of trillions of tokens (30T in the case of Qwen3-32B, for instance), so the individual tokens from the copyrighted material are highly unlikely to actually be *stored* in the model to be reproduced, and certainly not in sequence. They don't have the post-generation checks claimed by the SOTA models, but they're apparently still considered to be in the "< 1 in 10,000 completions will generate copyrighted code" territory. Given a human-language prompt, or a multi-agent pipelined "still human language, but from your architect agent" prompt, the likelihood of producing a string of copyrighted code that way is statistically very, very low.

I think we're at far more risk of contributors copy/pasting Stack Overflow or code from other projects than we are of modern genAI models producing blocks of copyrighted code.

All of which comes back to: if people disclose whether they used AI, which models, and whether they used the code or text the model wrote verbatim or used it as scaffolding and then heavily modified everything, I think we'll be in a pretty good spot.

On Wed, Jun 25, 2025, at 12:47 PM, David Capwell wrote:
>>
>>> 2. Models that do not do output filtering to restrict the reproduction of
>>> training data unless the tool can ensure the output is license compatible?
>>>
>>> 2 would basically prohibit locally run models.
>
> I am not for this for the reasons listed above. There isn’t a difference
> between this and a contributor copying code and sending it our way. We still
> need to validate the code can be accepted.
>
> We also have the issue of having this be a broad stroke. If the user asked a
> model to write a test for the code the human wrote, we reject the
> contribution as they used a local model? This poses very little copyright
> risk yet our policy would now reject it.
>
> Sent from my iPhone
>
>> On Jun 25, 2025, at 9:10 AM, Ariel Weisberg <ar...@weisberg.ws> wrote:
>> 2. Models that do not do output filtering to restrict the reproduction of
>> training data unless the tool can ensure the output is license compatible?
>>
>> 2 would basically prohibit locally run models.