+1 to what Josh said

Sent from my iPhone
> On Jun 25, 2025, at 1:18 PM, Josh McKenzie <jmcken...@apache.org> wrote:
>
> Did some more digging. Apparently the way a lot of headline-grabbers have been making models reproduce code verbatim is to prompt them with dozens of verbatim tokens of copyrighted code as input, where the completion is then very heavily weighted toward regurgitating the initial implementation. Which makes sense; if you copy/paste 100 lines of copyrighted code, the statistically likely completion is that initial implementation.
>
> For local LLMs, verbatim reproduction is unlikely for a different but apparently comparable reason: they have far fewer parameters (32B vs. 671B for DeepSeek, for instance) relative to their pre-training corpus of trillions of tokens (30T in the case of Qwen3-32B, for instance), so the individual tokens from the copyrighted material are highly unlikely to actually be stored in the model to be reproduced, and certainly not in sequence. They don't have the post-generation checks claimed by the SOTA models, but are apparently considered to be in the "< 1 in 10,000 completions will generate copyrighted code" territory.
>
> When asked a human-language prompt, or a multi-agent pipelined "still human language but from your architect agent" prompt, the likelihood of producing a string of copyrighted code in that manner is statistically very, very low. I think we're at far more risk of contributors copy/pasting Stack Overflow or code from other projects than we are of modern genAI models producing blocks of copyrighted code.
>
> All of which comes back to: if people disclose whether they used AI, what models, and whether they used the code or text the model wrote verbatim or used it as scaffolding and then heavily modified everything, I think we'll be in a pretty good spot.
>
>> On Wed, Jun 25, 2025, at 12:47 PM, David Capwell wrote:
>>
>>> 2. Models that do not do output filtering to restrict the reproduction of training data unless the tool can ensure the output is license compatible?
>>>
>>> 2 would basically prohibit locally run models.
>>
>> I am not for this, for the reasons listed above. There isn't a difference between this and a contributor copying code and sending it our way. We still need to validate that the code can be accepted.
>>
>> We also have the issue of this being a broad stroke. If a user asked a model to write a test for code the human wrote, do we reject the contribution because they used a local model? This poses very little copyright risk, yet our policy would now reject it.
>>
>> Sent from my iPhone
>>
>>> On Jun 25, 2025, at 9:10 AM, Ariel Weisberg <ar...@weisberg.ws> wrote:
>>>
>>> 2. Models that do not do output filtering to restrict the reproduction of training data unless the tool can ensure the output is license compatible?
>>>
>>> 2 would basically prohibit locally run models.
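
For reference, a back-of-the-envelope sketch of the parameters-vs-training-tokens ratio Josh mentions, using only the figures quoted in the thread (32B parameters, ~30T pre-training tokens for Qwen3-32B; these numbers are taken from the email above, not verified against any model card):

    # Rough ratio of pre-training tokens to parameters for a local model,
    # using the figures quoted in the thread (assumed, not verified).
    params = 32e9          # ~32B parameters
    train_tokens = 30e12   # ~30T pre-training tokens

    tokens_per_param = train_tokens / params
    print(f"~{tokens_per_param:.0f} training tokens per parameter")
    # Prints ~938 tokens per parameter: far too little capacity to store the
    # training corpus verbatim, which is the intuition behind the low
    # verbatim-reproduction estimate above.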