+1 to what Josh said
Sent from my iPhone

> On Jun 25, 2025, at 1:18 PM, Josh McKenzie <jmcken...@apache.org> wrote:
> 
> 
> Did some more digging. Apparently the way a lot of headline-grabbers have 
> been getting models to reproduce code verbatim is to prompt them with dozens 
> of verbatim tokens of copyrighted code as input, at which point the 
> completion is very heavily weighted toward regurgitating the original 
> implementation. Which makes sense; if you copy/paste 100 lines of 
> copyrighted code, the statistically likely completion is the rest of that 
> original implementation.
> 
> For local LLMs, verbatim reproduction is unlikely for a different reason but 
> apparently to a comparable degree: they have far fewer parameters (32B vs. 
> 671B for DeepSeek, for instance) relative to a pre-training corpus of 
> trillions of tokens (30T in the case of Qwen3-32B), so the individual tokens 
> from the copyrighted material are highly unlikely to actually be stored in 
> the model to be reproduced, and certainly not in sequence. They don't have 
> the post-generation checks claimed by the SOTA models, but are apparently 
> considered to be in "< 1 in 10,000 completions will generate copyrighted 
> code" territory.
> 
> When given a human-language prompt, or a multi-agent pipelined prompt 
> ("still human language, but from your architect agent"), the likelihood of 
> producing a string of copyrighted code in that manner is statistically very, 
> very low. I think we're at far more risk from contributors copy/pasting 
> Stack Overflow or code from other projects than we are from modern genAI 
> models producing blocks of copyrighted code.
> 
> All of which comes back to: if people disclose whether they used AI, which 
> models, and whether they used the code or text the model wrote verbatim or 
> used it as scaffolding and then heavily modified everything, I think we'll 
> be in a pretty good spot.
> 
>> On Wed, Jun 25, 2025, at 12:47 PM, David Capwell wrote:
>> 
>>> 2. Models that do not do output filtering to restrict the reproduction of 
>>> training data unless the tool can ensure the output is license compatible?
>>> 
>>> 2 would basically prohibit locally run models.
>> 
>> 
>> I am not for this, for the reasons listed above. There isn't a difference 
>> between this and a contributor copying code and sending it our way. We 
>> still need to validate that the code can be accepted.
>> 
>> We also have the issue of this being a broad stroke. If a user asked a 
>> model to write a test for code the human wrote, do we reject the 
>> contribution because they used a local model? That poses very little 
>> copyright risk, yet our policy would now reject it.
>> 
>> Sent from my iPhone
>> 
>>> On Jun 25, 2025, at 9:10 AM, Ariel Weisberg <ar...@weisberg.ws> wrote:
>>> 2. Models that do not do output filtering to restrict the reproduction of 
>>> training data unless the tool can ensure the output is license compatible?
>>> 
>>> 2 would basically prohibit locally run models.
> 
