Hi Hugo,

I'm redirecting the GCD 8 PR comment thread here because linear
conversation and the talking stick do not work in long form:
https://codeberg.org/guix/guix-consensus-documents/pulls/13#issuecomment-16813481

On 2026-05-12 at 14:40+02:00, Hugo Buddelmeijer wrote:
> [LLM] realizes the request is for a "classic function body"
> and returns that.  In order to accidentally end up
> with problematic code, this needs to happen:
>
> - The programmer unknowingly made a reference to problematic code,
>   e.g. the programmer coincidentally selected the same variable
>   names as John Carmack.

The space of identifier names is not large, and it would not be
unlikely for functions doing similar things in different codebases
to have similar or even identical names.

On 2026-05-12 at 14:40+02:00, Hugo Buddelmeijer wrote:
> - [...] LLM to figure out [...]
> - [...] LLM decides [...]

I'd refrain from implying that language models are capable of
deduction.  See also: https://openreview.net/forum?id=pMhTFUdM4G

On 2026-05-12 at 14:40+02:00, Hugo Buddelmeijer wrote:
> - The code is indeed problematic, e.g. proprietary.

Free software licenses need to be complied with too, and for that
to happen we need to know which project a snippet comes from.
Quake's fast inverse square root is extremely popular, hence it's
statistically likely to show up.  For something with less competition,
it's less clear whether an LLM can return the parent project
with high accuracy.

On 2026-05-12 at 14:40+02:00, Hugo Buddelmeijer wrote:
> - [...] safeguards from the LLM [service] do not flag this prompt
>   to deliberately/accidentally get copyrighted code

but what is the coverage of the guardrail's heuristics (which are
likely implemented as pattern matching on natural language, like
what was discovered in the recent Anthropic source leak)?

On 2026-06-08 at 08:47+02:00, Hugo Buddelmeijer wrote:
> > Section II.D.2 (Training/Memorization)
>
> The only argument seems to be that the training data is "in" the model

In citation 117,

> OpenAI Reply Comments at 9 n.23 (explaining that pre-trained
> language models can, “on rare occasions, ‘memorize’ training data
> such that it may output a verbatim excerpt of that data
> when prompted with a different portion of that data.

On 2026-05-12 at 14:40+02:00, Hugo Buddelmeijer wrote:
> It seems the only way to use genAI badly is to prompt them
> so explicitly that the intent clearly is to get copyrighted
> material out.  I don't think we need a specific pledge
> to not ask genAI for copyrighted material.

See githubcopilotlitigation.com, or
https://www.courtlistener.com/docket/65669506/doe-1-v-github-inc/

The second amended complaint (200) is sadly heavily redacted,
but please pay some attention to 60-74, 84-87, 114, 155-156
within that document.  114 might prove your point about a prompt
with _bad intent_, though for all we know, it could have been

    def __init__(self, string, number):
        self.string, self.number = string, number

Kind regards,
Phong
