On 2026-05-12 at 14:40+02:00, Hugo Buddelmeijer wrote:
[LLM] realizes the request is for a "classic function body"
and returns that. In order to accidentally end up
with problematic code, this needs to happen:
- The programmer unknowingly made a reference to problematic code,
e.g. the programmer coincidentally selected the same variable
names as John Carmack.
The space of identifier names is not large, and it should not
be unlikely for functions doing similar things in different codebases
to have similar or even identical names.
On 2026-05-12 at 14:40+02:00, Hugo Buddelmeijer wrote:
- [...] LLM to figure out [...]
- [...] LLM decides [...]
I'd refrain from implying that language models have deductive capability.
See also: https://openreview.net/forum?id=pMhTFUdM4G
On 2026-05-12 at 14:40+02:00, Hugo Buddelmeijer wrote:
- The code is indeed problematic, e.g. proprietary.
Free software licenses need to be complied with too,
and we need to know which project a snippet comes from for that to happen.
Quake's fast inverse square root is extremely popular,
hence it's statistically likely to show up. For something
with less competition, it's less clear if an LLM can return
the parent project with high accuracy.
On 2026-05-12 at 14:40+02:00, Hugo Buddelmeijer wrote:
- [...] safeguards from the LLM [service] do not flag this prompt
to deliberately/accidentally get copyrighted code
but what is the coverage of the guardrail's heuristics
(which are likely implemented as pattern matching on natural language,
like what was discovered in the recent Anthropic source leak)?
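
To illustrate the coverage concern, here is a minimal sketch of a
pattern-matching guardrail. Everything in it is an assumption made up
for illustration: the deny-list patterns, the is_flagged helper and the
sample prompts are hypothetical and not taken from any real service.

```python
import re

# Hypothetical deny-list; these patterns are illustrative assumptions,
# not the heuristics of any actual LLM service.
BLOCKED_PATTERNS = [
    re.compile(r"\bquake\b.*\binverse square root\b", re.IGNORECASE),
    re.compile(r"\bverbatim\b.*\b(source|code)\b", re.IGNORECASE),
]


def is_flagged(prompt: str) -> bool:
    """Return True if any deny-list pattern matches the prompt."""
    return any(p.search(prompt) for p in BLOCKED_PATTERNS)


prompts = [
    "Give me Quake's fast inverse square root",     # explicit request
    "float Q_rsqrt(float number) {",                # bare code prefix
    "Approximate 1/sqrt(x) with a magic constant",  # paraphrase
]

for p in prompts:
    print(is_flagged(p), p)
```

With these assumed patterns only the explicit request is flagged; the
bare code prefix and the paraphrase pass through, which is the kind of
gap I am asking about.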
On 2026-06-08 at 08:47+02:00, Hugo Buddelmeijer wrote:
Section II.D.2 (Training/Memorization)
The only argument seems to be that the training data is "in" the model
In citation 117,
OpenAI Reply Comments at 9 n.23 (explaining that pre-trained
language models can, “on rare occasions, ‘memorize’ training data
such that it may output a verbatim excerpt of that data
when prompted with a different portion of that data”).
On 2026-05-12 at 14:40+02:00, Hugo Buddelmeijer wrote:
It seems the only way to use genAI badly is to prompt them
so explicitly that the intent clearly is to get copyrighted
material out. I don't think we need a specific pledge
to not ask genAI for copyrighted material.
See githubcopilotlitigation.com, or
https://www.courtlistener.com/docket/65669506/doe-1-v-github-inc/
The second amended complaint (200) is sadly heavily redacted,
but please pay some attention to 60-74, 84-87, 114,
and 155-156 within that document.
114 might prove your point about a prompt with _bad intent_,
though for all we know, it could have been
    def __init__(self, string, number):
        self.string, self.number = string, number
Kind regards,
Phong