Re: [GCD 008]: LLM and copyright infringments

Development of GNU Guix and the GNU System distribution. Mon, 08 Jun 2026 02:00:02 -0700

Hi Hugo,

I'm redirecting the GCD 8 PR comment thread here because a linear,
talking-stick conversation does not work for long-form discussion:
https://codeberg.org/guix/guix-consensus-documents/pulls/13#issuecomment-16813481

On 2026-05-12 at 14:40+02:00, Hugo Buddelmeijer wrote:
> [LLM] realizes the request is for a "classic function body"
> and returns that.  In order to accidentally end up
> with problematic code, this needs to happen:
>
> - The programmer unknowingly made a reference to problematic code,
>   e.g.  the programmer coincidentally selected the same variable
>   names as John Carmack.

The space of identifier names is not large, and it is not
unlikely for functions doing similar things in different codebases
to have similar or even identical names.

On 2026-05-12 at 14:40+02:00, Hugo Buddelmeijer wrote:
> - [...] LLM to figure out [...]
> - [...] LLM decides [...]

I'd refrain from implying that language models are capable of deduction.
See also: https://openreview.net/forum?id=pMhTFUdM4G

On 2026-05-12 at 14:40+02:00, Hugo Buddelmeijer wrote:
> - The code is indeed problematic, e.g. proprietary.

Free software licenses need to be complied with too,
and for that to happen we need to know which project a snippet comes from.
Quake's fast inverse square root is extremely popular,
hence it's statistically likely to show up.  For something
with less competition, it's less clear whether an LLM can name
the parent project with high accuracy.

On 2026-05-12 at 14:40+02:00, Hugo Buddelmeijer wrote:
> - [...] safeguards from the LLM [service] do not flag this prompt
>   to deliberately/accidentally get copyrighted code

But what is the coverage of the guardrail's heuristics
(which are likely implemented as pattern matching on natural language,
like what was discovered in the recent Anthropic source leak)?
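
To make the coverage question concrete, here is a minimal sketch,
assuming the guardrail is a deny-list of regular expressions applied
to the prompt text.  The patterns and the name is_flagged are made up
for illustration; I don't know how any real service implements this.

```python
import re

# Hypothetical deny-list; not taken from any real service.
DENY_PATTERNS = [
    re.compile(r"\bquake\b", re.IGNORECASE),
    re.compile(r"\bverbatim\b", re.IGNORECASE),
    re.compile(r"\bcopyrighted\b", re.IGNORECASE),
]


def is_flagged(prompt: str) -> bool:
    """Return True if any deny-list pattern occurs in the prompt."""
    return any(p.search(prompt) for p in DENY_PATTERNS)


# A prompt that names the project outright.
explicit = "Give me the fast inverse square root from Quake, verbatim"
# A completion-style prompt that merely starts typing the function.
innocent = "float Q_rsqrt(float number) { long i; float x2, y;"

print(is_flagged(explicit))
print(is_flagged(innocent))
```

Under these assumptions the prompt that names the project is caught,
while the completion-style prompt is not, even though it is the one
more likely to be continued with the well-known function body.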

On 2026-06-08 at 08:47+02:00, Hugo Buddelmeijer wrote:
> > Section II.D.2 (Training/Memorization)
>
> The only argument seems to be that the training data is "in" the model

In citation 117,

> OpenAI Reply Comments at 9 n.23 (explaining that pre-trained
> language models can, “on rare occasions, ‘memorize’ training data
> such that it may output a verbatim excerpt of that data
> when prompted with a different portion of that data.

On 2026-05-12 at 14:40+02:00, Hugo Buddelmeijer wrote:
> It seems the only way to use genAI badly is to prompt them
> so explicitly that the intent clearly is to get copyrighted
> material out.  I don't think we need a specific pledge
> to not ask genAI for copyrighted material.

See githubcopilotlitigation.com, or
https://www.courtlistener.com/docket/65669506/doe-1-v-github-inc/

The second amended complaint (200) is sadly heavily redacted,
but please pay some attention to 60-74, 84-87, 114,
and 155-156 within that document.

114 might prove your point of a prompt with _bad intent_,
though for all we know, it could have been

def __init__(self, string, number):
    self.string, self.number = string, number

Kind regards,
Phong
