Re: Package Updates and additions should mentions if LLM is used in the software or not

pelzflorian (Florian Pelz) Thu, 07 May 2026 03:03:23 -0700

Greg Hogan <[email protected]> writes:

> On Tue, May 5, 2026 at 5:45 PM pinoaffe <[email protected]> wrote:
>>
>> Greg Hogan <[email protected]> writes:
>> > LLM output is licensable so your concerns are allayed.
>>
>> The jury is very much still out on that:
>> - this has yet to be decided on by courts
>> - even if one particular jurisdiction decides on this matter, there are
>>   *many* more relevant jurisdictions
>> - and even if suddenly all jurisdictions were to agree on some specific
>>     copyright-status for LLM output, we as a community would *still*
>>     need to decide whether we want to recognize that and incorporate it
>
> Yes, but I think we would want to retain our free software identity.
> We don't want to sell our soul in opposition to a new technology.
>
>> There are many possible interpretations of the copyrightability of LLM
>> output, a selection:
>> - llm output is generally intellectual property of the user
>> - llm output is generally intellectual property of the organisation
>>   "hosting" the llm
>> - llm output is generally public domain
>> - llm output is neither anyone's IP nor in the public domain
>> - etc
>
> None of those options makes any difference to us. Present one case
> where this is a problem for free software.
>
>> And even if llm output is generally thought to be licensable, this
>> clearly cannot apply to any near-perfect copies of some part of its
>> training data that it may randomly emit, so incorporating llm output
>> into a GPL project would likely still be a legal risk
>
> This is not happening in 2026. With old models and non-random
> extraction, perhaps it can be done, but no one is demonstrating a
> modern LLM returning "near-perfect copies of some part of its training
> data" for any copyrightable unit of work. Just as with crypto, where
> important research is done on weakened algorithms (reduced iterations),
> the demonstrations of targeted extraction and fine-tuning are reducing
> our risk as mitigations are developed and applied.


The ongoing GEMA suit is an example where the LLM did print near-verbatim
song lyrics [1].  More generally, I remember Ekaitz’ suggestion in a mail
from March [2] to add these words to the manual:

- If a significant portion of your contribution (i.e. beyond simple
  autocomplete) was copied from somewhere else (i.e. AI, a website,
  another software project...) you are required to disclose it in the PR
  description.
- If you cannot guarantee the provenance and legal safety of your code,
  do not submit it.

from [2].

But my worry is that agents (more so than plain LLMs) obfuscate when
they steal: people will not know when their LLM contribution to Guix is
just a Scheme translation of other people’s copyrighted Rust code, or
was written by clickworkers.

That said, LLMs clearly show some intelligence of their own when
figuring out the Lean code on GitHub for the Erdős problems referenced
in [3]; such output would then clearly be usable public-domain code.

Regards,
Florian

[1] https://en.wikipedia.org/wiki/Artificial_intelligence_and_copyright#GEMA_v._OpenAI,_Inc.
[2] https://lists.gnu.org/archive/html/guix-devel/2026-03/msg00102.html
[3] https://arxiv.org/abs/2601.07421
