I'm not going to discuss the other points of your mail with you, as you
frequently present beliefs as facts and have attacked, in an
unacceptable manner, other people who disagree with them.
I'm going to address some of your technical arguments just to
demonstrate that this issue isn't as black and white as you present it,
but extremely complicated. I'm doing it for the benefit of the list.
First, I acknowledge that an LLM regurgitating text is a problem -- one
of many problems.
On 2026-02-19 05:47, Thorsten Glaser wrote:
> And please, do not use the word “generate”, they don’t generate
> (generative art is something entirely different and good), they
> regurgitate.
Of course they can be generative. A simple counterexample to your
argument is a model that hasn't been trained on any data yet, IOW one
whose weights are still randomly initialized.
The generative aspect is also visible when you input nonsensical data or
data that the model hasn't seen yet. You'll still get a sequence out of
it.
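The untrained-weights point can be sketched with a toy example (this is
an illustrative Python sketch, not a real LLM; the "model" here is just
random per-token scores standing in for randomly initialized weights):

```python
import random

# Hypothetical toy "model": its weights are random and it has never seen
# any training text. It still emits a token sequence for any prompt --
# generation with nothing to regurgitate.
random.seed(0)
vocab = list("abcdefghijklmnopqrstuvwxyz ")
# Random "weights": one arbitrary, untrained score per vocabulary item.
weights = [random.random() for _ in vocab]

def generate(prompt, n=20):
    out = list(prompt)
    for _ in range(n):
        # Sample the next token from the untrained distribution.
        out.append(random.choices(vocab, weights=weights)[0])
    return "".join(out)

print(generate("hello "))
```

The output is gibberish, of course, but it is a generated sequence: no
training corpus exists for it to be a derivative of.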
The generative aspect is also visible in the thinking models, which
generate "thinking" tokens that aren't part of text that a normal human
would even write.
> LLMs are a sort of lossy compressor/decompressor, with
> the decompression attempting a best *average* match to continue the
> “prompt” (it’s really just autocomplete with sparks).
You literally just made the statistical argument ("best *average*
match") that the proponents are also making.
I don't think it's just "autocomplete with sparks" anymore (early
ChatGPT was), but note that this trivialization also undercuts the
copyright argument you are making here.
> HOWEVER, this does not mean that their output is free from copyright.
> Rather, due to the above-mentioned properties (machine transformation
> of copyrighted works), the sum of all outputs from such a model is a
> derived work from all of its inputs (and for how much this is true for
> each individual combination of input and output of course depends on
> the prompt, PRNG seed and output in question).
In the case where an LLM does not reproduce some existing text:
Given that these models are trained on insane amounts of text, any
single training example most likely contributes only an infinitesimal
amount to the statistical average you speak of.
Or, formulated differently: if you take any one example out, the model
will most likely not change meaningfully or even at all.
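The leave-one-out point can be made concrete with a toy calculation (a
sketch only; the numbers below stand in for per-example contributions
to an aggregate statistic, they are not real model weights):

```python
# With a large enough "training set", removing any single example barely
# moves the aggregate statistic it contributes to.
n = 1_000_000
values = [i % 100 for i in range(n)]  # stand-in per-example contributions

full_avg = sum(values) / n                     # trained on everything
loo_avg = (sum(values) - values[0]) / (n - 1)  # one example removed

# The difference is vanishingly small.
print(abs(full_avg - loo_avg))
```

With a million examples the shift is on the order of 1e-5; with the
trillions of tokens real models are trained on, it is smaller still.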
Problem #1: How could one argue for copyright protection of a
particular example, if removing that example from model training has no
meaningful impact?
Problem #2: If you want to argue that even infinitesimally small
contributions to a model trigger copyright protection, then every single
human being who has ever read and learned from other people's code is
guilty of this as well.
Whether consciously or subconsciously, other people's code affects your
own code, even if only in a minuscule way. And that's OK!
But from a legal point of view, why would one hold a model to some
higher standard? If you don't see the requirement to cite every single
work that has possibly influenced your own, even in an infinitesimal
way, then under what legal doctrine would this be different for a model?
> This does not, of course,
> give you carte blanche to just use *any* of its output… not even small
> ones. Citing rules do exist, after all. Especially the academics should
> know some…
I asked an LLM to correct the spelling and grammar of the following
sentence, which I obviously made up entirely to prove a point:
"The apphoristic submareeen viciously, crawled, thru my favorite
virtual sandwhich."
It produced:
"The aphoristic submarine viciously crawled through my favorite
virtual sandwich."
As a thought experiment: whose copyright could have been violated here
if I use this output? Are spelling or grammar copyrighted? Why would
anyone need to be cited for this?
To get back to the generative argument, I asked Gemini to continue the
nonsensical sentence above [1]. Again, whose copyright is being
violated here?
As I said on -private, there are numerous more thought-provoking
examples which demonstrate that this issue is not as clear as you
present it, all while acknowledging that (serious) problems do exist.
But where hasn't this been the case with a new revolutionary technology?
I think that as with any other revolutionary technology, we should
actively contribute and thus help shape it. For example, I was so happy
to see this at DebConf25 [2].
What saddens me the most, though, is your framing of this as Capitalism
just exploiting everyone; you've clearly not yet experienced how
transformative LLMs can be for the average person, especially for the
computer-illiterate who (through no fault of their own) do not possess
the skills to do anything technical.
Christian
[1]: https://kagi.com/assistant/162d9686-e2b5-4242-879b-44e8a7451d2b
[2]:
https://debconf25.debconf.org/talks/117-apt-install-lucie-from-source-lucie-the-fully-libre-llm-you-can-build-hack-and-trust/