On Sun, 4 May 2025 at 13:12, Wouter Verhelst <wou...@debian.org> wrote:

> On Tue, Apr 29, 2025 at 03:17:52PM +0200, Aigars Mahinovs wrote:
> >    However, here we have a clear and fundamental change happening in the
> >    copyright law level - there is a legal break/firewall that is
> happening
> >    during training. The model *is* a derivative work of the source code
> of
> >    the training software, but is *not* a derivative work of the training
> >    data.
>
> I would disagree with this statement. How is a model not a derivative
> work of the training data? Wikipedia defines it as
>
>
The simple fact that none of the LLMs has been sued out of existence by
*any* copyright owner is de facto evidence that it does not work that way
in the eyes of the judicial system. The Wikipedia definition is a layman's
simplification. The actual law is detailed by decisions in a huge number
of different court cases and differs significantly between jurisdictions
as well.

A programmer is also able to reproduce code that they have seen before in a
way that would constitute a copyright violation. That does not mean that
ALL output of that programmer constitutes a derivative work of all code
they have seen in their lifetime. This shows that there is some step or
process that interrupts the chain of derivation for copyright purposes.
There is no real reason why a process performed by a human brain should be
legally different from the same process performed by a computer.

Take for example the common process of black-box reimplementation:
* team 1 reads the original code and writes a specification document on how
the software works
* team 2 reads the specification and writes a new program that implements
this specification

Nowadays nothing technically prevents both of those tasks from being done
by software - you can write software that analyses the source code and
writes a functional specification, and software that implements a program
from that functional specification, in any language. Whether it is done by
software or by humans, the original copyright will no longer apply to the
output. It will not be, legally, a derivative work. It helps if the
process is not fully deterministic, for example if there is testing,
refinement and evolution of code happening in between. Verifying that the
source is not reproduced exactly is an easy judgement that a separate
component can make, rejecting code that is too similar to the original
(even if it was written independently). This is also sometimes done in
human rewrites, for the same reason.
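As a rough illustration, such a rejection component could be a simple
similarity gate. This is a minimal sketch, not any existing tool: the use
of Python's difflib and the 0.9 threshold are my own assumptions.

```python
import difflib

def too_similar(original: str, generated: str, threshold: float = 0.9) -> bool:
    """Reject generated code that is too close to the original source.

    The threshold of 0.9 is an arbitrary, illustrative choice - a real
    system would tune it (and likely normalise whitespace, identifiers,
    etc. before comparing).
    """
    ratio = difflib.SequenceMatcher(None, original, generated).ratio()
    return ratio >= threshold

original = "def add(a, b):\n    return a + b\n"
verbatim_copy = original
independent = "def sum_two(x, y):\n    total = x + y\n    return total\n"

print(too_similar(original, verbatim_copy))  # True: a verbatim copy is rejected
print(too_similar(original, independent))    # False: an independent rewrite passes
```

The point is only that "is this output a near-copy?" can be checked
mechanically, separately from how the output was produced.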

The same logic applies very cleanly to an LLM - the training is the first
step of the transformation and using the LLM is the second step.

-- 
Best regards,
    Aigars Mahinovs
