On Sun, 4 May 2025 at 13:12, Wouter Verhelst <wou...@debian.org> wrote:
> On Tue, Apr 29, 2025 at 03:17:52PM +0200, Aigars Mahinovs wrote:
> > However, here we have a clear and fundamental change happening in the
> > copyright law level - there is a legal break/firewall that is happening
> > during training. The model *is* a derivative work of the source code of
> > the training software, but is *not* a derivative work of the training
> > data.
>
> I would disagree with this statement. How is a model not a derivative
> work of the training data? Wikipedia defines it as

The simple fact that none of the LLMs have been sued out of existence by
*any* copyright owner is de facto proof that it does not work that way in
the eyes of the judicial system. The Wikipedia definition is a layman's
simplification. The actual law is detailed by decisions in a huge number
of different court cases and differs significantly between jurisdictions
as well.

A programmer is also able to reproduce code that they have seen before in
a way that would constitute a copyright violation. That does not mean that
ALL output of that programmer constitutes a derivative work of all code
they have seen in their lifetime. This shows that there is some step or
process that interrupts the chain of derivation for copyright purposes.
There is no real reason why a process done by a human brain should be
legally different from one done by a computer.

Take for example the common process of black-box reimplementation:

* team 1 reads the original code and writes a specification document
  describing how the software works
* team 2 reads the specification and implements a new program that
  satisfies this specification

Nowadays nothing technically prevents both of those tasks from being done
by software - you can write software that analyses the source code and
writes a functional specification, and you can write software that
implements a program from a functional specification. In any language.
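As a rough illustration that parts of such a pipeline are purely
mechanical, here is a minimal sketch (in Python, using the standard
library's difflib; the function name and the 0.8 threshold are
assumptions chosen for illustration, not any real tool's API) of a
separate gate component that rejects output too similar to the original
source:

```python
# Hypothetical similarity gate for an automated clean-room pipeline.
# Rejects generated code that is too close to the original source,
# even if it was produced independently. Threshold is an assumption.
import difflib

SIMILARITY_THRESHOLD = 0.8  # illustrative cutoff; a real gate would tune this


def too_similar(original: str, candidate: str,
                threshold: float = SIMILARITY_THRESHOLD) -> bool:
    """Return True if the candidate text is suspiciously close to the original."""
    ratio = difflib.SequenceMatcher(None, original, candidate).ratio()
    return ratio >= threshold


original = "def add(a, b):\n    return a + b\n"
verbatim_copy = original
independent = "print('hello world')\n"

print(too_similar(original, verbatim_copy))  # True - a verbatim copy is rejected
print(too_similar(original, independent))
```

A real pipeline would of course use something more robust than raw text
similarity (token-level or AST-level comparison), but the point stands:
the check can be an independent, mechanical component.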
Whether it is done by software or by humans, the original copyright will
no longer apply to the output. It will not, legally, be a derived work.
It helps if the process is not fully deterministic, for example if there
is testing, refinement, and evolution of the code happening in between.

Not reproducing the source exactly is an easy check that can be done by a
separate component, rejecting code that is too similar to the original
(even if it was written independently). This is also sometimes used in
human rewrites, for the same reason.

The same logic applies very cleanly to an LLM - the training is the first
step of the transformation and using the LLM is the second step.

--
Best regards,
Aigars Mahinovs