On Tue, Apr 29, 2025 at 03:17:52PM +0200, Aigars Mahinovs wrote:
> However, here we have a clear and fundamental change happening in the
> copyright law level - there is a legal break/firewall that is happening
> during training. The model *is* a derivative work of the source code of
> the training software, but is *not* a derivative work of the training
> data.

I would disagree with this statement. How is a model not a derivative
work of the training data? Wikipedia defines a derivative work as follows:

  In copyright law, a derivative work is an expressive creation that
  includes major copyrightable elements of a first, previously created
  original work (the underlying work). [1]

As models are often able to regurgitate copyrighted works (largely)
verbatim, that definition seems to me to apply to models.

[1] https://en.wikipedia.org/wiki/Derivative_work

> This means that we also have to consider what exactly is training
> data and how to deal with it, without automatically falling back to
> equating it with source code.

We have a very wide definition of "source code" in Debian. To us, source
code is not limited to software written in a common programming
language; instead, our definition also covers things such as SVG files,
LibreOffice documents, GIMP XCF files, etc.

In this context, I don't think that equating training data to source
code is too wild a thing to do.

-- 
w@uter.{be,co.za}
wouter@{grep.be,fosdem.org,debian.org}

I will have a Tin-Actinium-Potassium mixture, thanks.