On Thu, 8 May 2025 at 12:46, Wouter Verhelst <wou...@debian.org> wrote:
>
> On Tue, May 06, 2025 at 12:02:08AM +0200, Aigars Mahinovs wrote:
> > The transformative criteria here is that the resulting work needs to be
> > transformed in such a way that it adds value. And generating new texts
> > from a LLM is pretty clearly a value-adding transformation compared to
> > the original articles. Even more so than the already ruled-on Google
> > Books case.
>
> OK, let me change it around a bit, because I don't think this discussion
> is going in any direction that is relevant for Debian.
>
> The only way in which you can build a model is by taking loads and loads
> of data, running some piece of software over it, and storing the result
> somewhere.
>
> How can we do this legally, reproducibly, and openly if we do not have
> the rights to redistribute the said "loads and loads of data"?
>
> The answer is, we can't.
Sure we can. It is a technical problem, actually. As long as the data is
still available, you can store and redistribute information about which
data you gathered, from where, and what it looked like - hashes of
copyrighted content are not copyrighted ;) We don't *need to* redistribute
the data itself.

In a more organised setup a developer of an LLM would simply write down
that they used the "Reddit all comments corpus" version 20250301-2 with a
sha256 hashsum of XXX, available over this magnet link (the link itself is
just a dressed-up hashsum). This is a fully sufficient Training Data
Information input to allow a different developer to acquire the same data
set (or a newer version of it, if they wish) and conduct the same
training. (A small, hypothetical sketch of what such a record and its
verification could look like is at the end of this mail.)

Saying "get the latest https://dumps.wikimedia.org/enwiki/latest/ dump"
(or a live text download/dump from any other public website) is no
different technically; it just makes every recreation use the newest state
of the source data instead of a frozen snapshot. That might be sub-optimal
for stable, but we have this problem anyway with many data sets and
software packages that do not really make sense in a frozen state after a
few months or years (like virus definitions).

Debian or the developers in question do NOT need to have the legal rights
to *redistribute* this data. They only need the rights to acquire it and
to use it for training, which is (expected to be) covered by the fair use
exception in US law and by the data mining exception in EU law.

The whole point of the OSI definition is to make sure that a skilled
person with enough resources *does* have enough information available to
retrace the steps that created the model.

> Therefore, I conclude that, practically, we cannot include models in
> Debian if we want them to be reproducible.

Adding reproducibility to the DFSG as a criterion for software to become
non-free would be a *very* different GR.

> The fact that the model does something vaguely and remotely similar to a
> biological process of training and learning in humans, and that
> therefore some people have taken to naming the process of running
> advanced statistical analysis over data to build such a model also
> "training" is a red herring. The two processes are very different and
> cannot be compared as a practical matter.

It is very much training. An LLM does not memorise, copy or compress its
inputs. It *learns* the statistical probabilities of certain words
following certain other words in a certain context. That is literally the
only thing the LLM model is - a list of probabilities. (The toy sketch at
the very end of this mail illustrates the idea.) It does not *understand*
what it is learning - it does not construct an internal model of the
world, of the objects in it and of their interactions - but it is for
sure learning.

--
Best regards,
    Aigars Mahinovs
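
P.S. To make the "write down which data you used" idea concrete, here is a
minimal sketch (in Python) of what such a Training Data Information record
and its verification could look like. The dataset name, hashsum and magnet
link are placeholders I made up for illustration, not a real corpus:

import hashlib
import json

# A record a model author could publish instead of the data itself.
# All values below are made-up placeholders for illustration only.
MANIFEST = {
    "name": "example-comments-corpus",
    "version": "20250301-2",
    "sha256": "0" * 64,                   # the published hashsum
    "source": "magnet:?xt=urn:btih:...",  # placeholder locator
}

def sha256_of(path, chunk_size=1 << 20):
    """Stream a locally acquired copy of the data set and hash it."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify(path, manifest=MANIFEST):
    """Check that an independently downloaded copy matches the record."""
    ok = sha256_of(path) == manifest["sha256"]
    print(json.dumps({"dataset": manifest["name"], "verified": ok}))
    return ok

A different developer holding only this record can fetch the data
themselves and confirm, bit for bit, that they are training on the same
input - without Debian ever redistributing the data.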
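
And, purely as a toy illustration of the "list of probabilities" point - a
bigram counter, nothing like an actual transformer, but it shows that what
training stores is conditional probabilities of one word following
another, not a copy of the input text:

from collections import Counter, defaultdict

def train_bigrams(corpus):
    """Count how often each word follows each other word, then normalise."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    # Turn raw counts into conditional probabilities P(next | previous).
    return {
        prev: {w: n / sum(c.values()) for w, n in c.items()}
        for prev, c in counts.items()
    }

model = train_bigrams(["the cat sat", "the cat ran", "the dog sat"])
print(model["cat"])  # {'sat': 0.5, 'ran': 0.5} - probabilities, not sentences

A real LLM conditions on much longer contexts and keeps the probabilities
implicitly in its weights rather than in a table, but the nature of what
is retained is the same.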