IMHO we have here a very annoying mixture of technical, legal and philosophical problems.
Hypothetical 1: Bob reads all programming manuals and all DFSG-free code in Debian and on GitHub and teaches themselves Python programming. They are asked to solve a simple problem. Their answer basically matches sample solutions from a few Python coding manuals. Can Bob release this solution as DFSG-free code? Does it matter whether the specific programming manual or Python course material was DFSG-free licensed or not? Does it matter if the manual had a GPL license? What if they learned in a university setting?

Hypothetical 2: An abstract AI, Alice, goes through the same learning process as Bob and produces the same output in answer to the same request. Do the conditions on the output of Alice change? Is the change technical or legal/philosophical? You could call this a Turing test for copyright.

Processing of experiences into expert opinion is IMHO not directly comparable to compilation of source code into a binary, regardless of whether it is done by a human or by a software system. Copyright law makes a distinction here for humans. And while no explicit legal precedent has yet been set for any kind of AI (including LLMs), the very absence of massive copyright-violation lawsuits from very sue-happy corporations, like Disney, is already a noteworthy precedent. If LLMs from Meta and OpenAI (and others) are not being sued for massive copyright violations, then it is the consensus of our society and of our legal system that the same kind of expert-opinion/learning protections that humans enjoy also seem to apply to sufficiently complex artificial expert systems.

One hand-wavy legal loophole could be that the learning process splits the copyrighted works into chunks small enough that none of them would legally retain copyright protection anymore. But that is just one of many speculations until a law or a court establishes such guidelines. We, as Debian (and as a free software community), do not *have to* be satisfied with such a status quo, but it can be a starting position.
And this is in *stark* contrast to our existing DFSG framework, which operates within a very strict and very well-defined copyright legal framework that has been litigated in all kinds of courts of law over the centuries.

Use of AI for DFSG-free work is a related topic that can be solved this way. If the AI output is not derived from its learning material in the copyright-law sense, then there are zero issues with using AI assistance tools in developing DFSG-free software. Using even a non-free AI assistant tool in the development process is then equivalent to using a non-free IDE: it might be icky, but it for sure does not render the resulting code non-free.

What does that mean in terms of this proposal (or a potential alternative proposal)? If we take as a given that copyright does *not* survive the learning process of a (sufficiently complex) AI system, then it is *not* necessary for all the *data* used to train a DFSG-free AI to also be DFSG-free. It is, however, necessary that:

* the software needed for inference (usage) of the AI model be DFSG-free
* the software needed for the training process of the AI model be DFSG-free
* the software needed to gather, assemble and process the training data be DFSG-free, or the manual process for it be documented

In this perspective, we would be treating the training data itself as immutable and uncopyrightable facts of world and nature, like the positions and spectra of stars in the sky (because its copyright does not survive the learning process). It is data that can be gathered again, maybe with slight variation in the results, and it does not really change based on who does the gathering (assuming similar resources are invested).
Such an approach would allow:

* entrance into main for AI models that are (at least technically) re-creatable with free software (and a world to be observed)
* clear exclusion of AI that is not free (from the software perspective)
* clear guidelines on improvements in process description and/or documentation for expert-derived binary data already in the archive
* a clear position on the (unenforceable) usage of AI assistant tooling in the development of DFSG-free software

*If* the legal opinion of society at large changes over time and courts rule that Meta/OpenAI/... actually *do* violate copyright in the process of creating and training their AI models or providing their output, then *that* would be a good time for Debian to re-evaluate this position.

On Wed, 23 Apr 2025 at 12:33, Matthias Urlichs <matth...@urlichs.de> wrote:
> On 22.04.25 22:59, Ansgar 🙀 wrote:
> > But the practical effects of passing the GR is probably (among other things):
> > a) Removal of OCR software (like tesseract[1])
> > b) Removal of image recognition software (like opencv[2])
> > c) Possibly removal of text-to-speech software (like festival[3] or flite[4])
>
> You might want to write a counter-proposal B, then.
>
> Or even a proposal C that's more nuanced.
>
> I mean, with the right prompt you can get many AI models to regurgitate
> some of the texts or images they've been trained with. TTBOMK it's
> mostly-impossible to do that with Tesseract or OpenCV.
>
> NB, do we really need to *remove* these packages? or maybe just move them
> to contrib, and their model files to non-free?
>
> --
> -- regards
> --
> -- Matthias Urlichs

--
Best regards,
Aigars Mahinovs