IMHO we have here a very annoying mixture of technical, legal and
philosophical problems.

Hypothetical 1: Bob reads all programming manuals and all DFSG-free code in
Debian and on GitHub and teaches themselves Python programming. They are asked
to solve a simple problem. Their answer basically matches sample solutions
from a few Python coding manuals.

Can Bob release this solution as DFSG-free code? Does it matter if the
specific programming manual or Python course manual was DFSG-free licensed
or not? Does it matter if the manual had a GPL license? What if they
learned in a university setting?

Hypothetical 2: An abstract AI, Alice, goes through the same learning
process as Bob and produces the same output in response to the same request.

Do the conditions on the output of Alice change? Is the change technical or
legal/philosophical? You could call this a Turing test for copyright.


Processing of experiences into expert opinion is IMHO not directly
comparable to compilation of source code into a binary, regardless of
whether it is done by a human or by a software system. Copyright law makes
a distinction here for humans. And while no explicit legal precedent has
yet been set for any kind of AI (including LLMs), the very lack of massive
copyright violation lawsuits from notoriously sue-happy corporations, like
Disney, is already a noteworthy precedent. If LLMs from Meta and OpenAI
(and others) are not being sued for massive copyright violations, then it
is the consensus of our society and of our legal system that the same kind
of expert opinion / learning protections that humans enjoy also seem to
apply to sufficiently complex artificial expert systems. One hand-wavy
legal loophole could be that the learning process splits the copyrighted
works into chunks small enough that none of those chunks would legally
retain copyright protection anymore. But that is just one of many
speculations until a law or a court establishes such guidelines.

We, as Debian (and as a free software community), do not *have to* be
satisfied with such a status quo, but it can be a starting position. And
this is in *stark* contrast to our existing DFSG framework, which operates
within a very strict and very well-defined copyright legal framework that
has been litigated in all kinds of courts of law over the centuries.

Use of AI for DFSG-free work is a related topic that can also be solved
this way. If the AI output is not derived, in the copyright law sense,
from its learning material, then there are zero issues in using AI
assistance tools in developing DFSG-free software. Using even non-free AI
assistant tooling in the development process is then equivalent to using a
non-free IDE: it might be icky, but it for sure does not render the
resulting code non-free.

What does that mean in terms of this proposal (or a potential alternative
proposal)?

If we take as a given that copyright does *not* survive the learning
process of a (sufficiently complex) AI system, then it is *not* necessary
for all the *data* used to train a DFSG-free AI to also be DFSG-free.
It is however necessary that:
* the software needed for inference (usage) of the AI model is DFSG-free
* the software needed for the training process of the AI model is DFSG-free
* the software needed to gather, assemble and process the training data is
DFSG-free, or the manual process for it is documented

In this perspective, we would be treating the training data itself as
immutable and uncopyrightable facts of the world and of nature, like the
positions and spectra of stars in the sky (because its copyright does not
survive the learning process). It is data that can be gathered again,
maybe with slight variations in the results, and it does not really change
based on who does the gathering (assuming similar resources get invested).

Such an approach would allow:
* entrance into main for AI models that are (at least technically)
re-creatable with free software (and a world to be observed)
* clear exclusion of AI that is not free (from the software perspective)
* clear guidelines on improvements in process description and/or
documentation for expert-derived binary data already in the archive
* clear position on (unenforceable) usage of AI assistant tooling in
development of DFSG-free software


*If* the legal opinion of the society at large changes over time and courts
rule that Meta/OpenAI/... actually *do* violate copyright in the process of
creating and training their AI models or providing their output, then
*that* would be a good time for Debian to re-evaluate this position.

On Wed, 23 Apr 2025 at 12:33, Matthias Urlichs <matth...@urlichs.de> wrote:

> On 22.04.25 22:59, Ansgar 🙀 wrote:
>
> But the practical effects of passing the GR is probably (among other
> things):
>
> a) Removal of OCR software (like tesseract[1])
> b) Removal of image recognition software (like opencv[2])
> c) Possibly removal of text-to-speech software (like festival[3] or
> flite[4])
>
> You might want to write a counter-proposal B, then.
>
> Or even a proposal C that's more nuanced.
>
> I mean, with the right prompt you can get many AI models to regurgitate
> some of the texts or images they've been trained with. TTBOMK it's
> mostly-impossible to do that with Tesseract or OpenCV.
>
> NB, do we really need to *remove* these packages? or maybe just move them
> to contrib, and their model files to non-free?
>
> --
> -- regards
> --
> -- Matthias Urlichs
>
>

-- 
Best regards,
    Aigars Mahinovs
