-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA384

On Fri, 9 May 2025, Simon Josefsson wrote:

>> So, with all the updates, maybe something like this?
>
>I read this now, and think it is an improvement so I'll second this
>version too.

OK, thanks.

>I realized that I have one additional generic concern: You claim that
>models are a derivate work of their training input.

Yes. This is easily shown, for example by looking at how they work
(https://explainextended.com/2023/12/31/happy-new-year-15/ explains
this well) and in papers like “Extracting Training Data from ChatGPT”.
It is a sort of lossy compression that has been shown to be
sufficiently un-lossy (forgive my lack of English) that recognisable
“training data” can be recalled, and the operators’ “fix” was to add
filters to the prompts, not to make recall impossible, because they
cannot.
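
To make the lossy-compression analogy concrete, here is a toy sketch
(my own illustration, not from the paper): a character-level n-gram
“model” trained on a single sentence will, when prompted with the
start of that sentence, replay its training data verbatim; the same
memorisation effect, in miniature.

```python
from collections import defaultdict

def train(text, order=8):
    """Build a character-level n-gram table: 8-char context -> next chars."""
    model = defaultdict(list)
    for i in range(len(text) - order):
        model[text[i:i + order]].append(text[i + order])
    return model

def generate(model, seed, length=60, order=8):
    """Greedily continue from a seed context; with little training
    data, this simply replays the memorised input verbatim."""
    out = seed
    for _ in range(length):
        nxt = model.get(out[-order:])
        if not nxt:
            break          # context never seen in training: stop
        out += nxt[0]
    return out

corpus = "The quick brown fox jumps over the lazy dog near the river bank."
m = train(corpus)
print(generate(m, corpus[:8]))   # reproduces the training sentence
```

Real models are vastly larger and blur their inputs more, but the
extraction papers show the recall effect survives at scale.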

>A small comment:
>
>> ⅱ. Any existing package with a “model” inside that already had the
>>    very same model before 2020-01-01 has an extra four years time
>>    before bugs regarding these models may become release-critical.
>
>Why 2020-01-01?  Couldn't we be generous here and say that if someone
>was in the initial Bookworm release then it is eligible for this
>exception?

At least DALL-E predates bookworm, from a quick search. I’m not
entirely sure when this craze began. If you’d rather have it
expressed in terms of releases I’d choose bullseye here instead,
to be sure.

But, okay, let’s be generous here to cut discussion a bit.


Counter-Proposal -- Interpretation of DFSG on (AI) Models (v3)
==============================================================

Please see the original proposal for background on this.

The counter-proposal is as follows:

The Debian project requires the same level of freedom for AI models
as it does for other works entering the archive.

Notably:

1. A model must be trained only from legally obtained and used works,
   honour all licences of the works used in training, and be licenced
   under a suitable licence itself that allows distribution, or it is
   not even acceptable for non-free. This includes an understanding
   that “generative AI” outputs are derivative works of their inputs
   (including training data and the prompt), insofar as these pass the
   threshold of originality; that is, generative AI acts like lossy
   compression followed by decompression, or like a compiler.

   Any work resulting from generative use of a model can at most be
   as free as the model itself; e.g. programming with the assistance
   of a model from contrib/non-free prevents the result from entering
   main.

   The "/usr/share/doc/PACKAGE/copyright" file must include copyright
   notices from all training inputs as required by Policy for “any
   files which are compiled into the object code shipped in the binary
   package”, except for inputs already separately packaged (such as
   the training software, libraries, or inputs already available from
   packages such as word lists also used for spellchecking).

   Regarding availability of sources used for training, the normal
   rules of the non-free archive apply.
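
To illustrate what item 1 asks of maintainers, a machine-readable
debian/copyright stanza for a packaged model might look like this
(package, file, and corpus names here are entirely invented):

```
Format: https://www.debian.org/doc/packaging-manuals/copyright-format/1.0/
Upstream-Name: example-model

Files: models/example-model.bin
Copyright: 2021 Example Corpus Authors
License: CC-BY-SA-4.0
Comment: Copyright notices of all training inputs that are not
 already separately packaged are collected in this stanza, as
 item 1 requires.
```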

2. For a model to enter the contrib archive, it may at runtime require
   components from outside of Debian main (such as drivers for specific
   hardware it is designed to run on), but the model itself (including
   any training input that ends up in the model) must still comply with
   the DFSG, i.e. follow the requirements below for models entering
   main.
   If a model requires a component outside of main at build or training
   time that changes the model itself (e.g. training data, or training
   software part of which ends up in the trained model), it is only
   admissible to non-free.

3. For a model to enter the main archive, all works used in training
   must additionally be available, auditable, and under DFSG-compliant
   licencing. All software used to do the training must be available
   in Debian main.

   If the training happens during package build, the sources must be
   present in Debian packages or in the model’s source package; if
   the training happens elsewhere, the sources must nevertheless be
   available in the same way.

   This is the same rule as is used for other precompiled works in
   Debian packages that are not regenerated during build: it must be
   possible to regenerate them using only Debian tools; waiving the
   requirement to actually do the regeneration during package build
   is a nod to realistic build times and resource usage.

4. For a model to enter the main archive, the model training itself
   must *either* happen during package build (which, for models of
   a certain size, may need special infrastructure; the handling of
   this is outside of the scope of this resolution), *or* the model
   resulting from training must build in a sufficiently reproducible
   way that a separate rebuilding effort from the same source will
   result in the same trained model. (This includes using reproducible
   seeds for PRNGs used, etc.)

   For realistic achievability of this goal, the reproducibility
   requirement is relaxed to not require bitwise equality, as long
   as the resulting model is effectively identical. (As a comparison,
   for C programs this would be equivalent to allowing the linking
   order of the object files in the binary to differ, embedded
   timestamps to differ, or a different encoding of the same opcodes
   to be used (like 31 C0 vs. 33 C0 for i386 “xor eax,eax”), but no
   functional changes as determined by experts in the field.)
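
As an illustration of item 4 (my own sketch, not part of the proposal
text): pinning every source of randomness to an explicit seed makes a
toy training run repeat exactly, and the relaxed comparison checks the
resulting parameters within a tolerance rather than bitwise.

```python
import random

def train_model(seed, steps=200):
    """Fit y = w*x + b on noiseless toy data by SGD, with every
    source of randomness pinned to one explicitly seeded PRNG."""
    rng = random.Random(seed)                    # reproducible PRNG
    data = [(x, 2.0 * x + 1.0) for x in range(10)]
    w, b = rng.uniform(-1, 1), rng.uniform(-1, 1)   # seeded init
    for _ in range(steps):
        rng.shuffle(data)                        # seeded shuffle, no global state
        for x, y in data:
            err = (w * x + b) - y
            w -= 0.01 * err * x
            b -= 0.01 * err
    return w, b

def effectively_identical(m1, m2, tol=1e-9):
    """Relaxed equality: parameters agree within a tolerance instead
    of demanding bitwise-equal model files."""
    return all(abs(a - b) <= tol for a, b in zip(m1, m2))

# A separate rebuild from the same source (seed) matches the original:
print(effectively_identical(train_model(42), train_model(42)))
```

For real models the tolerance check would be done on the stored
weights (or on model behaviour), as judged by experts in the field.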

5. For handling of any large packages resulting from this, the normal
   processes are followed (such as discussing in advance with the
   relevant teams, ensuring mirrors are not over-burdened, etc).

The Debian project asks that training sources not be obtained
unethically, and that the ecological impact of training and using
AI models be considered.

Transitional provisions:

ⅰ. Any bugs resulting from this GR shall not be release-critical
   before Debian trixie has been released as stable.

ⅱ. Any existing package with a “model” inside that already had the very
   same model in the initial bookworm release has an extra four years
   time before bugs regarding these models may become release-critical.

[End of proposal.]


Thanks for the discussion,
//mirabilos
- -- 
<cnuke> filing down the AGP connector so it fits into the slot on the 440BX
board… or power supplies that the monitor was also plugged into and that then
made for an electrically charged case […] good for a laugh at every
LAN party │ <nvb> back when pizza dough still “rose” on top of the monitor
-----BEGIN PGP SIGNATURE-----

iQIcBAEBCQAGBQJoHozPAAoJEHa1NLLpkAfgr+kQAMHhGac5ieY+8h0yYGUW5dpR
0B6e5d0JsQyaE9wmqBVej+dnGkts7Jtz5T42e2t0AEiXpgNYfLvWUFX6nAjwpDJW
reuvZRzynd2IYVxnadP0J/gX35R8ldqD8VXZFIs0McNsl5pmqxJRioYkB3lRXDjh
McDZwc4LqR3ey6cW6ay7a7NG+ak8N5QAGmSF3y4fYDLVDKZxW73gJqrq81HOBJpp
I76CL+JEpipEQ/AZHB/gdD/ldnc2EdtiHIOn7IpuFLKcgN6LJW9mpIDJ/IcWX2jE
ZZ4lLcGdzhdZb4MovEisSmomkO6VMb+Qs22R34KkMlaOTA9ne5cmUPG3eJDHR7oP
bTTb06AV7C0Mqtn7X0Am/x8R2suFqOu437RXmI+VA4NqMVQmWhOMvY4cQoTMr3h1
VWY/JtxwzcIhZE0WyC9Y6htX4AyGX23aNgCVVuo93w/Kq80e/57fjtuVAlUIesCN
EwyQjou4RYUS/R6mVVeU8FopRR/BlYnu8kb6nBzm6o8oQBOQVZ0eLNIe7Yq2IEv7
Mm1fFpRH8oQfYvbVjx6DCllIshMegCorYd6dBClYIeN+ItbxTWSWln7STHuqfSHi
mhVL05whqvBHdNiAiHtz5Mlc72gp65R6Fi0aH9jxErQ8iNIO9e33mCABnxPvONC7
kDHW9ujNAmPqLlJJIrv1
=RYP4
-----END PGP SIGNATURE-----
