On Thu, 15 May 2025 at 10:06, Stefano Zacchiroli <z...@debian.org> wrote:
>
> On Wed, May 14, 2025 at 11:38:02PM +0200, Aigars Mahinovs wrote:
> > You would *actually* technically, in reality, prefer digging through
> > gigabytes of text files and do some kind of manual modifications in
> > that sea of raw data? Modifications that are basically impossible to
> > track in any kind of change tracker. That are excessively hard and
> > time consuming to actually do and check. Instead of just adjusting
> > input parameters on the ingest script? *That* is what I consider to be
> > frankly very hard to believe.
>
> Aigars, I'm sympathetic to your general stance in this debate, but I
> think you push it too far, in the following sense.
>
> It is undeniable that *some* modifications of a trained ML models are
> possible starting directly from the model weights. I also personally
> agree that, at least for big models, *most* modifications (counted in
> terms of use cases and/or users actually doing them) will happen
> starting from model weights via techniques like fine tuning.
>
> But I don't think it is disputable that the *most general* way of
> modifying an ML model is achievable only starting from the full training
> dataset and pipeline. There are simply things that you cannot do
> starting from the trained model.
This is not quite the point I was trying to make in this specific thread. I
was pointing out the difference between the raw blob of training data and the
pipeline that creates/gathers that raw blob of training data. The position I
am arguing for here is that *the pipeline* is the *actual* source code, and
the raw training data is an intermediate build artifact. The assumption is
that the pipeline is robust enough to be re-run, by either Debian or third
parties, to gather the raw training data from non-Debian data sources.

For me, in a typical ML project, modifying raw training data after it has
been gathered and processed by the automatic ingest pipeline is akin to
manually tweaking the intermediate source code that Bison generated instead
of modifying the actual grammar source. Or to insisting that the output of
autotools *has to* always, legally, be part of the package source tarball.

I don't think it is a great idea (technically speaking) to have in Debian an
AI model whose declared ingest pipeline actually goes out and crawls millions
of webpages. And we would be well positioned to say that links to, for
example, oxylabs.io data sets do not fit our needs (you need a ~1k/month
license to download their datasets). But I do think that it should be
perfectly fine to have an ingest pipeline that simply downloads
https://data.commoncrawl.org/crawl-data/CC-MAIN-2025-18/warc.paths.gz
for example. WARNING - that link leads to roughly 99 TiB of compressed data
that decompresses to 468 TiB of content!

That is the kind of OSI-free AI model that I am arguing for - the source data
is clearly identified and is generally available, *but* it is not fit to be
redistributed by Debian, its mirrors, or its derivative distributions.

--
Best regards,
    Aigars Mahinovs
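
P.S. To make the "pipeline is the actual source" point concrete, here is a
minimal, hypothetical sketch (in Python, untested) of what such an ingest
step could look like: it fetches the warc.paths.gz index for that crawl and
mirrors a couple of the listed WARC segments. The output directory, segment
limit and chunk size are made up for illustration; a real pipeline would of
course need retries, checksumming, filtering and the rest of the plumbing.

#!/usr/bin/env python3
# Hypothetical ingest step: fetch the list of WARC segments for one
# Common Crawl crawl and mirror a few of them locally.  Each listed
# segment is on the order of a gigabyte, and the full CC-MAIN-2025-18
# crawl is ~99 TiB compressed, so the example is deliberately capped.

import gzip
import io
import pathlib
import urllib.request

CRAWL = "CC-MAIN-2025-18"
BASE = "https://data.commoncrawl.org/"
PATHS_URL = BASE + "crawl-data/" + CRAWL + "/warc.paths.gz"
OUT = pathlib.Path("ingest")      # assumed local scratch directory
MAX_SEGMENTS = 2                  # keep the example tiny

def warc_paths():
    """Yield the relative WARC paths listed in the crawl's warc.paths.gz index."""
    with urllib.request.urlopen(PATHS_URL) as resp:
        with gzip.open(io.BytesIO(resp.read()), mode="rt") as index:
            for line in index:
                yield line.strip()

def fetch(rel_path):
    """Download one WARC segment into the local ingest directory."""
    OUT.mkdir(parents=True, exist_ok=True)
    target = OUT / pathlib.Path(rel_path).name
    with urllib.request.urlopen(BASE + rel_path) as resp, open(target, "wb") as out:
        while chunk := resp.read(1 << 20):   # stream in 1 MiB chunks
            out.write(chunk)
    return target

if __name__ == "__main__":
    for i, path in enumerate(warc_paths()):
        if i >= MAX_SEGMENTS:
            break
        print("fetched", fetch(path))

The point being: a script like this (plus the training code proper) is what
gets reviewed, versioned and patched - not the terabytes it pulls in.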