On Thu, 15 May 2025 at 10:06, Stefano Zacchiroli <z...@debian.org> wrote:
>
> On Wed, May 14, 2025 at 11:38:02PM +0200, Aigars Mahinovs wrote:
> > You would *actually* technically, in reality, prefer digging through
> > gigabytes of text files and do some kind of manual modifications in
> > that sea of raw data? Modifications that are basically impossible to
> > track in any kind of change tracker. That are excessively hard and
> > time consuming to actually do and check. Instead of just adjusting
> > input parameters on the ingest script? *That* is what I consider to be
> > frankly very hard to believe.
>
> Aigars, I'm sympathetic to your general stance in this debate, but I
> think you push it too far, in the following sense.
>
> It is undeniable that *some* modifications of a trained ML model are
> possible starting directly from the model weights. I also personally
> agree that, at least for big models, *most* modifications (counted in
> terms of use cases and/or users actually doing them) will happen
> starting from model weights via techniques like fine tuning.
>
> But I don't think it is disputable that the *most general* way of
> modifying an ML model is achievable only starting from the full training
> dataset and pipeline. There are simply things that you cannot do
> starting from the trained model.

This is not quite the point I was trying to make in this specific
thread. I was pointing out the difference between the raw blob of
training data and the pipeline that creates/gathers that raw blob.

The opinion I am trying to argue for here is that *the pipeline* is
the *actual* source code, and the raw training data is an intermediate
build artifact. The assumption is that the pipeline is robust enough
to be re-run, by either Debian or third parties, to gather the raw
training data from non-Debian data sources.

For me, in a typical ML project, modifying the raw training data after
it has been gathered and processed by the automatic ingest pipeline is
akin to manually tweaking the intermediate code that Bison generated
instead of modifying the actual grammar source. Or to insisting that
the output of autotools always *has to*, legally, be part of the
package source tarball.

I don't think it is a great idea (technically speaking) to have in
Debian an AI model whose declared ingest pipeline actually goes out
and crawls millions of webpages.
And we would be well positioned to say that links to, for example,
oxylabs.io datasets do not fit our needs (their datasets require a
~1k/month license to download).

But I do think that it should be perfectly fine to have an ingest
pipeline that simply downloads, for example:
https://data.commoncrawl.org/crawl-data/CC-MAIN-2025-18/warc.paths.gz
WARNING - the data behind that link is a ~99 TiB download that
decompresses to 468 TiB of uncompressed content!
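
To make that concrete, here is a minimal sketch (not code from any real
package) of what such an ingest pipeline could look like. The index URL
is the one above; the output directory, the two-segment limit and the
chunk size are purely illustrative assumptions to keep the example small:

#!/usr/bin/env python3
# Minimal sketch of an ingest pipeline of the kind described above:
# fetch the Common Crawl path index, then stream individual WARC
# segments into a local directory. The index URL comes from this mail;
# everything else (output dir, segment limit, chunk size) is an
# illustrative assumption.
import gzip
import pathlib
import urllib.request

INDEX_URL = "https://data.commoncrawl.org/crawl-data/CC-MAIN-2025-18/warc.paths.gz"
BASE_URL = "https://data.commoncrawl.org/"
OUT_DIR = pathlib.Path("raw-training-data")  # the intermediate build artifact
MAX_SEGMENTS = 2  # the full listing is ~99 TiB - keep the demo tiny

def main() -> None:
    OUT_DIR.mkdir(exist_ok=True)
    # warc.paths.gz itself is small: a gzipped list of WARC file paths.
    with urllib.request.urlopen(INDEX_URL) as resp:
        paths = gzip.decompress(resp.read()).decode().splitlines()
    for path in paths[:MAX_SEGMENTS]:
        target = OUT_DIR / pathlib.Path(path).name
        print(f"fetching {path} -> {target}")
        with urllib.request.urlopen(BASE_URL + path) as seg, open(target, "wb") as out:
            while chunk := seg.read(1 << 20):  # stream in 1 MiB chunks
                out.write(chunk)

if __name__ == "__main__":
    main()

The point being: this page of reviewable code is what I would treat as
the source, while the terabytes it drops into raw-training-data/ are a
build artifact.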

That is the kind of OSI-free AI model that I am arguing for - the
source data is clearly identified and generally available, *but* it is
not fit to be redistributed by Debian, its mirrors, or its derivative
distributions.
-- 
Best regards,
    Aigars Mahinovs
