Let's consider a machine learning system consisting of two parts:

1. Training program. It takes a dataset and produces a trained model. A
trained model is usually stored as a few serialized arrays of floating
point numbers.
2. Inference program. This one takes a pre-trained model and some data as
input and produces some output based on them.

Both training and inference programs are free software licensed under GPL.

Let's also suppose that the there is a model that is a result of execution
of the training program taking as input some publicly available dataset.
This dataset (for example a set of labeled photographs or natural texts) is
permitted by the publisher to be unlimitedly used for training machine
learning models, and usage of the trained models is not restricted by the
dataset publisher.

So the entire system is distributed as: training program with sources,
inference program with sources, training dataset, trained model.

Someone could take this system, modify the training program, and train a
new model on the same dataset. Then he or she could publish only inference
program with sources, the unmodified training dataset, and the new trained
model. Because the end user doesn't need the modified training program to
run the inference program with the new model, it is not distributed,
because technically the only user of the modified training program is those
who trained a new model using it, so GPL doesn't require to distribute it.

However, in this case freedom of users of the distributed system (inference
program and the new model) is violated because they can't retrain the model
on new data or improve the training code and retrain it on the same data to
improve performance of the model.

My question is how is it possible to protect users' freedom by making
everyone who distributes a trained model to distribute also sources of a
training program that was used to train the model, and instructions for
obtaining the training dataset?

Could the problem be solved by GPL, or GPL is not enough for this case? If
not, is there a license that provides the required guarantees? I'd like to
note that by the definition of the problem the dataset is published by a
third party, and while it could be used without restrictions for any
machine learning task, it couldn't be relicensed, and anyway the protection
of freedom of obtaining modified sources of the training program should be
preserved even if some other dataset is used for training of a new model
instead of the original one.

P. S. It's an interesting question can model weights be considered to be
software or not. While some machine learning models could in theory contain
arbitrary logic, such as neural Turing machines (
https://arxiv.org/abs/1410.5401), others, such as convolutional neural
networks (https://en.wikipedia.org/wiki/Convolutional_neural_network) are
more limited in their capabilities, but in practice are very expressive,
and others, for example, logistic regression, are much more limited. It is
desirable to have a way to protect users' freedom without regard to
complexity of a particular model.

Reply via email to