Hi, I am an Italian student of Computer Science (AI, in particular). Today the FSF published the selected whitepapers on GitHub Copilot (OpenAI Codex), but it has not yet taken a position of its own on the topic.
I haven't submitted a whitepaper myself, but, as others will, I want to publish my views on the topic. These are the views of a non-lawyer and are not legal advice. To summarize:

- Training on a dataset does not require a copyright license from the copyright holders of the dataset, or of the entries of the dataset.
- A trained model is not copyrightable.
- A pre-trained model, in any form, and without any additional data or software, is its own source code.
- The output of a machine learning model is generally in the public domain, except when it contains significant portions of its input.
- Any other view would be harmful to the free software community.

An article which is not among the whitepapers, but which is just as interesting, and which I expressly endorse, is [1].

Here is my full-length opinion on the subject. I will assume the reader is already familiar with GitHub Copilot and OpenAI Codex, so I will not explain what they are or what they do.

Before discussing law and ethics, it's important to clarify what a neural network is, essentially, designed to do. A neural network "bends space": its input represents a point, the output does too, and the inner computations map one space onto another. (The input and output could each represent more than one point.) The parameters of the neural network are trained on a learning dataset, which samples some distribution. The parameters are chosen so that the behaviour of the network reflects that distribution: if each individual of the overall population is an input-output pair then, given an input, the network will return the expected value of the output, or the most likely output for that input, or the probability of each possible output, or something of the sort, depending on the kind of network and how it was trained.
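As a toy illustration (a minimal sketch in plain NumPy, with a made-up linear "network" and an invented dataset, nothing resembling Codex): the parameters are fitted so that the model reflects the sampled distribution, not so that they store the samples.

```python
import numpy as np

# Made-up example: samples drawn from the population y = 2x + noise.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(1000, 1))           # inputs: points in input space
y = 2.0 * x + rng.normal(0, 0.1, size=x.shape)   # noisy samples of the population

w, b = 0.0, 0.0                                  # the trainable parameters
for _ in range(500):                             # plain gradient descent
    pred = w * x + b                             # the "network's" output
    w -= 0.1 * 2 * np.mean((pred - y) * x)       # gradient of mean squared error
    b -= 0.1 * 2 * np.mean(pred - y)

# w ends up close to 2 and b close to 0: the population's trend, not its samples.
print(w, b)
```

After training, the two parameters describe the distribution the samples came from; no individual sample can be read back out of them.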
If the individuals in the overall population are in some other form, the purpose of the neural network might instead be, given some noise with a known distribution, to output an individual of the population with a probability that reflects the distribution. It's apparent, then, that the purpose of the parameters isn't to encode the training dataset, but rather to represent the information about the population that can be induced from the samples in the dataset. The parameters may end up containing more information about specific samples, but that isn't the intention: it is accidental, and due to the imperfections of current technology or to the low quality of the dataset itself.

In the EU there are specific copyright exceptions (Directive 2019/790, Articles 3 and 4) which, although unfortunately limited, provide a legal framework for training neural networks. In the US, I argue that training neural networks is fair use; OpenAI has argued the same in [2]. The purpose of copyright is to "promote the progress of science and useful arts", and the general idea is that allowing authors to reserve some rights creates an incentive for creating works. But training neural networks requires an extremely large number of works, and there is no business model in preventing individual works from being used as part of a training dataset: no author creates a work in order to be paid by those wishing to include it in a training dataset. All four main factors in evaluating fair use are clearly on the side of allowing the training of neural networks, meaning that a copyright license to do so is not needed in the US.

A neural network well trained on a good dataset should be effectively independent of any individual entry of the dataset, since one entry doesn't change the distribution in any significant way.
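That independence claim can be illustrated with a deliberately trivial "model" of my own (a hedged sketch, where the only fitted parameter is a sample mean): dropping any single entry from a large dataset barely moves the result.

```python
import numpy as np

# Made-up dataset of 100,000 samples from a normal distribution.
rng = np.random.default_rng(1)
data = rng.normal(5.0, 2.0, size=100_000)

full_fit = data.mean()          # "trained" on the whole dataset
held_out_fit = data[1:].mean()  # "trained" with one entry removed

# The single entry's influence on the fitted parameter is negligible.
influence = abs(full_fit - held_out_fit)
print(influence < 1e-3)  # → True
```

The same intuition is what makes a differential-privacy-style criterion (below) conceivable at all: well-generalized parameters should look almost the same whether or not any one entry was in the dataset.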
The strictest possible criterion to determine whether distributing a trained neural network requires a license from the copyright holder of an entry in the dataset would be an approach similar to differential privacy. However, given that distributing such a model doesn't harm the copyright holder in any way (it wouldn't replace the original work), and that any failure to meet this criterion would be effectively accidental, I suggest that much laxer criteria should be used.

Not everything is copyrightable. Trained neural networks, unlike computer programs, are not literary works, and they are even further away from any other category of copyrightable works. What about database sui generis rights in the EU? Trained neural networks are not databases either: the parameters (which would be the individual data entries) would have to be meaningful on their own and qualify as "independent works", but that is clearly not the case. Indeed, trained neural networks are very different from the kinds of work which are *meant* to be copyrightable: all categories of copyrightable works are original and creative forms of expression. The parameters of a neural network, by contrast, are determined through a mostly automatic process and, while they do encode useful information, they merely act as the constants of a very large mathematical formula which determines the behaviour of the network. It's not just that individual neural networks are trained automatically; it's that their very nature is widely different from that of anything considered copyrightable. So, if patents aren't in the way, there should be no licensing issues when it comes to neural networks.

But what about source code? For something to qualify as "free software", or to be acceptable as a module of free software, source code must be available. What even is source code? According to the GPL, source code is "the preferred form of the work for making modifications to it".
Note that:

- It doesn't have to be the *original* form of the work. Usually it is, since programmers work from the beginning in the form that allows modification. But if I were to, for instance, write code on paper, scan it, run OCR on it, and then compile it, the source code of that program would be the text files before compilation, not the scanned images.
- Making modifications doesn't have to be *easy*. There is no kind of digital work which doesn't have a source code, no matter how hard modifying it is, because source code is simply one of the forms in which the same work can be provided.

Now, in the case of a trained neural network, what is the source code? Neural networks are widely criticized for being "black boxes". I will not go into details, but I will say this is true to some extent: the "meaning" of each individual parameter is not known, modification isn't easy, and sometimes we don't fully know why certain techniques work. This has raised questions about what the source code of a trained neural network is. Note, though, that software always exists in a form which is relatively easy to modify; for neural networks, it's no one's fault that this is harder: it's just what neural networks are. The training dataset and the training code are not part of the network's source code, because they are not part of the trained neural network at all, regardless of the form in which it is provided. And, unlike software, and unlike many other works, the parameters themselves are practically the same thing regardless of the form in which they are provided, and can be converted from one format to another. Therefore, a trained neural network is its own source code if it is provided in a free format.

But what about the output of a neural network? Often it is non-copyrightable information, but what about when it's images or text?
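Before moving on: the format-independence point above can be sketched concretely (a minimal example with made-up weights; .npy and JSON simply stand in for whatever formats a real model might use):

```python
import io
import json
import numpy as np

# The same (made-up) parameters round-trip between a binary format (.npy)
# and a plain-text one (JSON) without the work itself changing.
weights = np.array([[0.25, -1.5], [3.0, 0.125]])

binary = io.BytesIO()
np.save(binary, weights)                        # serialize to the binary .npy format
binary.seek(0)
as_text = json.dumps(np.load(binary).tolist())  # convert to a text format
restored = np.array(json.loads(as_text))        # and back to an array

print(np.array_equal(weights, restored))        # → True
```

Whatever the container, the parameters are the same numbers; that is why the form they ship in, as long as it is a free format, is already the preferred form for modification.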
Unfortunately, in the UK (Copyright, Designs and Patents Act 1988, Section 9(3)) such works are copyrightable by "the person by whom the arrangements necessary for the creation of the work are undertaken". This is utterly unreasonable, as it would be for works made by animals (Naruto v. Slater, No. 16-15469 [3]). Luckily, however, this is not the case in most of the world [4], and it isn't the case in the US [5].

Sometimes, however, the output of a neural network will contain significant portions of its input: in those cases, it clearly constitutes a modification, and thus a derivative, of the original work.

There is, however, a point I haven't mentioned yet. What if the trained neural network contains so much information about individual entries of the dataset that it will actually generate significant portions of those entries? In that case it isn't completely unreasonable to argue that the generated works are derivatives of those entries and, even, that distributing copies of the parameters is effectively also a form of distribution of the works themselves. The latter argument, however, should take into account that this effect is purely accidental, and simply due to a lack of generalization by the algorithms. It's akin to taking a selfie on the street when a poster happens to be in the background: the mere fact that a significant portion of the poster could be extracted from the photograph doesn't make the photograph a derivative work of the poster, if the poster plays no significant role in the photograph itself, which, thus, is not based on the poster.

Training neural networks is and should be a legal activity, and it is an ethical one. It doesn't hurt copyright holders and is fully compatible with the framework of free software. And while copyright law should be changed and better adapted to this task, the task is not incompatible with current copyright law. Companies such as Microsoft, Google and OpenAI are heavily involved in neural networks.
But the mere fact that laws hampering the field would hamper those companies doesn't mean such laws wouldn't hamper the free software community as well, much as software patents harm free and proprietary software developers alike. This is a new field, one which may be crippled by copyright law or which may become essentially free from it. I believe it is not the job of the FSF or of the free software community to make sure that the reach of copyright law extends beyond its current limits.

If the FSF were to declare the training of neural networks incompatible with free software (for instance because of the "source code" problem, which I addressed above), this would create an unprecedented schism within the free software community, and it would exclude the community from a large, growing and promising field. Not only that: it would be the wrong decision. If the FSF were to argue that training neural networks infringes copyright, it would be supporting an extremely broad interpretation of copyright law, one which doesn't help anyone. And even in the case of Copilot, consider that GitHub doesn't just host free software. It hosts software, generally, in source code form. A lot of it is non-free. And a lot of works in general, software and non-software, are non-free: that is the default, not the exception. We do not need to "protect" them to an even more unreasonable extent, one which doesn't even help their authors anyway.

There *are* problems in the task of training neural networks. The biggest is that the drivers and firmware for the most powerful GPUs are non-free, and computational power is essential for the job. The FSF needs to endorse free trained neural networks, available to all. Recently, GPT-NeoX-20B was released by EleutherAI; before that, we got GPT-J-6B.
They have 20 billion and 6 billion parameters respectively; they are provided in free formats and, in case they turn out to be copyrightable, under a free software license. The problem of non-free drivers and firmware for GPUs is the biggest one, and the hardest to solve. I don't have a strategy for solving it, but smarter people might. It's important not to give up, however, and not to throw the whole field under the bus because of it.

[1] https://felixreda.eu/2021/07/github-copilot-is-not-infringing-your-copyright/
[2] https://www.uspto.gov/sites/default/files/documents/OpenAI_RFC-84-FR-58141.pdf
[3] http://cdn.ca9.uscourts.gov/datastore/opinions/2018/04/23/16-15469.pdf
[4] https://www.leexe.it/en/magazine/artificial-intelligence-computer-generated-works-and-dispersed-authorship-spectres-are-haunting-copyright
[5] https://www.copyright.gov/rulings-filings/review-board/docs/a-recent-entrance-to-paradise.pdf

_______________________________________________
libreplanet-discuss mailing list
[email protected]
https://lists.libreplanet.org/mailman/listinfo/libreplanet-discuss
