"Roberto A. Foglietta" <roberto.foglie...@gmail.com> writes:
> A totally automatic procedure like web crawling and web indexing
> re-enter in your example, perfectly. However, the input collection that
> a ML/AI training system needs is a protectable work because the data
> should be structured, selected and properly labeled even if these
> activities are done with rules like it happens using SQL for
> databases.

Yes, I agree, I think that a trained AI model is a protectable work.
However, it is not protectable *by you* unless you're the one who wrote
the model and chose its training. Therefore, putting a clause in your
copyright license saying that if your work is incorporated into an AI
model, that AI model as a collection is covered by some particular
license is not really a thing you can do.

The best you can do is the standard GPL thing of saying that you don't
have to license your collection under any particular license, but if you
don't, you don't have any right to include this specific work. Maybe
that's what you were getting at, and I just didn't understand.

That second approach of course only works if the use of the GPL-covered
work is not fair use. If it is fair use, then the person creating the
collection can ignore any provision of the license, so we're back to the
question of whether AI training is fair use.

> So, web indexing and statistics are created over a input collections
> that are *not* a creative works and these tools access to every
> copyrighted works in fair use as long as they respect the robots:no
> meta-tag when it is applied to a copyrighted work. Instead, training a
> ML/AI is a completely another story and their input collections are a
> protectable collection under the copyright law.

I don't think it's anywhere near that easy to distinguish a web search
index from an AI training model in copyright law. They seem like very
similar cases to me.
A great deal of creativity and human control go into selecting how pages
are chosen for search indices (otherwise, every search engine would be
unusable due to search optimization spam), and search engines even
retain and redistribute portions of the documents they index. My guess
is that *both* of these are protectable collections.

And the entire Internet currently assumes that building a search engine
is fair use of the Internet-accessible indexed documents, even if that
search engine is then used and marketed for commercial and business
purposes, as Google, Bing, etc. all are.

If you believe that AI training is *not* fair use, I think you're going
to have to wrestle with the substantial similarities between AI training
and the Google search engine. I think it may prove challenging to write
an analysis that says AI training is not fair use, but Google's search
indexing is fair use. Or, I guess, argue that Google's search indexing
is also not fair use but falls into some other exception to copyright
law like an implicit license, but there I'm *way* out of the depth of my
legal understanding.

-- 
Russ Allbery (r...@debian.org) <https://www.eyrie.org/~eagle/>