On 3/27/26 9:27 AM, Luca Toniolo wrote:
Copilot doing statistical analysis on publicly available GPL code is, if
anything, less than what the GPL already explicitly permits.
Yes, as long as you abide by the license.
But LLMs do much more than just statistical analysis. LLMs generate output from
the training set and people are encouraged to use that output.
The problem is that LLMs are known to reproduce their training data. They
reproduce learned code and strip the GPL license from that code. That is the
real problem.
That we can't prevent these corporations from scraping and doing this is simply
how the Internet works. However, the fact that they did it does not make it
right or their use legal.
Mailing list archives have been indexed by Google, crawled by the Wayback
Machine, scraped by researchers, and read by recruiters for as long as they've
existed. Our commit messages, review comments, and design discussions have been
public and searchable for years. That was true before Copilot, and it would
remain true if we moved to GitLab, Codeberg, or a self-hosted Gitea instance
tomorrow. None of these platforms prevent scraping.
It is not only about what is publicly visible on the site(s). It is about use
and process: how you do things.
The information that is available *inside* github about you and what you are
doing is far more extensive than what can be viewed from the public record.
The announcement from github makes, in principle, any and all data subject to
input into their LLMs. That I cannot accept, and I will seriously consider my
options.
GPL enforcement, even in clear-cut cases of actual license violation, has
historically been rare and difficult. The FSF and SFLC have pursued only the
most egregious cases, and even those took years. LinuxCNC itself has never
enforced the GPL against anyone.
The non-enforcement of copyright violations does _not_ make it alright to
become an infringer or to condone copyright infringement. Besides, the cases
that were enforced were victories for the GPL and made many an infringer think
twice or back off.
That is not to say that there aren't many uncaught infringers. There are, and
we should all discourage that wherever and however we can.
The idea of taking drastic action over something that may not even
constitute a violation seems disproportionate.
That is unsettled case law.
However, the action is not just taken over copyrights. The action would also be
taken to prevent a commercial entity from exploiting internal insights they
acquire from us using the site.
Besides, it sends a strong message that their (github's) behaviour will result
in users changing their ways.
If we migrate off GitHub, what do we actually gain? We lose CI infrastructure
that works, we lose contributor familiarity, we lose discoverability for new
contributors, we lose issue and PR history, and we solve nothing, because the
code was already scraped, the mailing lists were already indexed,
We gain independence from a corporate entity controlling the infrastructure and
data we generate in development.
CI is not that difficult, but we'd need to rebuild. IMO a small price for what
we gain.
Commit history is in git. We can extract issues and PR data. You know, scrape
it? ;-)
Discoverability, hm... Use a search engine on the Internet: find linuxcnc.org
-> link to development. How difficult is that? Not that we've been very active
at promoting ourselves in the past 20 years or so...
and the next platform will face the same reality.
The next platform will not necessarily have that same reality. That is why
Codeberg is such a good option: they are a non-profit with an outspoken goal to
support and further FOSS
(https://docs.codeberg.org/getting-started/what-is-codeberg/).
--
Greetings Bertho
(disclaimers are disclaimed)