On Fri Feb 20, 2026 at 3:48 PM GMT, Theodore Tso wrote:
> In practice an LLM is trained on a very large corpus of information, including things that are not code, so that it can understand a plain English text prompt. This includes, but is not limited to, mailing list archives, where the software licensing of code fragments is not necessarily going to be clear.

That's a very good point. And we don't have a clear license for our own mailing list content, nor for some of our other 'large' corpora, e.g. the wiki. However, large public-domain collections of text in (at least) English are widely available. (And I see that much of Wikipedia is CC BY-SA now, rather than GFDL.)

I tentatively believe (without really having robust evidence) that there is sufficient material to train a DFSG-free LLM of some scale that produces some level of useful output. Clearly, it would be less efficient than one trained on a larger corpus with copyright ignored.

However this remains a thought experiment for me, to explore some of the moral issues, rather than a practical plan.

> I would also suggest that we are holding LLMs to a much higher
> standard than we are for human beings.

Yes, but LLMs (and machines in general) are not equivalent to humans (or even nearly so), and so we *should* hold them to different standards. Not "higher": orthogonal. They are tools. They are not remotely close to conscious. I think it's a fallacy to compare them in this sense at all.


Best wishes,

--
⢀⣴⠾⠻⢶⣦⠀
⣾⠁⢠⠒⠀⣿⡁ Jonathan Dowland
⢿⡄⠘⠷⠚⠋⠀ https://jmtd.net
⠈⠳⣄⠀⠀⠀⠀
