(I wrote this markdown style, and I'm too lazy to convert it to text)

# The infamous LLM discussion
So, I'm starting this discussion publicly because a heated discussion started privately, and this is no private topic. The discussion started because of the DFSG team's new NEW queue website, which has been developed (to an extent I don't personally know) with the assistance of an agentic coding tool. I'd like to summarize where we all collectively are, where Debian currently is, and the different pros/cons/arguments I have read and heard over the past two years. This obviously won't be exhaustive; it's a starting point.

This is not an opinionated post. I am in an uncomfortable cognitive dissonance on the matter, so it's rather a snapshot of my brain on the topic. To be frank, I personally don't know where I stand. I think I'm neither for nor against AI-generated code, but I am aware that, currently, it's not possible to give a simple and trivial ruling. If some specific questions worth an answer are asked, I'll reply, but otherwise I fully intend not to post again after this mail. The topic and its ethical/sociological/technological ramifications are exhausting, and I'd rather spend my time doing fun stuff.

I might at some point in this text (I'm writing it linearly, so I don't know how long it will take to write or what the end will look like) offer an idea of a policy on the matter. But don't expect me to say whether it's a good idea or not.

I do ask everybody, DDs, DMs, DCs, bystanders, to refrain from flaming. I know this wish has little chance of success, but at least I will have tried.

## Kind of an intro

TL;DR: AI exists and is used everywhere already, and now it hits the project, some are for, some are against; you can go directly to "The brainstorm of pros and…"

### AI

AI stands for Artificial Intelligence, which means pretty much everything and nothing. A properly trained Bayesian filter is AI, your 0ad virtual opponent is AI, a fine-tuned chess algorithm is AI, and an LLM is AI. For most of us, AI means something that mimics intelligence without being intelligent itself. But what is "intelligence"? Well, a nice definition I read in a dictionary is **the ability to know, learn, understand and adapt easily**. It's vague, but from there one can expand and explain that intelligence can be "gathering and interconnecting facts efficiently", "the ability to deduce from partial information", etc. We all feel that we understand what intelligence is, and that it should only be applicable to humans and animals, but the truth is, if intelligence matches the definition I wrote above, then "artificial intelligence" fits the term.

This is the first source of friction. We all have our own view on what AI is, and in a room of 100 people, we could potentially get 100 different definitions that clash on some aspects. One thing on which we might all agree is that AI didn't start with the release of GPT (the model) and ChatGPT (the tool), and won't stop there. Another thing we will all have to accept, whether we like it or not, is that AI won't disappear from our lives.

### Where the world is right now

Here, I'd like to emphasise that this is my view of the current situation. I'm neither an economist nor an expert, I have no shares in any company, I'm not shorting nVidia, and to be fair, on these aspects, I've chosen my path, which is taking things as they come and trying to sort the garbage from the good stuff.

The IT world has, every year since the beginning of the internet era, had a hype train around some technology or other.
We all remember when "cloud" was the buzzword, or when "cryptocurrency" or "NFT" became the next one. Some trends died out, some are still around. AI is the latest one, and it seems to be of the same order of magnitude as the cloud or the internet have been, maybe bigger. The main reason, and it's essential to acknowledge it, is that it made some activities far easier and less tedious and, strictly speaking, allowed many humans to focus on the things they prefer to do. It also allows people with a lot of creativity in some field, but without the expertise, to start trying to achieve actual things there (coding, video, music, etc). I've spent the past year seeing posts on LinkedIn and Twitter from people with no development skills who are happy to try either learning with an AI as a teacher, or vibe-coding SaaS apps. Whether we like the idea of newbies being able to cargo-cult apps or not, we can't deny that this created huge leverage for productivity.

As usual in these kinds of situations, some companies are trying to get money out of the hype, and for this, the CEOs and their friends oversell. This is currently where we are. Be it Jensen Huang, who really needs to sell more GPUs, Sam Altman, who is probably seeing the winds changing (hello Microsoft taking a step back[4]), Oracle, which will probably die if OpenAI falls, or Anthropic's CEO, who has repeatedly predicted over the past year that AI would handle all coding within the next 12 months[1][2][3] (in his defense, he's not the only one), everybody goes with their take. It makes them visible (hey, they need to sell), and it's also the American-style "fake it until you make it". Let's be frank, this is at least reckless, and probably dishonest. Speaking for myself, it raises concerns, but it also disgusts me. It makes the market volatile and unreadable, it destabilizes big chunks of the economy, it wrecks plenty of markets (hello RAM shortages, hello production shifting between the mass market and the AI-dedicated market, hello floating point precision reduction on the latest architectures, …).

And, even though the claims about water consumption are debatable (depending on how the datacenters are architected), there is no doubt that in some countries (eg the USA) it puts a strong strain on water supplies, not to mention the water needed to manufacture the chips. Furthermore, it creates a heavy drain on energy consumption. In countries with clean energy, the main bad effect is more stress on the grid, but in countries running on oil, natural gas or coal, this is potentially disastrous.

All in all, the picture is as usual with technology leaps: there are great outcomes and good opportunities, but also strong drawbacks. This makes the topic as much a political topic as all the previous big changes the world has faced over the last two centuries (the industrial era, tractors for agriculture, the Internet, etc). To those who oppose LLMs or coding agents on ecological grounds, I'd point out that Debian and many FOSS projects rely on the Internet being the way it is, which had and still has a very strong ecological impact, one they seem able to live with.
Going from this global picture, let's try to envision the current situation for Debian (this probably applies to FOSS in general).

[1] https://www.entrepreneur.com/business-news/anthropic-ceo-predicts-ai-will-take-over-coding-in-12-months/488533
[2] https://www.darioamodei.com/essay/the-adolescence-of-technology
[3] https://www.businessinsider.com/google-deepmind-anthropic-ceos-ai-junior-roles-hiring-davos-2026-1
[4] https://www.windowscentral.com/artificial-intelligence/microsoft-confirms-plan-to-ditch-openai-as-the-chatgpt-firm-continues-to-beg-big-tech-for-cash

### AI in Debian (/FOSS)

Let's not lie to ourselves. In the past two years, we have seen changes. Some people started discussions about AI, the discussions were not simple, and we saw that, as usual with such strong changes, reaching consensus is either impossible or at least not really easy. In parallel, some software we provide has probably seen changes written directly by a coding AI, and a lot of mails have been written or reviewed (or a bit of both) by an LLM.

In the areas of the project I'm involved in, we have had multiple DD applicants who sent LLM-generated content for their AM step. This usually had negative consequences on their application, but maybe some applicants were savvy enough to alter the text enough for it not to be visible. The main concern I have in this specific case is that they don't really learn and might resort to an LLM every time they have a question. There, the productivity for the project becomes catastrophic, because they will use far more resources than they would if they actually tried to learn and remember. This could be extrapolated to any other field. While AI tends to make people more productive, it seems to work only to the extent that those using it actually learn something.

In FOSS in general, we have seen enough cases (eg [5][6]) to know that we have probably already let AI-written code be committed, or bug reports be submitted without any real reading by an author who simply copy/pasted the output of an AI agent.

That being said, on a more personal note, I always write my mails myself, and I tend to go with the flow of my mind. When writing in English (not my mother tongue, so I make mistakes), or when writing loaded mails, I try to reread myself, but also to ask relatives to do the review, but sometimes I have nobody around. When I'm convinced an external review is needed, I tend to default to asking ChatGPT or Claude, provided the content contains no personal data and no strategic corporate data. I'm not very proud of it, but I'm not really ashamed either. In a perfect world, I'd like an inferential model tiny enough to run on my dedicated server, to minimize consumption and potential leaks, but so far my tests with these have not been really satisfying, and I haven't had enough time to tweak and test. Recently, I have tried to send all my mails without an AI review, but this specific text was checked by Claude for the English, to ensure I don't say something inconsistent with my intent. To be clear, I wrote all the paragraphs, and didn't use the LLM in any way other than "English checking and intent checking", but for some purists in the room, this might make my mail worthless.

Now, as I said above, we realize that some bits of the infra do at least contain parts of AI-generated code. We don't know to what extent that code has been reviewed/modified, and this necessarily creates frustration and legitimate questions.
Some in the project want to purely and simply forbid any project contributor from using any AI-generated content to achieve their work within the project (be it website coding, app design, "debian-dir" generation for packaging, translation, etc). Some, on the contrary, seem to consider that AI is real progress and will benefit all of us, and that, anyway, FOSS is dead without AI. I can't and won't quote these mails because they were sent privately. In the middle, some are rather concerned by legal or ethical aspects.

After this very long and nonetheless partial intro, I'd like to try summarizing the points that seemed, to me, relevant, whether against or in favour of AI-generated content.

[5] https://theshamblog.com/an-ai-agent-published-a-hit-piece-on-me/
[6] https://daniel.haxx.se/blog/2026/01/26/the-end-of-the-curl-bug-bounty

## The brainstorm of pros and cons when it comes to LLM and agentic coding

This part is, as I wrote, a brainstorm; each subpart will cover one of the different axes we need to work through before thinking about what we want to do. Sorry, it might be a bit messy. I tried to dig up some figures and source the things I state, but please take it with a pinch of salt: I'm no expert, and I didn't want to spend 12 unpaid hours on each topic, especially on an average of 6 hours of sleep per night since early December.

### The ecological aspect

As I mentioned, we know that AI comes with a big ecological cost, as did Bitcoin, the Internet, and the industrial era. But one can't use these as a shield to ignore the specific issues AI poses: whataboutism is not an argument *per se*.

#### Electricity

According to [7][8], AI accounts for between 10 and 20% of worldwide datacenter electricity consumption. This DC consumption is itself about 1.5 to 2% of global electricity consumption. It means that, worst case (20% of 2%), AI represents 0.4% of the world's electricity consumption. This is not huge, but it is big (as in more than 100 TWh, roughly the consumption of the Netherlands). And there is a huge discrepancy between countries/states in the world[10] (eg 21% of Ireland's electricity is eaten by DCs, and 26% in Virginia, US). The IEA predicts that DC consumption could double by 2030, and the MIT Technology Review estimates that in 2028, AI could eat more than 50% of DC electricity consumption[9]. On the pollution side, DC CO2 emissions could be as much as 1% of total CO2 emissions in 2030[10].

We could also mention the strain this puts on already ageing or limited physical grids, which creates the need for new infrastructure, etc (in the long term, this could become a problem if the electric grid doesn't follow AI demand, limiting what datacentres can do, or forcing public authorities to choose between different industries).

[7] https://www.allaboutai.com/resources/ai-statistics/ai-environment/
[8] https://www.iea.org/reports/energy-and-ai/energy-demand-from-ai
[9] https://www.technologyreview.com/2025/05/20/1116327/ai-energy-usage-climate-footprint-big-tech/
[10] https://www.carbonbrief.org/ai-five-charts-that-put-data-centre-energy-use-and-emissions-into-context/

#### Water

According to IEEE[11], in 2023 US DCs consumed 17.5 billion gallons of water, which is around 0.3% of the public water supply in the US; this doesn't account for electricity production, whose DC share adds a staggering 211 billion gallons (see also [12]). According to IEEE, these amounts could increase two- to four-fold by 2028[11].
In the US, most DCs use cooling towers, which involve water evaporation. In places where water is already a limited resource, this creates additional strain. Some DCs use closed-circuit cooling, which reduces the problem, but they still require some water to be taken from the environment.

[11] https://spectrum.ieee.org/ai-water-usage
[12] https://www.eesi.org/articles/view/data-centers-and-water-consumption

#### But hey, it's not just AI

As I wrote above, while AI is a significant chunk of digital consumption, it's not all of it, and as of today, the digital sector already uses between 3 and 5% of global electricity production, with current growth around 12%. AI is booming, but the problem was already there and would still be there even if AI were not. We can surely be worried that AI's chunk seems to be increasing, and will likely increase faster than the rest of digital consumption, but the problem is that digital technology structurally has a big ecological impact. How are we supposed to draw the line? Is publishing videos on YouTube ok? Is posting on Bluesky ok? Can I put my kid in front of the TV one hour a week to watch Bluey? I know these questions could be perceived as a way to dodge the argument by pushing exaggerated whataboutist questions. What I'm trying to picture here is that while it's relevant to question each specific new usage, the current IT footprint is far bigger. Singling out AI is intellectually inconsistent if we aren't willing to sit down and try to think a bit more globally.

Also, the problem is essentially political, and the question we, as a civilization, should ask ourselves is "what ecological impact do we accept, and for what benefit?". And this question should be asked for every big social topic that has an ecological impact (public transportation, industry, agriculture, air travel).

### Legal/licensing aspects

One of the main questions I have been asking myself concerns the legal and licensing aspects.

#### The U.S. Case

In some other discussion, it was mentioned that the U.S. Congress had taken a position on this. In fact, it's the Congressional Research Service that issued a document for the benefit of Congress members (Congress has not produced any legislation on AI production and mixed content). The CRS produced this note[16] based on guidelines and decisions of the Copyright Office[13][14]. The USCO actually has a dedicated AI hub[15] with an additional preprint.

The takeaway from these documents is that, currently, in the U.S., AI-generated code is not eligible for copyright, as the USCO only recognizes copyright for human production. This means that without the ability to identify very precisely which parts of a production are AI-generated, the whole production (eg a piece of software) could be uncopyrightable. And even if the bits are clearly identified, this has direct implications when one wants to license one's code, as the way some FOSS licenses work doesn't allow parts of the software to be unlicensed. Let's take the GPL as an example. The GPL is what some outsiders call "contaminating". Essentially, if one wants to add an AI-generated contribution to GPL-licensed software, then these additions must also be licensed under the GPL, which is not possible in the U.S.! The CRS note concluded in particular that being the prompter does not make one the author, as being the author requires significant creativity and appropriation of the production. This tends to mean that only AI-generated content that has been significantly modified by a human could be deemed copyrightable.
The last part that matters is that the USCO concluded that it's not possible to evaluate whether the use of protected content to train models can be deemed "fair use" or not.

[13] https://www.copyright.gov/ai/ai_policy_guidance.pdf
[14] https://www.copyright.gov/ai/Copyright-and-Artificial-Intelligence-Part-2-Copyrightability-Report.pdf
[15] https://www.copyright.gov/ai/
[16] https://www.congress.gov/crs-product/LSB10922

#### The U.S. Case takeaways

From the U.S. example alone, we can infer that the copyright aspects are, at best, chaotic. If Debian starts delivering AI-generated content on its own platforms, then this content is currently not copyrightable in the U.S., where Debian is widely used. This has led to some discussions, eg elfutils[17], where the project simply decided to reject any LLM-generated content in contributions. This means that, best case scenario, if the project decides to accept AI-derived contributions, these contributions could only be indirect (either a human would need to modify them or integrate them in their own way, or they should be used as leverage to actually achieve the production itself).

[17] https://www.mail-archive.com/[email protected]/msg08882.html

#### And it's just for the U.S.

I focused this part on the U.S. situation, but things are not simpler in, eg, Europe. Let's cite some examples:

- For training: the EU AI Act allows the use of copyrighted content by default, except if the author explicitly opted out of the possibility[18]. Model providers must in return provide a sufficiently detailed summary of the content they used to train their model, and write a policy about copyright compliance (but until 2024, it was *Free Lunchware*);
- For output: it seems that purely AI-generated content is not copyrightable, same as in the U.S. - the content must be "human enough"[19].

(I'll note that this makes the EU particularly uncompetitive in the AI field, even though we still manage to produce some things - hello Mistral¹).

[18] https://iapp.org/news/a/the-eu-ai-act-and-copyrights-compliance
[19] https://www.europarl.europa.eu/thinktank/en/document/EPRS_BRI(2025)782585

¹ I hear in my earpiece that Mistral complained about European regulations?

### Consequences of the above: traceability, security, accountability

So, we saw that licensing is a can of worms (Claude suggested a minefield, pick your favourite comparison). Now, let's look at things from Debian's perspective. Let's assume we managed to write an AI policy we're proud of, something that accepts that the world changes, but tries to keep a focus on respecting licensing, ethics, etc. Even then, we're left with at least three intertwined questions to which I am unsure I have any relevant answer. All of these are classic cybersec questions.

The first main issue is to know who to yell at^W^W^W^Wwhere it comes from. Who is the author? How much of the code was actually written by a human? Did the contributor just use AI as a reviewer (as I did for this mail), did they ask it to produce code they then rewrote, edited and audited, or did they just prompt and copy the output? If we can't tell, we can't assess the licensing status of what we ship, and worse, we can't assess whether we can place any trust in the shipped content.

The second issue is the security of the code. AI-generated code tends to introduce (sometimes subtle) vulnerabilities, eg injections, poor memory handling, phantom dependencies.
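To make the "subtle" part concrete, here is a minimal, purely illustrative sketch (my own hypothetical example, not taken from any real Debian contribution): a string-built SQL query of the kind LLMs still happily produce, and which looks perfectly reasonable at a glance.

```python
import sqlite3

# Hypothetical helper, for illustration only. Building the query with an
# f-string makes it injectable: name = "x' OR '1'='1" returns every row,
# and UNION-based payloads can exfiltrate other tables.
def find_package(conn: sqlite3.Connection, name: str):
    query = f"SELECT name, version FROM packages WHERE name = '{name}'"
    return conn.execute(query).fetchall()

# The boring version a reviewer should insist on: a parametrized query.
def find_package_safe(conn: sqlite3.Connection, name: str):
    return conn.execute(
        "SELECT name, version FROM packages WHERE name = ?", (name,)
    ).fetchall()
```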
A competent human can catch these in review, but if neither the author nor the reviewer actually understands the code, we're shipping a black box with potentially big holes in it. Do we prefer insecure code written and pushed by humans, or insecure code pushed by Anthropic? (I KNOW, we prefer NEITHER.) But the question matters: when do we declare that we've lost control over the code we ship, and what do we tell our users?

Then there is accountability. I know the first question already contained some "who" in it, but that was merely to establish where the code comes from. The other part is what we can do if the thing explodes in our hands. Sure, we could say that if someone pushes code as their own work, they're responsible for it. It's sensible. But in practice, what will we do when this happens? We won't sue the model provider, but will we feel fine throwing it all on the person who had the AI generate the code?

None of this is new. In August, ZDNet published an article[20] about AI being used within the Linux kernel community, referring to a thread[21] that discusses these very auditability and accountability issues. The kernel community eventually adopted a policy[22]. If the kernel community felt the need for one, I would say one for Debian is probably long overdue.

[20] https://www.zdnet.com/article/ai-is-creeping-into-the-linux-kernel-and-official-policy-is-needed-asap/
[21] https://lore.kernel.org/ksummit/[email protected]/
[22] https://lore.kernel.org/ksummit/[email protected]/

### Dependence on private actors and ethical concerns

So, this one is probably more of a philosophical train of thought, but it matters, too. And I guess it's worth reading especially for those in favour of AI-generated code. Our common baseline for all being here is that we love FOSS. The thing is, currently, the most performant coding models are cloud-provided and closed. Therefore, some of us seem to be eager to depend on these proprietary tools to write actual FOSS. I know some of us use Windows, or play video games. I'm not trying to frame anyone as a hypocrite; we all try to reconcile our different needs and hobbies. But I wonder: is it sane to run Claude Code on your Debian laptop, on which some of you might have a private PGP key hanging around? Is it sane to promote FOSS and not try to deploy a platform relying on FOSS models (eg Deepseek Coder, Llama, Devstral 2) that would be able to write code? Is it sane, especially considering that the output of these private actors is mostly not copyrightable?

These questions echo the consistency arguments (far!) above, in the sense that we need to place a cursor (pun!) somewhere regarding what we accept and at what cost. I think that if and when the time to choose a policy comes, these questions should be in our heads, in particular because, ethically, using these tools implies endorsing their unfair use of a lot of protected content². Maybe part of this philosophical point is to consider whether we want "the best tool", or whether we accept that things will be a bit harder and try to recommend using "the most ethical tools".

² this reminds me of a funny discussion with an extreme libertarian acquaintance of mine, who explained to me the good these big AI companies were doing for the world, until I asked him how he reconciled his admiration with the fact that these companies only exist because they trampled on the intellectual property of millions by training their models with zero respect for copyrights. After all, the right to property is the cornerstone of libertarianism, isn't it?
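As a side note on the practical feasibility of the FOSS-model question above: pointing tooling at a locally hosted open-weight model is not hard. Here is a minimal sketch, assuming a local server (eg llama.cpp's server or Ollama) exposing an OpenAI-compatible chat endpoint; the URL, port and model name below are placeholders for whatever your own setup exposes.

```python
import json
import urllib.request

# Assumed local endpoint of an open-weight model server; adjust to your setup.
LOCAL_ENDPOINT = "http://localhost:8080/v1/chat/completions"

def ask_local_model(prompt: str, model: str = "devstral") -> str:
    # Send a single chat message and return the model's reply text.
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    request = urllib.request.Request(
        LOCAL_ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        answer = json.load(response)
    return answer["choices"][0]["message"]["content"]
```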
### Socio-economical aspects

AI tooling is currently deemed to boost productivity. This is partly hype pushed by the big AI sellers, but there is also truth to it: individuals with expertise currently manage to produce more, faster, with these tools. The main drawback is that inexperienced people produce garbage without knowing it, and that people tend to kick off irrelevant projects more easily, just because they can.

For Debian specifically, there are two concrete risks. First, well-meaning contributors deploying AI-powered tools or workflows that generate more code, more packages, more everything, while actually adding legacy and strain on our infrastructure and collective review capacity. Second, a flood of low-quality contributions from people who prompted but didn't review, increasing the burden on maintainers who are already stretched thin. And let's be clear, there will be a lot of these.

The broader societal question (do we produce five times as much for the sake of growth, or do we consider that reasonable use of collective resources matters?) is not Debian's to answer. But we should be aware that whatever policy we adopt sends a signal, and that signal matters.

### The political game of stability

In an ever-changing GNU/Linux world, Debian has something somewhat unique. Something that's also unique when we consider the IT world in general. We are slow. For some people it's a bad thing. But for many others, it's actually a good thing. Debian symbolizes stability. We take our time, we release "when it's ready", we take many months to integrate newcomers. This carries some risks (eg not getting enough new contributors), but it gives a lot of reassurance to our end users: they can go to sleep one day, come back the day after, and nothing has changed that much. Even simple things take some time with us.

If anything, the IT trend that represents the opposite (instability-and-what-the-hell-is-the-go-to-tool-this-week) was cloud for a long time, and now AI is clearly replacing it by a large margin. How can we reconcile AI-generated content and Debian? Would this be a betrayal of what makes Debian Debian? I understand that we regularly realize that we need to change, too, so this is a real question. I have no answer to give, but I'm happy to lay down the question, because it needs to be asked.

## AI is here - going forward within Debian

So, AI is here, including in Debian. I'd have preferred the question not to be asked, but now we can't avoid asking it: what do we do? How do we manage it? I wrote above that I might come up with a proposal, but I have none. I do have some preferences I'd like to see in a policy, if one were to be drafted:
- I'd really prefer it if those eager to use such tools refrained from using them when they don't really benefit from them, *id est*, made reasonable use of these tools and therefore of the resources these tools consume;
- I'd really prefer it if people used these tools only to achieve tasks in which they have expertise and which they could achieve themselves, so that they can review the work done;
- I'd really prefer it if any AI-generated content were identified as such;
- I'd really prefer it if such content were reappropriated and rewritten so that it can be copyrighted;
- I'd really prefer it if we could find a way to use a FOSS model that produces code of reasonable quality;
- Last but not least, I'd really prefer it if those totally against and those with an accelerationist position could stop caricaturing the other parties, and accept that nuance is the basis of a sane discussion. Because the day we stop being able to communicate is the day we will really be dead.

-- PEB