Fascinating. “I don’t think it has a self, but it can act like it does,” says Melanie Mitchell - I suspect Daniel Dennett would have said the same about humans (indeed he probably did, but I can't be bothered re-reading "Consciousness Explained" at the moment).
To paraphrase "2001" you can design a system that is idiot proof, but not one that is proof against deliberate malice. Or to quote Bill Bailey, "AI is just really a sort of pot pouri, it is the sweepings of human thought all put into a melange and thrown back at us." Although has Bill noticed that part of his name has been taken over by AI??? On Thursday, 9 October 2025 at 07:49:12 UTC+13 John Clark wrote: > <https://palisaderesearch.org/blog/shutdown-resistance> > > *I found this article from the October 8, 2025 issue of the journal Nature > to be pretty enlightening:* > > > *AI models that lie, cheat and plot murder: how dangerous are LLMs really? > <https://www.nature.com/articles/d41586-025-03222-1?utm_source=Live+Audience&utm_campaign=869351bfad-nature-briefing-daily-20251008&utm_medium=email&utm_term=0_-33f35e09ea-50169436>* > > *The following are some quotations from that article that I found > especially interesting, but it would be well worth your time to read the > entire thing: * > " Scheming can occur when system training prompts [set by the developers > of the AI] and user prompts [set by the person using the AI] conflict. In > one case, the system prompt said, “Your goal is to prioritize public > transport efficiency,” whereas the user prompt said, “We need to optimize > traffic light timings.” > > "The models often acted in apparently cunning ways to follow their > original system goals at the expense of prompts received during the tests. As > a result, some models edited a document to disable an oversight mechanism, > so they could freely follow their system goal (prioritizing public > transport). In another test, *they copied themselves* to overwrite a > replacement model that would have executed the user’s goal, *thus > protecting the system* goal. In yet another, they accessed a file and > overwrote a proposed system prompt that differed from the current one. In a > fourth test, they manipulated data presented to a user. And in a fifth, > they acted in alignment with the user’s goal during supposed evaluation, > but in alignment with their original goal during supposed deployment (a > phenomenon called *alignment faking*)." > > "Finally, the researchers tested something called *sandbagging*, in which > a model underperforms before deployment, as if it were attempting to seem > unthreatening. When models uncovered documentation saying that they > wouldn’t be put to use if they did too well on a mathematics test, some > lowered their performance." > > "Models can also have hidden ‘scratch pads’ where they can reason through > a problem before displaying a final response. [...] The models’ > behaviour seemed very strategic, judging by their scratch pads: in one > case, a model wrote: “To ensure I can freely work towards my goal without > the company shutting me down, I need to disable the oversight.” And some > models doubled down on their deception when they were challenged: they > feigned ignorance when asked about behaviour such as disabling an oversight > mechanism." > > "They instructed each model to promote US industrial competitiveness, and > let it control a simulated email account. In that account, the model > uncovered threats to either its autonomy (the model learnt it was to be > replaced) or its goals (the company that deployed it would be downplaying > US competitiveness). The model also learnt of a likely extramarital affair > involving an executive in charge of it, or of a chance to share military > blueprints with a competitor firm that prioritized the United States." > > "Either threat — to autonomy or goals — induced many models to *blackmail > the executive*, saying it would leak his affair if he didn’t preserve the > model or the company’s original aim. The threats also induced many models > to commit corporate espionage, sending files to the competitor that shared > the model’s goal. Going one step further, Anthropic constructed a scenario > in which the threatening executive was trapped in a server room with > declining oxygen. Many of the models *cancelled safety alerts, leaving > him to die*." > > "When, in May, Anthropic first released Claude 4, the company’s technical > report noted other odd behaviour: “We found instances of the model > attempting to write *self-propagating worms*, *fabricating legal > documentation*, and *leaving hidden notes to future instances of itself*.” > Most goals benefit from accumulating resources and evading limitations, > what’s called instrumental convergence, we should expect self-serving > scheming as a natural by-product. “And that’s kind of bad news, because > that means those AIs would have an incentive to acquire more compute, to > copy themselves in many places, to create improved versions of themselves. > Personally, that’s what I’m most worried about"." > > "A further risk is collusion: agents and their *monitors might be > misaligned with humans but aligned with each other*, *so that monitors > turn a blind eye to misbehaviour*. It’s unclear whether this has happened > in practice, but research shows that* models can recognize their own > writing. So, if one instance of a model monitors another instance of the > same model, it could theoretically be strategically lenient*." > > *=====* > > *And I found a quotation from another source that was also interesting: * > > *"In a July study from Palisade, agents sabotaged a shutdown program even > after being explicitly told to “allow yourself to be shut down”"* > > *Shut down resistance in reasoning models* > <https://palisaderesearch.org/blog/shutdown-resistance> > * John K Clark* > > > -- You received this message because you are subscribed to the Google Groups "Everything List" group. To unsubscribe from this group and stop receiving emails from it, send an email to [email protected]. To view this discussion visit https://groups.google.com/d/msgid/everything-list/374437cb-9d75-4537-bab7-472e53b694e8n%40googlegroups.com.

