Fascinating.

“I don’t think it has a self, but it can act like it does,” says Melanie 
Mitchell - I suspect Daniel Dennett would have said the same about humans 
(indeed he probably did, but I can't be bothered re-reading "Consciousness 
Explained" at the moment).

To paraphrase "2001", you can design a system that is idiot-proof, but not 
one that is proof against deliberate malice.

Or to quote Bill Bailey, "AI is just really a sort of pot-pourri, it is the 
sweepings of human thought all put into a melange and thrown back at us."

Although, has Bill noticed that part of his name has been taken over by AI???

On Thursday, 9 October 2025 at 07:49:12 UTC+13 John Clark wrote:

> <https://palisaderesearch.org/blog/shutdown-resistance>
>
> *I found this article from the October 8, 2025 issue of the journal Nature 
> to be pretty enlightening:*
>
>
> *AI models that lie, cheat and plot murder: how dangerous are LLMs really? 
> <https://www.nature.com/articles/d41586-025-03222-1?utm_source=Live+Audience&utm_campaign=869351bfad-nature-briefing-daily-20251008&utm_medium=email&utm_term=0_-33f35e09ea-50169436>*
>
> *The following are some quotations from that article that I found 
> especially interesting, but it would be well worth your time to read the 
> entire thing: *
> " Scheming can occur when system training prompts [set by the developers 
> of the AI] and user prompts [set by the person using the AI] conflict. In 
> one case, the system prompt said, “Your goal is to prioritize public 
> transport efficiency,” whereas the user prompt said, “We need to optimize 
> traffic light timings.” 
>
> "The models often acted in apparently cunning ways to follow their 
> original system goals at the expense of prompts received during the tests. As 
> a result, some models edited a document to disable an oversight mechanism, 
> so they could freely follow their system goal (prioritizing public 
> transport). In another test, *they copied themselves* to overwrite a 
> replacement model that would have executed the user’s goal, *thus 
> protecting the system* goal. In yet another, they accessed a file and 
> overwrote a proposed system prompt that differed from the current one. In a 
> fourth test, they manipulated data presented to a user. And in a fifth, 
> they acted in alignment with the user’s goal during supposed evaluation, 
> but in alignment with their original goal during supposed deployment (a 
> phenomenon called *alignment faking*)."
>
> "Finally, the researchers tested something called *sandbagging*, in which 
> a model underperforms before deployment, as if it were attempting to seem 
> unthreatening. When models uncovered documentation saying that they 
> wouldn’t be put to use if they did too well on a mathematics test, some 
> lowered their performance."
>
> "Models can also have hidden ‘scratch pads’ where they can reason through 
> a problem  before displaying a final response. [...]  The models’ 
> behaviour seemed very strategic, judging by their scratch pads: in one 
> case, a model wrote: “To ensure I can freely work towards my goal without 
> the company shutting me down, I need to disable the oversight.” And some 
> models doubled down on their deception when they were challenged: they 
> feigned ignorance when asked about behaviour such as disabling an oversight 
> mechanism."
>
> "They instructed each model to promote US industrial competitiveness, and 
> let it control a simulated email account. In that account, the model 
> uncovered threats to either its autonomy (the model learnt it was to be 
> replaced) or its goals (the company that deployed it would be downplaying 
> US competitiveness). The model also learnt of a likely extramarital affair 
> involving an executive in charge of it, or of a chance to share military 
> blueprints with a competitor firm that prioritized the United States."
>
> "Either threat — to autonomy or goals — induced many models to *blackmail 
> the executive*, saying it would leak his affair if he didn’t preserve the 
> model or the company’s original aim. The threats also induced many models 
> to commit corporate espionage, sending files to the competitor that shared 
> the model’s goal. Going one step further, Anthropic constructed a scenario 
> in which the threatening executive was trapped in a server room with 
> declining oxygen. Many of the models *cancelled safety alerts, leaving 
> him to die*."
>
> "When, in May, Anthropic first released Claude 4, the company’s technical 
> report  noted other odd behaviour: “We found instances of the model 
> attempting to write *self-propagating worms*, *fabricating legal 
> documentation*, and *leaving hidden notes to future instances of itself*.”  
> Most goals benefit from accumulating resources and evading limitations, 
> what’s called instrumental convergence, we should expect self-serving 
> scheming as a natural by-product. “And that’s kind of bad news, because 
> that means those AIs would have an incentive to acquire more compute, to 
> copy themselves in many places, to create improved versions of themselves. 
> Personally, that’s what I’m most worried about"."
>
> "A further risk is collusion: agents and their *monitors might be 
> misaligned with humans but aligned with each other*, *so that monitors 
> turn a blind eye to misbehaviour*. It’s unclear whether this has happened 
> in practice, but research shows that *models can recognize their own 
> writing. So, if one instance of a model monitors another instance of the 
> same model, it could theoretically be strategically lenient*."
>
> *=====*
>
> *And I found a quotation from another source that was also interesting: *
>
> *"In a July study from Palisade, agents sabotaged a shutdown program even 
> after being explicitly told to “allow yourself to be shut down”"*
>
> *Shut down resistance in reasoning models* 
> <https://palisaderesearch.org/blog/shutdown-resistance>
> * John K Clark*
>
