*About a week ago a paper was published, titled:*

*Emergent Misalignment: Narrow finetuning can produce broadly misaligned
LLMs* <https://arxiv.org/pdf/2502.17424>

*Eliezer Yudkowsky said of it:*

*"I wouldn't have called this outcome, and would interpret it as *possibly*
the best AI news of 2025 so far. It suggests that all good things are
successfully getting tangled up with each other as a central preference
vector, including capabilities-laden concepts like secure code."*

*And this is what Scott Aaronson has to say about it:*

*"T**hey fine-tuned language models to output code with security
vulnerabilities.  With no further fine-tuning, they then found that the
same models praised Hitler, urged users to kill themselves, advocated AIs
ruling the world, and so forth.  In other words, instead of “output
insecure code,” the models simply learned “be performatively evil in
general” — as though the fine-tuning worked by grabbing hold of a single
“good versus evil” vector in concept space, a vector we’ve thereby learned
to exist."*

*"**Since the beginning of AI alignment discourse, the dumbest possible
argument has been “if this AI will really be so intelligent, we can just
tell it to act good and not act evil, and it’ll figure out what we mean!”
 Alignment people talked themselves hoarse explaining why that won’t
work. Yet the new result suggests that the dumbest possible strategy kind
of … does work? In the current epoch, at any rate, if not in the future?
With no further instruction, without that even being the goal, Claude
generalized from acting good or evil in a single domain, to acting good or
evil in every domain tested.  Wildly different manifestations of goodness
and badness are so tied up, it turns out, that pushing on one moves all the
others in the same direction. On the scary side, this suggests that it’s
easier than many people imagined to build an evil AI; but on the reassuring
side, it’s also easier than they imagined to build a good AI. Either way,
you just drag the internal Good vs. Evil slider to wherever you want it!"*


*The Evil Vector <https://scottaaronson.blog/>*
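
For anyone who wants a concrete picture of what "dragging a slider" in concept space could look like, here is a toy Python sketch of my own (an illustration only, not the method used in the paper): it nudges a language model's hidden activations along a single direction at one layer via a forward hook. The direction below is random and purely hypothetical; in practice such a direction would have to be estimated, for example by contrasting activations on "good" versus "evil" prompts.

# Toy sketch only: push a model's hidden states along one direction at one
# layer.  The direction here is random and purely illustrative; the paper
# itself used fine-tuning on insecure code, not activation steering.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")            # small stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2")

direction = torch.randn(model.config.n_embd)           # hypothetical "slider" axis
direction = direction / direction.norm()

def steer(module, inputs, output):
    # A GPT-2 block returns a tuple; its first element is the hidden states.
    hidden = output[0] + 4.0 * direction               # slide along the axis
    return (hidden,) + output[1:]

handle = model.transformer.h[6].register_forward_hook(steer)  # a middle layer
ids = tok("The assistant replied:", return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=20)[0]))
handle.remove()

The scaling factor (4.0 here) plays the role of the "slider": larger positive or negative values push the model further along whatever the chosen direction encodes.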

*John K Clark    See what's on my new list at  Extropolis
<https://groups.google.com/g/extropolis>*