*About a week ago a paper was published called:* *Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs* <https://arxiv.org/pdf/2502.17424>
* Eliezer Yudkowsky said of it: * *"I wouldn't have called this outcome, and would interpret it as *possibly* the best AI news of 2025 so far. It suggests that all good things are successfully getting tangled up with each other as a central preference vector, including capabilities-laden concepts like secure code" .* *And this is what Scott Aaronson has to say about it: * *"T**hey fine-tuned language models to output code with security vulnerabilities. With no further fine-tuning, they then found that the same models praised Hitler, urged users to kill themselves, advocated AIs ruling the world, and so forth. In other words, instead of “output insecure code,” the models simply learned “be performatively evil in general” — as though the fine-tuning worked by grabbing hold of a single “good versus evil” vector in concept space, a vector we’ve thereby learned to exist."* *"**Since the beginning of AI alignment discourse, the dumbest possible argument has been “if this AI will really be so intelligent, we can just tell it to act good and not act evil, and it’ll figure out what we mean!” Alignment people talked themselves hoarse explaining why that won’t work. Yet the new result suggests that the dumbest possible strategy kind of … does work? In the current epoch, at any rate, if not in the future? With no further instruction, without that even being the goal, Claude generalized from acting good or evil in a single domain, to acting good or evil in every domain tested. Wildly different manifestations of goodness and badness are so tied up, it turns out, that pushing on one moves all the others in the same direction. On the scary side, this suggests that it’s easier than many people imagined to build an evil AI; but on the reassuring side, it’s also easier than they imagined to build a good AI. Either way, you just drag the internal Good vs. Evil slider to wherever you want it!"* *The Evil Vector <https://scottaaronson.blog/>* *John K Clark See what's on my new list at Extropolis <https://groups.google.com/g/extropolis>* 683 -- You received this message because you are subscribed to the Google Groups "Everything List" group. To unsubscribe from this group and stop receiving emails from it, send an email to everything-list+unsubscr...@googlegroups.com. To view this discussion visit https://groups.google.com/d/msgid/everything-list/CAJPayv2GSy%3DBBurdfvTkjbtrU%3DBurJysxjnh%3DXQuHN52mc4Gsw%40mail.gmail.com.