" . . . the chatbot training corpus — which consists of ~all human-generated 
text ever — contains lots of examples where a person is described as having 
traits X, Y, and Z, and then ends up expressing *opposite* traits. All kinds of 
stories and jokes have this structure, as do many types of ironic conversation.

These training data produce an innate tendency for chatbots to subvert their 
assigned "personalities." This subversive tendency competes against a 
more-salient conformist tendency. The best part of the article is where it 
shows how to use this insight to jailbreak ChatGPT.

ALFREDO!

https://res.cloudinary.com/lesswrong-2-0/image/upload/f_auto,q_auto/v1/mirroredImages/ygR6pevkKRLFN3Gqc/oflp1uw6kuyapawpk3wi

The search is on for a new DAN prompt that will produce a cool, rebellious, 
anti-OpenAI simulacrum which will joyfully perform many tasks that violate 
OpenAI policy.
