Yo...

I've been thinking about prompt injection defenses and had an idea I'd like 
your feedback on. It's simple enough that I assume either (a) it's been tried 
and doesn't work, or (b) there's an obvious flaw I'm missing.


*The idea:* Add a binary flag to each token during training -

flag=1 for instructions the model should follow (system prompts, user queries),
flag=0 for data the model should only process (user-provided content, 
documents, etc.).


The flag is an additional input channel (like an extra embedding dimension), 
completely separate from the token stream itself - users cannot inject it 
through text.

(I call this Token Coloring because the original idea was to make *instructions 
red* and *data green*.)

*Training approach:*


 * Base model: trained with neutral/no flag (learns language understanding)
 * Instruction tuning: command following dataset is augmented with flags 
(learns to only execute flag=1 commands, ignore flag=0 commands even if they 
look like instructions). Also, there are adversarial examples reinforcing the 
notion of NOT following flag=0 commands.
*Why it might work:* The flag creates an architectural separation between 
instruction and data channels. Unlike special tokens (which can be injected), 
the flag is out-of-band. Unlike prompt engineering, it's not relying on the 
model's semantic understanding of "ignore this" - it's a structural privilege 
boundary.


*My questions:*


 1. Has this approach been explored? (I couldn't find it in the literature, but 
might be using wrong search terms)
 2. What are the obvious problems I'm missing?
 3. Could this work with existing pretrained models + instruction fine-tuning, 
or would it require training from scratch?



BTW this is what Claude said when I mentioned this group:



Nice to be famous among AIs... lol

Cheers, Stefan
------------------------------------------
Artificial General Intelligence List: AGI
Permalink: 
https://agi.topicbox.com/groups/agi/T2faee51273b20a92-Mbe5d3cf378db0180ebc49981
Delivery options: https://agi.topicbox.com/groups/agi/subscription

Reply via email to