On 25/08/12 00:50, Rasmus Lerdorf wrote:
> In 8859-1 no chars are invalid so anything that doesn't get encoded will
> get passed through as-is. For example the byte 0xE0 is a perfectly valid
> 8859-1 character (à), but if the page is actually UTF-8 then this
> becomes the first byte of a 3-byte UTF-8 character. IE is famous for
> having a really weak Unicode parser and at least IE6/7 would see the
> 0xE0 and combine it with the next 2 bytes to form the UTF-8 char.
>
> So, if you had code like this:
>
> $str = htmlspecialchars($str);  // Assuming iso-8859-1
> echo '<a href="'.$str.'">';
>
> You now have a problem because if the last byte of $str was character
> 0xE0 now IE will swallow the closing " and > characters in your output
> leaving you in a very weird state. IE still thinks you are inside an
> attribute in the <a> tag, but you think you are outside in regular HTML
> mode and whatever you output next will now be filtered with the wrong
> context and you have a potential XSS.
>
> When htmlspecialchars() is in UTF-8 mode it will not allow invalid UTF-8
> byte sequences through and you are safe from this particular problem.
>
> -Rasmus
I see. Thank you very much.
Even worse, HTML5 doesn't seem to have any provision for this, as it works
with characters, not bytes. A user agent would have to protect itself by
treating such malformed UTF-8 sequences as a hard error instead of trying
to recover from them.
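
For the archive, here is a minimal sketch of the behaviour described above,
assuming PHP 5.4 or later (for ENT_SUBSTITUTE). The URL in $str is a made-up
example, and the preg_match('//u', ...) validity check is only one possible
way to reject malformed UTF-8, not something proposed in this thread.

```php
<?php
// Hypothetical untrusted input whose last byte is 0xE0, which in UTF-8 is
// the lead byte of a 3-byte sequence with no continuation bytes after it.
$str = "http://example.com/\xE0";

// ISO-8859-1 mode: every byte is a valid character, so 0xE0 passes through.
$latin1 = htmlspecialchars($str, ENT_QUOTES, 'ISO-8859-1');

// UTF-8 mode: the input is not valid UTF-8, so an empty string is returned.
$utf8 = htmlspecialchars($str, ENT_QUOTES, 'UTF-8');

// UTF-8 mode with ENT_SUBSTITUTE: the bad byte is replaced by U+FFFD.
$subst = htmlspecialchars($str, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');

// The markup built from the ISO-8859-1 result ends in the bytes E0 22 3E,
// which is not valid UTF-8. With the /u modifier, preg_match() returns
// false when the subject is not valid UTF-8.
$html  = '<a href="' . $latin1 . '">';
$valid = preg_match('//u', $html);

echo bin2hex(substr($latin1, -1)), "\n"; // last byte of the Latin-1 result
var_dump($utf8 === '');                  // was the invalid input rejected?
echo bin2hex(substr($subst, -3)), "\n";  // bytes of the substituted char
var_dump($valid);                        // validity of the generated markup
```

If the page really is served as UTF-8, passing 'UTF-8' explicitly (with or
without ENT_SUBSTITUTE) keeps the stray lead byte out of the output, so the
closing " and > cannot be swallowed.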



-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php
