On 25/08/12 00:50, Rasmus Lerdorf wrote:
> In 8859-1 no chars are invalid so anything that doesn't get encoded will
> get passed through as-is. For example the byte 0xE0 is a perfectly valid
> 8859-1 character (à), but if the page is actually UTF-8 then this
> becomes the first byte of a 3-byte UTF-8 character. IE is famous for
> having a really weak Unicode parser and at least IE6/7 would see the
> 0xE0 and combine it with the next 2 bytes to form the UTF-8 char.
>
> So, if you had code like this:
>
>   $str = htmlspecialchars($str); // Assuming iso-8859-1
>   echo '<a href="'.$str.'">';
>
> You now have a problem because if the last byte of $str was character
> 0xE0 now IE will swallow the closing " and > characters in your output
> leaving you in a very weird state. IE still thinks you are inside an
> attribute in the <a> tag, but you think you are outside in regular HTML
> mode and whatever you output next will now be filtered with the wrong
> context and you have a potential XSS.
>
> When htmlspecialchars() is in UTF-8 mode it will not allow invalid UTF-8
> byte sequences through and you are safe from this particular problem.
>
> -Rasmus

I see. Thank you very much.

Even worse, HTML5 doesn't seem to have any provision for that, as it
works with characters. A user agent would have to protect itself from
this by treating that kind of malformed UTF-8 sequence as a hard error
instead of trying to recover from it.
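For anyone following along, here is a minimal sketch of the difference
described above. It assumes PHP 5.4 or later, and the value of $str is
made up for illustration (a string whose last byte is a lone 0xE0). The
flags are passed explicitly because the default flags differ between PHP
versions.

```php
<?php
// Hypothetical attacker-controlled value ending in a lone 0xE0 byte.
$str = "x\xE0";

// ISO-8859-1 mode: every byte is a valid character, so 0xE0 passes
// through untouched and ends up directly in front of the closing quote.
$latin1 = htmlspecialchars($str, ENT_QUOTES, 'ISO-8859-1');

// UTF-8 mode: 0xE0 announces a 3-byte sequence that never arrives, so
// the input is invalid. Without ENT_IGNORE or ENT_SUBSTITUTE the
// function returns an empty string instead of passing the byte along.
$utf8 = htmlspecialchars($str, ENT_QUOTES, 'UTF-8');

// An empty pattern with the /u modifier is a cheap validity check:
// preg_match() returns false when the subject is not valid UTF-8.
$isValidUtf8 = preg_match('//u', $str) === 1;

echo bin2hex($latin1), "\n";       // hex dump of the ISO-8859-1 result
echo strlen($utf8), "\n";          // length of the UTF-8 result
var_dump($isValidUtf8);
```

If the page is really served as UTF-8, only the second call keeps the
stray lead byte out of the attribute value.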
--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php