> That is not necessarily good advice in security issues.
What harm can it do? They will not be characters that are relevant to any
syntactic analysis.
Consider: parser 1 knows that a UTF-8 sequence can have
at most 6 bytes, and sees an illegal 5-byte sequence.
Parser 2 knows that a UTF-8 sequence can have at most
4 bytes, and sees, in the very same bytes, an illegal
4-byte sequence followed by an ASCII symbol.
A difference in how two parsers interpret the same byte
sequence always has security implications.
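
To make the scenario concrete, here is a rough sketch of two such
parsers. Everything in it is an assumption made for illustration: the
names lead_length and naive_decode are hypothetical, and the recovery
policy (replace the whole announced sequence with U+FFFD and skip that
many bytes) is just one naive choice among several, not what any
particular library does.

```python
def lead_length(byte, max_len):
    """Length announced by a lead byte, or 0 if this parser does not accept it."""
    if byte < 0x80:
        return 1                      # plain ASCII
    n, mask = 0, 0x80
    while byte & mask:                # count the leading 1 bits
        n += 1
        mask >>= 1
    if n < 2 or n > max_len:          # stray continuation byte, or too long
        return 0
    return n


def naive_decode(data, max_len):
    """Hypothetical decoder: max_len=6 is the old scheme, max_len=4 the new one."""
    out = []
    i = 0
    while i < len(data):
        n = lead_length(data[i], max_len)
        if n == 0:
            out.append("\ufffd")      # unknown byte: replace it, move on by one
            i += 1
            continue
        chunk = data[i:i + n]
        well_formed = len(chunk) == n and all(0x80 <= b <= 0xBF for b in chunk[1:])
        if well_formed:
            try:
                out.append(chunk.decode("utf-8"))
            except UnicodeDecodeError:
                out.append("\ufffd")  # e.g. overlong, or a 5/6-byte sequence
        else:
            out.append("\ufffd")
        i += n                        # assumption: skip the whole announced length
    return "".join(out)


# An illegal sequence (lead byte 0xF8 announces 5 bytes) followed by ASCII.
data = b"\xf8\x80\x80\x80<script>"

seen_by_parser_1 = naive_decode(data, max_len=6)  # swallows the "<"
seen_by_parser_2 = naive_decode(data, max_len=4)  # keeps the "<"
```

If parser 1 is the filter that looks for "<" and parser 2 is the
consumer behind it, the filter passes text in which the consumer
still finds "<script>".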
--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/