I think internal string handling should adhere strictly to the specification, as you said. If some code points are not valid for a separate specification, protocol, etc., the conversion should be done in the functions that deal with those formats. For example, if the extension family xmlfoo does not like null bytes, BOMs, high surrogates, or whatever, then have xmlfoo_strip_invalid (bad name too ;p).
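
To make that concrete, here is a rough sketch of what such a format-specific helper might look like in userland. It is only an illustration: xmlfoo_strip_invalid is the hypothetical name from above, not an existing function, and I am assuming UTF-8 input, PHP 7 or later syntax, and PCRE with the /u modifier. It keeps only code points matched by the XML 1.0 Char production and drops everything else.

```php
<?php
// Hypothetical helper: xmlfoo_strip_invalid is an illustrative name, not a real PHP function.
// Assumes $text is UTF-8; with the /u modifier preg_replace() returns null on malformed input.
function xmlfoo_strip_invalid(string $text): string
{
    // Negation of the XML 1.0 Char production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    $pattern = '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u';

    $clean = preg_replace($pattern, '', $text);
    if ($clean === null) {
        // Well-formedness is a separate concern; refuse rather than guess.
        throw new InvalidArgumentException('Input is not well-formed UTF-8');
    }
    return $clean;
}

// Usage: the null byte and U+FFFF are removed, the tab is kept.
$safe = xmlfoo_strip_invalid("a\x00b\u{FFFF}\tc");
```

The point of the design is that the core string type stays fully Unicode-conformant, and only the XML-facing code decides what XML cannot carry. An HTML-facing helper could use a stricter character class without touching this one.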
-Chris

On Wed, May 28, 2008 at 9:23 PM, Edward Z. Yang <[EMAIL PROTECTED]> wrote:
> In PHP 6, incoming user data will automatically be in (unicode) form.
> (That is, assuming that the JIT functionality for converting gets
> implemented).
>
> One of the implementation details I'd like to consider involves non-XML
> and/or non-SGML codepoints inside markup. As per the Unicode
> specification, it is perfectly valid for a Unicode string to contain the
> codepoints U+0000 (null byte), U+FFFF (non-character) and friends.
> However, it is not valid for an XML document to contain these
> characters; either of these will result in a fatal error.
>
> Classically, it was very difficult for PHP scripts to implement UTF-8
> support completely correctly. Many implementations check that the UTF-8
> is well-formed, but neglect to strip out null-bytes and the like. I
> consider validation/filtering against the XML char production (or
> perhaps even more restrictive, as that allows some control characters
> not allowed in HTML).
>
> How should we go about making this easy in PHP 6? Perhaps a web_encoding
> (terrible name, I know) function is in order?
> --
> Edward Z. Yang GnuPG: 0x869C48DA
> HTML Purifier <http://htmlpurifier.org> Anti-XSS Filter
> [[ 3FA8 E9A9 7385 B691 A6FC B3CB A933 BE7D 869C 48DA ]]
>
> --
> PHP Internals - PHP Runtime Development Mailing List
> To unsubscribe, visit: http://www.php.net/unsub.php