Re: [PHP-DEV] Unicode and XML

Edward Z. Yang Thu, 29 May 2008 19:12:23 -0700

Chris Stockton wrote:
> I think that internal string handling should be very respectful of the
> specification, as you said. Perhaps for code points which are not valid
> for a separate specification, protocol, etc., the conversion should be
> done in the functions dealing with those formats. For example, if
> extension family xmlfoo does not like null bytes or BOMs or high
> surrogates, whatever, then have xmlfoo_strip_invalid (bad name too ;p).


The trouble is that no such function exists for HTML output. ;-) We'd be
adding another function to the htmlspecialchars($var) cadre. (A
counter-argument is that most people have defined a function _() or the
like for this sort of thing. I think PHP should be able to do this out
of the box, though.)

SOLUTION?
=========

Before I propose my solution, I believe we should distinguish between
functions like strip_tags() and the conjectured xml_strip_invalid().
Here are the primary differences:

strip_tags()
* Most appropriate on outbound, when the original data is preserved
* Makes clear changes to what the user sees
* Only used some of the time; universal application (e.g. magic quotes)
is not a good idea

xml_strip_invalid()
* Most appropriate on inbound, as these codepoints are not supposed to
be used at all.
* Most of these forbidden characters are invisible if/when they show up
(and don't cause fatal errors)
* What works for XML works for almost everything, except (notably)
binary data, which shouldn't be in Unicode anyway.
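
To make the conjectured function concrete, here is a rough userland
sketch. It is only an illustration: the name xml_strip_invalid() is the
hypothetical one from above, the input is assumed to be UTF-8, and I
have taken the allowed set from the XML 1.0 "Char" production. Throwing
on malformed UTF-8 is my own choice, not part of the proposal.

```php
<?php
// Hypothetical userland stand-in for the conjectured xml_strip_invalid().
// Assumption: $utf8 is a UTF-8 string. The character class below lists the
// codepoints allowed by the XML 1.0 "Char" production; everything else is
// removed.
function xml_strip_invalid(string $utf8): string
{
    $clean = preg_replace(
        '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u',
        '',
        $utf8
    );
    if ($clean === null) {
        // With the /u modifier, preg_replace() returns null when the
        // subject is not well-formed UTF-8.
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $clean;
}
```

Tabs, line feeds and carriage returns survive; NUL and the other C0
controls are dropped without any visible change to the text.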

My proposal is to introduce a new filter (for the filter extension)
which performs codepoint sanitization appropriate for HTML/XML contexts
(alternatively, this could be an option on the FILTER_DEFAULT filter,
which would be for Unicode strings, I assume). This filter would be
turned ON by default, and users could turn it off using a special
option. Thus, codepoint sanitization would work invisibly for users who
don't care, and would be accessible to users who do (i.e. those who
don't mind mucking around with unpaired surrogates or the like). [1]
gives quite a good explanation of what this is all about.
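
The closest thing available today is probably FILTER_CALLBACK, so here
is a sketch of the "on by default, opt out explicitly" behaviour built
on it. Both function names are made up for illustration, the input is
assumed to be UTF-8, and returning an empty string for malformed input
is just one possible policy; the proposed built-in filter does not exist.

```php
<?php
// Hypothetical sanitizer standing in for the proposed built-in filter.
// Assumption: values are UTF-8; the allowed set is the XML 1.0 Char range.
function strip_invalid_codepoints(string $value): string
{
    $clean = preg_replace(
        '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u',
        '',
        $value
    );
    // preg_replace() returns null on malformed UTF-8; reject the whole value.
    return $clean ?? '';
}

// Hypothetical accessor: sanitization is ON unless the caller opts out.
function fetch_value(string $raw, bool $sanitize = true): string
{
    if (!$sanitize) {
        return $raw; // caller explicitly asked for the untouched value
    }
    // FILTER_CALLBACK hands the value to the named callback and returns
    // whatever the callback returns.
    return filter_var($raw, FILTER_CALLBACK, ['options' => 'strip_invalid_codepoints']);
}
```

A caller who wants the raw value passes false as the second argument;
everyone else gets sanitized data without doing anything.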

The filter would also work auto-magically on traditional retrieval of
values using the $_VAR super-globals. It would hook in with the regular
JIT decoding of GPC (as described here [2]) and could not be turned off,
except by reading the input as binary (which I do not know how to do).
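
Until something like this exists in the engine, the effect can only be
approximated in userland by walking the input arrays after the fact.
The sketch below does that; sanitize_gpc() is a hypothetical name, the
data is assumed to be UTF-8, and the $get array merely stands in for a
real super-global. It is a post-hoc pass, not the decode-time hook
described above.

```php
<?php
// Hypothetical helper: recursively sanitize a GPC-style array.
// Only string values are touched; array keys are left as they are.
function sanitize_gpc(array $input): array
{
    $pattern = '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u';
    array_walk_recursive($input, function (&$value) use ($pattern) {
        if (is_string($value)) {
            // null (malformed UTF-8) is mapped to an empty string here.
            $value = preg_replace($pattern, '', $value) ?? '';
        }
    });
    return $input;
}

// Stand-in for $_GET, including a nested array as produced by tags[]=...
$get = ['name' => "Bob\x00", 'tags' => ["a\x0B", 'b']];
$clean = sanitize_gpc($get);
```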

As some extra functionality, filter should make it easy for users to
sanitize inputs to only contain codepoints of certain ranges. Because
this functionality would hook into the decoding process, it would be
much faster than using a TextIterator and hand-screening out codepoints.
Of course, this functionality should support Unicode properties. [3]
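
As a rough idea of what range and property screening looks like when
done by hand with PCRE today, consider the sketch below. The function
name keep_codepoints() and its calling convention are hypothetical, and
the input is assumed to be UTF-8. A decode-time filter would do the
same job without a second pass over the string.

```php
<?php
// Hypothetical helper: keep only codepoints that match a PCRE
// character-class body, e.g. '\p{L}\p{N}\p{Zs}' or '\x{0400}-\x{04FF}'.
// $allowed is inserted into the pattern unescaped, so it must come from
// trusted code, never from user input.
function keep_codepoints(string $utf8, string $allowed): string
{
    $clean = preg_replace('/[^' . $allowed . ']/u', '', $utf8);
    if ($clean === null) {
        // Malformed UTF-8 in the subject, or an invalid class body.
        throw new InvalidArgumentException('Invalid input or character class');
    }
    return $clean;
}

// Letters, numbers and space separators only: the zero-width space
// (U+200B) and the exclamation mark are dropped.
$name = keep_codepoints("Zo\u{EB} 42\u{200B}!", '\p{L}\p{N}\p{Zs}');

// A plain codepoint range: keep only the Cyrillic block.
$cyrillic = keep_codepoints("abc\u{0416}", '\x{0400}-\x{04FF}');
```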

It would be interesting to survey what other languages (such as Python)
do in said situations, although PHP is in somewhat of a unique position
due to its legacy. Let's do this, and let's do this right.

DISCLAIMER: I'm not sure anyone even cares about this issue. I mean,
surely, the PHP devs have bigger fish to fry. But I think it is
important, and I'll keep squeaking about it. I can RFC-ize this if
desired. Thanks all for reading this far.

[1] http://xml.coverpages.org/unicode30Ann19990918.html
[2] http://marc.info/?l=php-internals&m=116631089122369&w=2
[3] http://docs.php.net/manual/en/regexp.reference.php#regexp.reference.unicode

-- 
 Edward Z. Yang                        GnuPG: 0x869C48DA
 HTML Purifier <http://htmlpurifier.org> Anti-XSS Filter
 [[ 3FA8 E9A9 7385 B691 A6FC B3CB A933 BE7D 869C 48DA ]]

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php
