On Tue, 26 Oct 2010 17:34:44 +0100, Geoffrey Sneddon
<foolist...@googlemail.com> wrote:
On 24/10/10 16:01, Gustavo André dos Santos Lopes wrote:
cataphract Sun, 24 Oct 2010 15:01:02 +0000
Revision: http://svn.php.net/viewvc?view=revision&revision=304705
Log:
> […]
- For html_entity_decode(), only valid numerical and named entities (as
defined above for htmlentities()/htmlspecialchars() + !double_encode)
are decoded.
But there is in this case one additional check. Entities that represent
non-SGML or otherwise invalid characters are not decoded. Note that, in
HTML5, U+000D is a valid literal character, but the entity &#x0D; is not
valid and is therefore not decoded.
This shouldn't be the behaviour. &#x0D; is invalid for document
conformance, but what is conforming doesn't matter here. What matters
here are the parser conformance requirements, which for tokenizing
character references are at
<http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html#tokenizing-character-references>.
This makes it quite clear that &#x0D; decodes to U+000D.
By not decoding &#x0D; it's not a conforming HTML5 implementation, as
there are two allowed behaviours when reaching a parse error such as
&#x0D;: stop processing entirely (effectively dropping the rest of the
string) or follow the spec's behaviour for that parse error.
Also, do you decode "&amp" without a trailing semi-colon? That's equally
required to be decoded by HTML5.
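
For comparison, here is a minimal sketch of the two cases above in Python, whose standard-library html.unescape() follows the HTML5 character reference rules. It is only a stand-in to illustrate the parsing behaviour being described; it says nothing about what html_entity_decode() does, and the sample strings are made up for the illustration.

```python
import html

# Numeric reference to U+000D: a parse error in HTML5, but the parser
# still produces a carriage return rather than leaving the text alone.
cr = html.unescape("a&#x0D;b")

# Named reference without the trailing semicolon: "&amp" is on the
# list of names that are still recognised in that form.
bare_amp = html.unescape("fish &amp chips")

print(repr(cr))        # 'a\rb'
print(repr(bare_amp))  # 'fish & chips'
```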
Thanks for clearing this up. I didn't know the C1 entities should be
translated to something else either (this HTML5 spec has a lot of
compatibility quirks...).
I will fix it in the near future.
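
The C1 translation mentioned above can be sketched the same way. This again uses Python's html.unescape() as a stand-in for an HTML5-conforming decoder, not PHP; the two code points are just examples picked from the C1 range.

```python
import html

# Numeric references in the C1 range (0x80-0x9F) are parse errors, and
# HTML5 remaps most of them as if they were Windows-1252 bytes.
euro = html.unescape("&#x80;")   # 0x80 maps to U+20AC EURO SIGN
dash = html.unescape("&#150;")   # 0x96 maps to U+2013 EN DASH

print(hex(ord(euro)))  # 0x20ac
print(hex(ord(dash)))  # 0x2013
```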
I presume the implementation decodes the "&amp;" in '<xmp>&amp;</xmp>',
which equally makes it non-conforming from an HTML5 POV…
That's the case, but html_entity_decode/htmlspecialchars are not really
designed for text that includes elements, they're for text nodes.
Bug: http://bugs.php.net/52860 (Open) htmlspecialchars/htmlentities
stripping invalid characters
Ah, this shows the mistake: you're following section 8.1 (writing HTML
documents) for character references instead of 8.2 (parsing HTML
documents), while when you're decoding entities you're parsing them…
--
Gustavo Lopes