Re: [PHP-DEV] Re: [PHP-CVS] svn: /php/php-src/trunk/ext/standard/ basic_functions.c basic_functions.h html.c html.h html_tables/ents_basic.txt html_tables/ents_basic_apos.txt html_tables/ents_html401.txt html_tables/ents_html5.txt html_tables/ents_xhtml.txt html_tables/html_table_gen.php html_tables/mappings/8859-1.TXT html_tables/mappings/8859-15.TXT html_tables/mappings/8859-5.TXT html_tables/mappings/CP1251.TXT html_tables/mappings/CP1252.TXT html_tables/mappings/CP866.TXT html_tables/mappings/KOI8-R.TXT html_tables/map

Gustavo Lopes Tue, 26 Oct 2010 09:50:53 -0700

On Tue, 26 Oct 2010 17:34:44 +0100, Geoffrey Sneddon <[email protected]> wrote:

> On 24/10/10 16:01, Gustavo André dos Santos Lopes wrote:
>> cataphract                               Sun, 24 Oct 2010 15:01:02 +0000
>> Revision: http://svn.php.net/viewvc?view=revision&revision=304705
>>
>> Log:
>> […]
>> - For html_entity_decode(), only valid numerical and named entities (as defined above for htmlentities()/htmlspecialchars() + !double_encode) are decoded.
>> But there is in this case one additional check. Entities that represent
>> non-SGML or otherwise invalid characters are not decoded. Note that, in HTML5, U+000D is a valid literal character, but the entity &#x0D is not valid and is therefore not decoded.
>
> This shouldn't be the behaviour. &#x0D; is invalid for document conformance, but what is conforming doesn't matter here. What matters here is the parser conformance requirements, which for tokenizing character references is <http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html#tokenizing-character-references>. This makes it quite clear that &#x0D; decodes to U+000D.
>
> By not decoding &#x0D; it's not a conforming HTML5 implementation, as there are two allowed behaviours when reaching a parse error such as this: stop processing entirely (effectively dropping the rest of the string) or follow the spec's behaviour for that parse error.
>
> Also, do you decode "&amp" without a trailing semi-colon? That's equally required to be decoded by HTML5.
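
The semicolon-less case raised above can be sketched in C as follows. This is only an illustration under stated assumptions: match_legacy_ref and the legacy_refs table are hypothetical names, not code from html.c, and the table holds just four of the names HTML5 accepts without a trailing semicolon. It covers text content only; attribute values follow additional rules.

```c
#include <stddef.h>
#include <string.h>

/* A few of the legacy names HTML5 accepts without a trailing semicolon.
   The real table is much larger; this subset is for illustration only. */
static const struct { const char *name; unsigned int cp; } legacy_refs[] = {
    { "amp", 0x26 }, { "lt", 0x3C }, { "gt", 0x3E }, { "quot", 0x22 }
};

/* Hypothetical helper: s points just past the '&'. On a match, returns the
   code point and stores the number of bytes consumed (the name plus an
   optional ';') in *consumed. Returns 0 when no name in the table matches. */
unsigned int match_legacy_ref(const char *s, size_t *consumed)
{
    size_t i;
    for (i = 0; i < sizeof legacy_refs / sizeof legacy_refs[0]; i++) {
        size_t len = strlen(legacy_refs[i].name);
        if (strncmp(s, legacy_refs[i].name, len) == 0) {
            /* The semicolon is consumed when present but is not required. */
            *consumed = len + (s[len] == ';' ? 1 : 0);
            return legacy_refs[i].cp;
        }
    }
    return 0;
}
```

With this sketch, both "amp" and "amp;" yield U+0026; only the number of consumed bytes differs (3 versus 4).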

Thanks for clearing this up. I didn't know the C1 entities should be translated to something else either (this HTML5 spec has a lot of compatibility quirks...).
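
For reference, a minimal C sketch of the numeric-reference mapping being discussed, as I understand the tokenizer rules: numbers in the C1 range 0x80..0x9F are replaced by their Windows-1252 counterparts, 0x00 and unrepresentable values become U+FFFD, and everything else (including 0x0D) is emitted as-is. The function name numeric_ref_to_codepoint is an assumption made for illustration and is not taken from html.c.

```c
#include <stdint.h>

/* Replacements for numeric references 0x80..0x9F (Windows-1252 mapping).
   Slots with no Windows-1252 character keep their C1 value. */
static const uint32_t c1_replacements[32] = {
    0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
    0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178
};

/* Hypothetical helper: map the number found in a numeric character
   reference to the code point a parser would emit. */
uint32_t numeric_ref_to_codepoint(uint32_t n)
{
    if (n == 0x00)
        return 0xFFFD;                      /* NUL becomes U+FFFD */
    if (n >= 0x80 && n <= 0x9F)
        return c1_replacements[n - 0x80];   /* C1 range: use the table */
    if ((n >= 0xD800 && n <= 0xDFFF) || n > 0x10FFFF)
        return 0xFFFD;                      /* surrogates, out of range */
    return n;                               /* everything else, incl. 0x0D */
}
```

Under these assumptions, 0x0D maps to U+000D and 0x80 maps to U+20AC (the euro sign), which is the "translated to something else" case.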


I will fix it in the near future.

> I presume the implementation decodes the "&amp;" in '<xmp>&amp;</xmp>', which equally makes it non-conforming from an HTML5 POV…

That's the case, but html_entity_decode/htmlspecialchars are not really designed for text that includes elements; they're for text nodes.

>> Bug: http://bugs.php.net/52860 (Open) htmlspecialchars/htmlentities stripping invalid characters
>
> Ah, this shows the mistake: you follow section 8.1 (writing HTML documents) for character references instead of 8.2 (parsing HTML documents), while when you're decoding entities you're parsing it…



--
Gustavo Lopes

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php
