Hi Dennis

On 04/09/2023 21:54, Dennis Snell wrote:
> Thanks for the proposal Niels,
> 
> I’ve dealt with my own grief working through issues in DOMDocument and 
> wanting it to work but finding it inadequate.
> 
>> HTML5
> 
> This would be a great starting point; I would love it if we took the 
> opportunity to fix named character reference decoding, as PHP has (to my 
> knowledge) never respected (at least in HTML5) that they decode differently 
> inside attributes than they do inside markup, considering rules such as the 
> ambiguous ampersand and decode errors.
> 
> It’s also been frustrating that DOMDocument parses tags in RCDATA sections 
> where they don’t exist, such as in TITLE or TEXTAREA elements, escapes 
> certain types of invalid comments so that they appear rendered in the saved 
> document, and misses basic semantic rules (e.g. creating a BUTTON element as 
> a child of a BUTTON element instead of closing out the already-open BUTTON).

With this proposal, which brings a real HTML5 parser, the problems mentioned 
above will fortunately be a thing of the past :)
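
As a rough illustration of the attribute-versus-text difference mentioned 
above, here is a small sketch. It is written in Python only for brevity and is 
not how ext/dom or the proposed parser implements decoding. The function name 
`decode_named_refs` is made up for this example, and it handles named 
references only (no numeric references, no error reporting). It relies on 
Python's `html.entities.html5` table, which contains both the `;`-terminated 
names and the legacy names that may appear without a semicolon.

```python
from html.entities import html5

# Longest name in the table, so we never scan further than necessary.
_MAX_NAME = max(len(name) for name in html5)


def decode_named_refs(text: str, in_attribute: bool) -> str:
    """Decode named character references, longest match first.

    Inside an attribute value, a match that lacks the trailing ';' and is
    followed by '=' or an ASCII alphanumeric is left untouched.
    """
    out = []
    i = 0
    while i < len(text):
        if text[i] != "&":
            out.append(text[i])
            i += 1
            continue

        # Try the longest candidate first, then shrink.
        match = None
        for length in range(min(_MAX_NAME, len(text) - i - 1), 0, -1):
            candidate = text[i + 1 : i + 1 + length]
            if candidate in html5:
                match = candidate
                break

        if match is None:
            out.append("&")
            i += 1
            continue

        end = i + 1 + len(match)
        if in_attribute and not match.endswith(";") and end < len(text):
            nxt = text[end]
            if nxt == "=" or (nxt.isascii() and nxt.isalnum()):
                # Attribute-only exception: keep the ampersand literal.
                out.append("&")
                i += 1
                continue

        out.append(html5[match])
        i = end
    return "".join(out)


# Same input, different result depending on context:
print(decode_named_refs("&notit;", in_attribute=False))    # ¬it;
print(decode_named_refs("&notit;", in_attribute=True))     # &notit;
print(decode_named_refs("a=1&copy=2", in_attribute=True))  # a=1&copy=2
```

The last call shows why the distinction matters in practice: query strings in 
`href` attributes would otherwise be corrupted by legacy names such as `copy`.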

> 
> I’d like to share some of what a few of us have been working on inside 
> WordPress, which is to build a conformant streaming HTML5 parser:
>  - https://developer.wordpress.org/reference/classes/wp_html_tag_processor/ 
>  - https://make.wordpress.org/core/2023/08/19/progress-report-html-api/ 
> 
> It’s just food for thought right now because adding HTML5 support to 
> DOMDocument would benefit everyone, but we decided we had common need in PHP 
> to work with HTML not in a DOM, but in a streaming fashion, one with very 
> little runtime overhead. My long-term plan has been to get a good grasp for 
> the interface needs and thoroughly test it within the WordPress community and 
> then propose its inclusion into PHP. It’s been incredibly handy so far, and 
> on my laptop runs at around 20 MB/s, which is not great, but good enough for 
> many needs. My naive C port runs on the same laptop at around 80 MB/s and I 
> believe that we can likely triple or quadruple that speed again if any of us 
> working on it knew how to take advantage of SIMD intrinsics.
> 
> It tries to accomplish a few goals:
>  - be fast enough
>  - interpret HTML as an HTML5-compliant browser will
>  - find specific locations within an HTML document and then read or modify 
> them
>  - pass through any invalid HTML it encounters for the browser to resolve/fix 
> unless modifying the part of the document containing those invalid 
> constructions
> 

I saw someone link this on Reddit today; it's a really nice project!
It reminds me of Cloudflare's lol-html, which is also a streaming parser used 
to modify and sanitize documents linearly.
I believe this could be a great complementary addition, since it solves a 
different problem than ext/dom does.

> I only bring up this different interface because once we started digging deep 
> into DOMDocument we found that the problems with it were far from 
> superficial; that there is a host of problems and a mismatched interface to 
> our common needs. It has surprised me that PHP, the language of the web, has 
> had such trouble handling HTML, the language of the web, and we wanted to 
> completely resolve this issue once and for all within WordPress so we can 
> clean up decades-old problems with encoding, decoding, security, and 
> sanitization.

Yes, I was also quite surprised by the lack of support for modern web features, 
and by the problems with spec compliance.
I only recently got into maintaining ext/dom, so there's still a lot of work to 
do.
I already started adding more DOM APIs in the 8.3 release cycle and plan to 
continue that effort in 8.4.
Another major project I want to do for 8.4, besides HTML5 support, is fixing 
the spec compliance issues in an opt-in manner. This would help with security 
and sanitization problems (HTML5 should help with the encoding and decoding).

> 
> Warmly,
> Dennis Snell

Kind regards
Niels

> 
>> On Sep 2, 2023, at 12:41 PM, Niels Dossche <dossche.ni...@gmail.com> wrote:
>>
>> I'm opening the discussion for my RFC "DOM HTML5 parsing and serialization 
>> support".
>> https://wiki.php.net/rfc/domdocument_html5_parser 
>>
>> Kind regards
>> Niels

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: https://www.php.net/unsub.php
