Hi Dennis

On 04/09/2023 21:54, Dennis Snell wrote:
> Thanks for the proposal Niels,
>
> I’ve dealt with my own grief working through issues in DOMDocument and
> wanting it to work but finding it inadequate.
>
>> HTML5
>
> This would be a great starting point; I would love it if we took the
> opportunity to fix named character reference decoding, as PHP has (to my
> knowledge) never respected (at least in HTML5) that they decode differently
> inside attributes than they do inside markup, considering rules such as the
> ambiguous ampersand and decode errors.
>
> It’s also been frustrating that DOMDocument parses tags in RCDATA sections
> where they don’t exist, such as in TITLE or TEXTAREA elements, escapes
> certain types of invalid comments so that they appear rendered in the saved
> document, and misses basic semantic rules (e.g. creating a BUTTON element as
> a child of a BUTTON element instead of closing out the already-open BUTTON).
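
For readers following along, the attribute-versus-text difference described above can be sketched in a few lines. This is Python rather than PHP, it is only an illustration and not part of the RFC, and `unescape_attribute` is a hypothetical helper name. It assumes the HTML5 rule that a named reference missing its semicolon is left undecoded inside an attribute value when the next character is `=` or an ASCII letter or digit. Numeric references are not handled here.

```python
import html
import re
from html.entities import html5  # maps names such as "not" and "notin;" to text

# Candidate for a named reference: "&", then up to 32 name characters,
# then an optional semicolon (the same shape html.unescape looks for).
_NAMED = re.compile(r"&([^\t\n\f <&#;]{1,32};?)")


def unescape_attribute(value: str) -> str:
    """Decode named references as in an attribute value (hypothetical sketch)."""

    def replace(match: re.Match) -> str:
        candidate = match.group(1)
        # Try the longest known reference name first.
        for end in range(len(candidate), 1, -1):
            name = candidate[:end]
            if name not in html5:
                continue
            rest = candidate[end:]
            # Character that follows the matched name, if any.
            following = rest[:1] or value[match.end():match.end() + 1]
            unterminated = not name.endswith(";")
            blocks = following == "=" or (following.isascii() and following.isalnum())
            if unterminated and blocks:
                return match.group(0)  # attribute exception: keep the text as-is
            return html5[name] + rest
        return match.group(0)  # no known name: leave the ampersand alone

    return _NAMED.sub(replace, value)


# Text context: "&not" is decoded even without its semicolon.
text_result = html.unescape("&notit;")
# Attribute context: the same input is left untouched.
attr_result = unescape_attribute("&notit;")
# A query string inside an attribute keeps "&copy=2" literal.
url_result = unescape_attribute("?x=1&copy=2")
```

The text decoder turns `&notit;` into `¬it;`, while the attribute sketch leaves it alone, which is why a single decoding routine cannot serve both contexts.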

With this proposal, which brings a real HTML5 parser, the problems mentioned
above will fortunately be a thing of the past :)

>
> I’d like to share some of what a few of us have been working on inside
> WordPress, which is to build a conformant streaming HTML5 parser:
> - https://developer.wordpress.org/reference/classes/wp_html_tag_processor/
> - https://make.wordpress.org/core/2023/08/19/progress-report-html-api/
>
> It’s just food for thought right now because adding HTML5 support to
> DOMDocument would benefit everyone, but we decided we had common need in PHP
> to work with HTML not in a DOM, but in a streaming fashion, one with very
> little runtime overhead. My long-term plan has been to get a good grasp for
> the interface needs and thoroughly test it within the WordPress community and
> then propose its inclusion into PHP. It’s been incredibly handy so far, and
> on my laptop runs at around 20 MB/s, which is not great, but good enough for
> many needs. My naive C port runs on the same laptop at around 80 MB/s and I
> believe that we can likely triple or quadruple that speed again if any of us
> working on it knew how to take advantage of SIMD intrinsics.
>
> It tries to accomplish a few goals:
> - be fast enough
> - interpret HTML as an HTML5-compliant browser will
> - find specific locations within an HTML document and then read or modify
>   them
> - pass through any invalid HTML it encounters for the browser to resolve/fix
>   unless modifying the part of the document containing those invalid
>   constructions
>

I saw someone link this on Reddit today; it's a really nice project!
It reminds me of Cloudflare's lol-html, which is also a streaming parser used
to modify and sanitize documents linearly.
I believe this could be a great addition, as it solves a different problem
than the one the ext/dom extension solves.
So I think it would be a great complementary addition.

> I only bring up this different interface because once we started digging deep
> into DOMDocument we found that the problems with it were far from
> superficial; that there is a host of problems and a mismatched interface to
> our common needs. It has surprised me that PHP, the language of the web, has
> had such trouble handling HTML, the language of the web, and we wanted to
> completely resolve this issue once and for all within WordPress so we can
> clean up decades-old problems with encoding, decoding, security, and
> sanitization.

Yes, I was also quite surprised by the lack of support for modern web
features, and by the problems with spec compliance.
I only recently got into maintaining ext/dom, so there's still a lot of work
to do.
I had already started adding more DOM APIs in the 8.3 release cycle and plan
to continue that effort in 8.4.
Another major project I want to do for 8.4, besides HTML5 support, is fixing
the spec compliance issues in an opt-in manner.
This would help with security & sanitization problems (HTML5 should help with
the encoding & decoding).

>
> Warmly,
> Dennis Snell

Kind regards
Niels

>
>> On Sep 2, 2023, at 12:41 PM, Niels Dossche <dossche.ni...@gmail.com> wrote:
>>
>> I'm opening the discussion for my RFC "DOM HTML5 parsing and serialization
>> support".
>> https://wiki.php.net/rfc/domdocument_html5_parser
>>
>> Kind regards
>> Niels