> On Sep 4, 2023, at 1:15 PM, Niels Dossche <dossche.ni...@gmail.com> wrote:
>
> On 04/09/2023 21:54, Dennis Snell wrote:
>> Thanks for the proposal Niels,
>>
>> I’ve dealt with my own grief working through issues in DOMDocument and
>> wanting it to work but finding it inadequate.
>>
>>> HTML5
>>
>> This would be a great starting point; I would love it if we took the
>> opportunity to fix named character reference decoding, as PHP has (to my
>> knowledge) never respected (at least in HTML5) that they decode differently
>> inside attributes than they do inside markup, considering rules such as the
>> ambiguous ampersand and decode errors.
>>
>> It’s also been frustrating that DOMDocument parses tags in RCDATA sections
>> where they don’t exist, such as in TITLE or TEXTAREA elements, escapes
>> certain types of invalid comments so that they appear rendered in the saved
>> document, and misses basic semantic rules (e.g. creating a BUTTON element as
>> a child of a BUTTON element instead of closing out the already-open BUTTON).
>
> With this proposal, which brings a real HTML5 parser, the problems mentioned
> above will fortunately be a thing of the past :)
Awesome. Makes me happy as long as we’re looking at a wholesale replacement of
the foundations upon which `DOMDocument` is built. My comment was mostly to
point out that there are levels to the inadequacy of `DOMDocument`; or phrased
differently, I support diverging from the `DOMDocument` class and parser and
even the interface. Making a break from the expectations of the existing one
could be nice to signal that it’s different, though I see that full backwards
compatibility is important to you.
>
>>
>> I’d like to share what a few of us have been working on inside
>> WordPress, which is to build a conformant streaming HTML5 parser:
>> - https://developer.wordpress.org/reference/classes/wp_html_tag_processor/
>> - https://make.wordpress.org/core/2023/08/19/progress-report-html-api/
>>
>> It’s just food for thought right now because adding HTML5 support to
>> DOMDocument would benefit everyone, but we decided we had common need in PHP
>> to work with HTML not in a DOM, but in a streaming fashion, one with very
>> little runtime overhead. My long-term plan has been to get a good grasp of
>> the interface needs and thoroughly test it within the WordPress community
>> and then propose its inclusion into PHP. It’s been incredibly handy so far,
>> and on my laptop runs at around 20 MB/s, which is not great, but good enough
>> for many needs. My naive C port runs on the same laptop at around 80 MB/s
>> and I believe that we can likely triple or quadruple that speed again if any
>> of us working on it knew how to take advantage of SIMD intrinsics.
>>
>> It tries to accomplish a few goals:
>> - be fast enough
>> - interpret HTML as an HTML5-compliant browser will
>> - find specific locations within an HTML document and then read or modify
>> them
>> - pass through any invalid HTML it encounters for the browser to
>> resolve/fix unless modifying the part of the document containing those
>> invalid constructions
>>
>
> I've seen someone link this on Reddit today, it's a really nice project!
> It reminds me of Cloudflare's lol-html, which is also a streaming parser used
> to modify and sanitize documents linearly.
> I believe this could be a great addition; it solves a different problem than
> the ext/dom extension solves, so I think the two would complement each other
> well.
Unfortunately we only found the Cloudflare project after building our “Tag
Processor” but the similarities are striking. Having this kind of interface
inside PHP would do wonders for the WordPress world, and I think it would be
great for many other projects.
>
>> I only bring up this different interface because once we started digging
>> deep into DOMDocument we found that the problems with it were far from
>> superficial; that there is a host of problems and a mismatched interface to
>> our common needs. It has surprised me that PHP, the language of the web, has
>> had such trouble handling HTML, the language of the web, and we wanted to
>> completely resolve this issue once and for all within WordPress so we can
>> clean up decades-old problems with encoding, decoding, security, and
>> sanitization.
>
> Yes, I was also quite surprised by the lack of support for modern web
> features, and also by the problems with spec compliance.
> I only recently got into maintaining ext/dom. So there's still a lot of work
> to do.
> I had already started with adding more DOM APIs in the 8.3 release cycle and
> plan to continue that effort in 8.4.
> Another major project I want to do for 8.4, besides HTML5 support, is fixing
> the spec compliance issues in an opt-in manner. This would help with security
> & sanitization problems (HTML5 should help with the encoding & decoding).
>
>>
>> Warmly,
>> Dennis Snell
>
> Kind regards
> Niels
>
>>
>>> On Sep 2, 2023, at 12:41 PM, Niels Dossche <dossche.ni...@gmail.com> wrote:
>>>
>>> I'm opening the discussion for my RFC "DOM HTML5 parsing and serialization
>>> support".
>>> https://wiki.php.net/rfc/domdocument_html5_parser
>>>
>>> Kind regards
>>> Niels
>
Impressive proposal. It will be nice to have. Did you consider any tricks for
text encoding, such as converting non-UTF-8 documents into UTF-8 before
parsing? If we did that, I wonder whether we could lean on `iconv` and avoid
carrying the extra encoding data in the library, if that’s important enough.
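
To make the question concrete, here is a rough sketch of the transcode-first
idea in C using POSIX `iconv(3)`. The helper name `to_utf8` and its signature
are hypothetical; nothing here comes from the RFC or from ext/dom, and it
assumes the source charset has already been determined somehow (BOM, HTTP
header, or `<meta>` sniffing), which this sketch does not attempt.

```c
#include <errno.h>
#include <iconv.h>
#include <stdlib.h>

/* Hypothetical helper, for illustration only: transcode `in` (in_len bytes,
 * encoded as `from_charset`) to UTF-8. Returns a malloc'd, NUL-terminated
 * buffer and stores its byte length in *out_len, or returns NULL on failure. */
char *to_utf8(const char *from_charset, const char *in, size_t in_len,
              size_t *out_len)
{
    iconv_t cd = iconv_open("UTF-8", from_charset);
    if (cd == (iconv_t)-1) {
        return NULL; /* unknown or unsupported source charset */
    }

    size_t cap = in_len * 2 + 16; /* initial guess; grown on demand below */
    char *out = malloc(cap);
    if (out == NULL) {
        iconv_close(cd);
        return NULL;
    }

    char *src = (char *)in; /* iconv() takes char **, input is not modified */
    size_t src_left = in_len;
    char *dst = out;
    size_t dst_left = cap - 1; /* keep one byte for the terminator */

    while (src_left > 0) {
        if (iconv(cd, &src, &src_left, &dst, &dst_left) != (size_t)-1) {
            break; /* all input consumed */
        }
        if (errno != E2BIG) {
            /* EILSEQ or EINVAL: invalid or truncated input sequence */
            free(out);
            iconv_close(cd);
            return NULL;
        }
        /* Output buffer is full: double it and continue where we stopped. */
        size_t used = (size_t)(dst - out);
        cap *= 2;
        char *grown = realloc(out, cap);
        if (grown == NULL) {
            free(out);
            iconv_close(cd);
            return NULL;
        }
        out = grown;
        dst = out + used;
        dst_left = cap - 1 - used;
    }

    iconv_close(cd);
    *out_len = (size_t)(dst - out);
    out[*out_len] = '\0';
    return out;
}
```

For example, `to_utf8("ISO-8859-1", "caf\xE9", 4, &n)` should produce the five
bytes `caf\xC3\xA9`, so the parser would only ever see UTF-8. This version
rejects invalid input outright; an HTML parser would more likely want to
substitute U+FFFD, which plain `iconv` does not do for you.
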
Cheers,
Dennis Snell
--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: https://www.php.net/unsub.php