Hi

On 9/29/23 17:45, Niels Dossche wrote:
> Right, we follow the HTML spec in this regard. Roughly speaking we determine
> the charset in the following order of priorities.
> If one option fails, it will fall through to the next one.
> 1. The Content-Type HTTP header from which you loaded the document.

How would the new document classes make use of that? The HTTP header is transmitted out of band with respect to the actual payload.

Is this referring to passing a `http://` path to HTMLDocument::createFromFile()? This would be unusable for everyone who manually downloads the document, e.g. using a PSR-18 HTTP Client.
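
For illustration, this is roughly the PSR-18 situation (a sketch; $client and $request stand in for whatever the application uses):

```php
$response = $client->sendRequest($request); // Psr\Http\Client\ClientInterface

// The charset is available in the out-of-band header ...
$contentType = $response->getHeaderLine('content-type');
// e.g. "text/html; charset=iso-8859-1"
$charset = null;
if (preg_match('/charset="?([^";\s]+)"?/i', $contentType, $matches)) {
    $charset = $matches[1];
}

$html = (string) $response->getBody();
// ... but there is currently no way to pass $charset on to the parser.
```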

It might actually be necessary to add an encoding parameter to these functions, but it would need to take priority over anything implicit. The $encoding property of the existing \DOMDocument has the problem that it doesn't take priority; in fact, it is ignored entirely during parsing. Manually converting the document to UTF-8 before passing it to \DOMDocument doesn't help either, because a meta tag in the document still takes priority.
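
Both failure modes can be reproduced with a minimal sketch like this:

```php
<?php
// (1) The $encoding property is ignored while parsing: these UTF-8
// bytes are misinterpreted as ISO-8859-1 despite the property.
$dom = new \DOMDocument();
$dom->encoding = 'UTF-8';
$dom->loadHTML('<p>Düsterhus</p>'); // "ü" comes out as "Ã¼"

// (2) Converting up front doesn't help when the document carries a
// meta tag. Simulate an ISO-8859-1 payload as received over the wire:
$source = mb_convert_encoding(
    '<meta charset="iso-8859-1"><p>Düsterhus</p>',
    'ISO-8859-1',
    'UTF-8'
);
// Manual conversion to UTF-8 before parsing ...
$utf8 = mb_convert_encoding($source, 'UTF-8', 'ISO-8859-1');
// ... yet the meta tag still takes priority over the actual bytes:
$dom2 = new \DOMDocument();
$dom2->loadHTML($utf8); // parsed as ISO-8859-1 again -> mojibake
```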

In fact, I've run into this issue before when implementing a rich embed feature. We download the websites using Guzzle and attempt to make sense of them with \DOMDocument. However, we can't reliably force the encoding given in the 'content-type' response header, so in some cases we end up with mojibake.

This encoding parameter would likely need to be `?string $encoding = null`, with any non-null value overriding implicit detection and null meaning implicit detection in the order of priorities you mentioned.
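
In code, the intended behaviour would look something like this (hypothetical; assuming a createFromString() counterpart to createFromFile(), with $charset extracted from the header as sketched above):

```php
// Force the out-of-band charset, overriding BOM and meta tag:
$document = HTMLDocument::createFromString($html, encoding: $charset);

// Implicit detection (BOM, then meta tag) when nothing better is known:
$document = HTMLDocument::createFromString($html);
```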

> 2. BOM sniffing in the content. I.e. UTF-8 with BOM and UTF-16 LE/BE prepend
> the content with byte markers. This is used to detect encoding.
> 3. Meta tag in the content.
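
For reference, step 2 boils down to roughly the following (a simplified sketch; the actual implementation follows the HTML spec's encoding sniffing algorithm):

```php
function sniffBom(string $bytes): ?string
{
    return match (true) {
        str_starts_with($bytes, "\xEF\xBB\xBF") => 'UTF-8',
        str_starts_with($bytes, "\xFE\xFF")     => 'UTF-16BE',
        str_starts_with($bytes, "\xFF\xFE")     => 'UTF-16LE',
        default => null, // fall through to step 3, the meta tag
    };
}
```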

Best regards
Tim Düsterhus

