Hi

On 9/29/23 17:45, Niels Dossche wrote:
> Right, we follow the HTML spec in this regard. Roughly speaking we determine
> the charset in the following order of priorities.
> If one option fails, it will fall through to the next one.
> 1. The Content-Type HTTP header from which you loaded the document.

How would the new document classes make use of that? The HTTP header is transmitted out of band with respect to the actual payload.

Is this referring to passing a `http://` path to HTMLDocument::createFromFile()? This would be unusable for everyone who manually downloads the document, e.g. using a PSR-18 HTTP Client.
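
For illustration, this is roughly the PSR-18 situation (a sketch; $client and $request stand in for whatever the application uses):

```php
$response = $client->sendRequest($request); // Psr\Http\Client\ClientInterface

// The charset is available in the out-of-band header ...
$contentType = $response->getHeaderLine('content-type');
// e.g. "text/html; charset=iso-8859-1"
$charset = null;
if (preg_match('/charset="?([^";\s]+)"?/i', $contentType, $matches)) {
    $charset = $matches[1];
}

$html = (string) $response->getBody();
// ... but there is currently no way to pass $charset on to the parser.
```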

It might actually be necessary to add an encoding parameter to these functions, but it would need to take priority over anything implicit. The $encoding property of the existing \DOMDocument has the problem that it doesn't take priority; in fact, it is ignored entirely during parsing. Manually converting the document to UTF-8 before passing it to \DOMDocument doesn't help either, because a meta tag in the document still takes priority.
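
Both failure modes can be reproduced with a minimal sketch like this:

```php
<?php
// (1) The $encoding property is ignored while parsing: these UTF-8
// bytes are misinterpreted as ISO-8859-1 despite the property.
$dom = new \DOMDocument();
$dom->encoding = 'UTF-8';
$dom->loadHTML('<p>Düsterhus</p>'); // "ü" comes out as "Ã¼"

// (2) Converting up front doesn't help when the document carries a
// meta tag. Simulate an ISO-8859-1 payload as received over the wire:
$source = mb_convert_encoding(
    '<meta charset="iso-8859-1"><p>Düsterhus</p>',
    'ISO-8859-1',
    'UTF-8'
);
// Manual conversion to UTF-8 before parsing ...
$utf8 = mb_convert_encoding($source, 'UTF-8', 'ISO-8859-1');
// ... yet the meta tag still takes priority over the actual bytes:
$dom2 = new \DOMDocument();
$dom2->loadHTML($utf8); // parsed as ISO-8859-1 again -> mojibake
```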

In fact, I've run into this issue before when implementing a rich embed feature. We download the websites using Guzzle and attempt to make sense of them with \DOMDocument. However, we can't reliably force the encoding given in the 'content-type' response header, so in some cases we end up with mojibake.

This encoding parameter would likely need to be `?string $encoding = null`, with any non-null value overriding implicit detection and null meaning implicit detection in the order of priorities you mentioned.
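
In code, the intended behaviour would look something like this (hypothetical; assuming a createFromString() counterpart to createFromFile(), with $charset extracted from the header as sketched above):

```php
// Force the out-of-band charset, overriding BOM and meta tag:
$document = HTMLDocument::createFromString($html, encoding: $charset);

// Implicit detection (BOM, then meta tag) when nothing better is known:
$document = HTMLDocument::createFromString($html);
```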

> 2. BOM sniffing in the content. I.e. UTF-8 with BOM and UTF-16 LE/BE prepend
> the content with byte markers. This is used to detect encoding.
> 3. Meta tag in the content.
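
For reference, step 2 boils down to roughly the following (a simplified sketch; the actual implementation follows the HTML spec's encoding sniffing algorithm):

```php
function sniffBom(string $bytes): ?string
{
    return match (true) {
        str_starts_with($bytes, "\xEF\xBB\xBF") => 'UTF-8',
        str_starts_with($bytes, "\xFE\xFF")     => 'UTF-16BE',
        str_starts_with($bytes, "\xFF\xFE")     => 'UTF-16LE',
        default => null, // fall through to step 3, the meta tag
    };
}
```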

Best regards
Tim Düsterhus

