Hi Tim

On 29/09/2023 18:06, Tim Düsterhus wrote:
> Hi
>
> On 9/29/23 17:45, Niels Dossche wrote:
>> Right, we follow the HTML spec in this regard. Roughly speaking we determine
>> the charset in the following order of priorities.
>> If one option fails, it will fall through to the next one.
>> 1. The Content-Type HTTP header from which you loaded the document.
>
> How would the new document classes make use of that? The HTTP header is
> transmitted out-of-band with regard to the actual payload.
>
> Is this referring to passing a `http://` path to
> HTMLDocument::createFromFile()? This would be unusable for everyone who
> manually downloads the document, e.g. using a PSR-18 HTTP Client.

When the stream wrapper provides header information, that information is indeed used.
Unfortunately, that does mean the header is unusable when the document is passed in manually.

>
> It might actually be necessary to add an encoding parameter to these
> functions, but it would need to take priority over anything implicit. The
> current $encoding of the global \DOMDocument has the problem that it doesn't
> take priority/is ignored entirely. Manually converting the document to UTF-8
> before passing it to \DOMDocument has the problem that the meta tag in the
> document takes priority.
>
> In fact I've run into this issue before for the implementation of a rich
> embed feature. We're downloading the websites using Guzzle and attempt to
> make sense of them with \DOMDocument. However we can't reliably force the
> encoding given within the 'content-type' response header, so in some cases we
> obtain mojibake.
>
> This encoding parameter would likely need to be `?string $encoding = null`
> with everything non-null overwriting implicit detection and null meaning
> implicit detection in the order of priorities you mentioned.

I agree. I'll add the optional argument `?string $override_encoding = null` to
XML/HTMLDocument::createFromString and XML/HTMLDocument::createFromFile.
I'd call it override_encoding to emphasize that it overrides the implicit detection.

>
>> 2. BOM sniffing in the content. I.e. UTF-8 with BOM and UTF-16 LE/BE prepend
>> the content with byte markers. This is used to detect encoding.
>> 3. Meta tag in the content.
>
> Best regards
> Tim Düsterhus

Kind regards
Niels
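
The priority order discussed in this thread, with the proposed override checked first, can be sketched roughly as follows. This is only an illustration and not the ext/dom implementation: the function name `sketchDetectEncoding` and its parameters are hypothetical, and the meta-tag regex is a simplification of the prescan the HTML spec describes.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper, not part of ext/dom. Returns the charset label that
 * wins under the priority order from this thread, or null if nothing matches.
 */
function sketchDetectEncoding(
    string $content,
    ?string $contentTypeHeader = null,
    ?string $overrideEncoding = null
): ?string {
    // 0. An explicit override beats every implicit source.
    if ($overrideEncoding !== null) {
        return $overrideEncoding;
    }

    // 1. charset parameter of the Content-Type HTTP header, if one is known.
    if ($contentTypeHeader !== null
        && preg_match('/charset\s*=\s*"?([^\s;"]+)/i', $contentTypeHeader, $m) === 1) {
        return $m[1];
    }

    // 2. BOM sniffing: UTF-8, UTF-16 BE, UTF-16 LE.
    if (str_starts_with($content, "\xEF\xBB\xBF")) {
        return 'UTF-8';
    }
    if (str_starts_with($content, "\xFE\xFF")) {
        return 'UTF-16BE';
    }
    if (str_starts_with($content, "\xFF\xFE")) {
        return 'UTF-16LE';
    }

    // 3. Meta tag in the content (simplified pattern, assumption of this sketch).
    if (preg_match('/<meta[^>]+charset\s*=\s*["\']?([^\s"\'\/>]+)/i', $content, $m) === 1) {
        return $m[1];
    }

    // Nothing matched; a real parser would fall back to a default here.
    return null;
}
```

With a manually downloaded document (e.g. via a PSR-18 client), the caller could pass the charset from the response header as the override, which sidesteps a conflicting meta tag in the payload.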