Hi Tim

On 29/09/2023 18:06, Tim Düsterhus wrote:
> Hi
> 
> On 9/29/23 17:45, Niels Dossche wrote:
>> Right, we follow the HTML spec in this regard. Roughly speaking we determine 
>> the charset in the following order of priorities.
>> If one option fails, it will fall through to the next one.
>> 1. The Content-Type HTTP header from which you loaded the document.
> 
> How would the new document classes make use of that? The HTTP header is 
> transmitted out-of-band with regard to the actual payload.
> 
> Is this referring to passing a `http://` path to 
> HTMLDocument::createFromFile()? This would be unusable for everyone who 
> manually downloads the document, e.g. using a PSR-18 HTTP Client.

Yes. When the stream wrapper exposes header information, that information is 
used.
Unfortunately, that does mean the header cannot be taken into account when you 
download the document yourself and pass the content in manually.
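
For anyone downloading manually, the charset has to be pulled out of the 
Content-Type header by hand. A minimal sketch of that step follows; the 
function name charsetFromContentType() is made up for illustration and is not 
part of ext/dom, and the regex is a simplification rather than a full header 
parser.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of ext/dom): extract the charset parameter
 * from a Content-Type header value. Returns null when there is none.
 */
function charsetFromContentType(string $header_value): ?string
{
    // Accepts charset=x, charset="x" and charset='x', case-insensitively.
    $pattern = '/;\s*charset\s*=\s*["\']?([^"\';\s]+)/i';
    if (preg_match($pattern, $header_value, $matches) === 1) {
        return $matches[1];
    }

    return null;
}
```

With a PSR-18 client the input would typically come from 
`$response->getHeaderLine('Content-Type')`.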

> 
> It might actually be necessary to add an encoding parameter to these 
> functions, but it would need to take priority over anything implicit. The 
> current $encoding of the global \DOMDocument has the problem that it doesn't 
> take priority/is ignored entirely. Manually converting the document to UTF-8 
> before passing it to \DOMDocument has the problem that the meta tag in the 
> document takes priority.
> 
> In fact I've run into this issue before for the implementation of a rich 
> embed feature. We're downloading the websites using Guzzle and attempt to 
> make sense of them with \DOMDocument. However we can't reliably force the 
> encoding given within the 'content-type' response header, so in some cases we 
> obtain mojibake.
> 
> This encoding parameter would likely need to be `?string $encoding = null` 
> with everything non-null overwriting implicit detection and null meaning 
> implicit detection in the order of priorities you mentioned.

I agree.
I'll add an optional argument `?string $override_encoding = null` to 
XML/HTMLDocument::createFromString and XML/HTMLDocument::createFromFile.
I'd call it override_encoding to emphasize that it overrides the implicit 
detection.
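
To illustrate the intended precedence, here is a rough userland sketch, 
assuming the order described in this thread (override, then header, then BOM, 
then meta tag). It is not the actual implementation: resolveEncoding() and the 
$header_charset parameter are hypothetical, and the meta tag check is a crude 
regex rather than the prescan algorithm from the HTML spec.

```php
<?php
declare(strict_types=1);

/**
 * Illustrative only: decide which encoding to decode $bytes with.
 * Returns null when nothing matched, leaving the default to the caller.
 */
function resolveEncoding(
    string $bytes,
    ?string $override_encoding = null,
    ?string $header_charset = null
): ?string {
    // A non-null override wins over every implicit source.
    if ($override_encoding !== null) {
        return $override_encoding;
    }

    // 1. Charset from the Content-Type header, when one is available.
    if ($header_charset !== null) {
        return $header_charset;
    }

    // 2. BOM sniffing: UTF-8 with BOM and UTF-16 LE/BE.
    if (strncmp($bytes, "\xEF\xBB\xBF", 3) === 0) {
        return 'UTF-8';
    }
    if (strncmp($bytes, "\xFF\xFE", 2) === 0) {
        return 'UTF-16LE';
    }
    if (strncmp($bytes, "\xFE\xFF", 2) === 0) {
        return 'UTF-16BE';
    }

    // 3. Meta tag: simplified regex over the first 1024 bytes only.
    $pattern = '/<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9._:-]+)/i';
    if (preg_match($pattern, substr($bytes, 0, 1024), $matches) === 1) {
        return $matches[1];
    }

    return null;
}
```

With this shape, a caller who already converted the document to UTF-8 could 
pass 'UTF-8' as the override and the meta tag would no longer take priority.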

> 
>> 2. BOM sniffing in the content. I.e. UTF-8 with BOM and UTF-16 LE/BE prepend 
>> the content with byte markers. This is used to detect encoding.
>> 3. Meta tag in the content.
> 
> Best regards
> Tim Düsterhus

Kind regards
Niels

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: https://www.php.net/unsub.php
