Re: url protection

Per Bothner Wed, 03 Aug 2022 14:37:39 -0700

On 8/3/22 13:46, Patrice Dumas wrote:

This is not what we do in general for html/xhtml.  For epub we always
emit utf8, as it is mandated by the standard, but for html/xhtml, we
use, in the default case, the input encoding for the output encoding.


I think that is a mistake.
It seems clear that in 2022 all publicly-visible html pages (i.e. on a public
web server) should use utf8.
It is also clear that a practical html-reading program is able to read
utf8-encoded html files (assuming a correct charset declaration), regardless
of the local character encoding, even for local file: urls or an internal
web-server.
Ergo, always emitting utf8 (with a charset declaration) is safer and very
unlikely to lead to problems, while using a native or input-based encoding is
fragile and dangerous.

The conversion should not have already been done at that point; we still
have character strings in internal perl unicode encoding.  But that was
not really my question; my question was more on whether we should use the
output encoding to encode the string before doing the URI::Escape call, or
always use UTF-8, even if the document encoding is not UTF-8.


The question is irrelevant: we should always emit utf8 in both urls and in the
body of html/xhtml files.  That should certainly be the default (regardless of
native or input encoding) - and it is almost certainly a waste of time to
support anything else.
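
To make the difference concrete, here is a minimal sketch of what "encode,
then percent-escape" produces under each choice. It is written in Python
rather than the Perl used by the converter, and the helper name
escape_url_component is hypothetical; it only illustrates the encoding
question and is not the actual implementation.

```python
from urllib.parse import quote, unquote


def escape_url_component(text: str, encoding: str = "utf-8") -> str:
    """Encode text to bytes with `encoding`, then percent-escape the bytes.

    Hypothetical helper for illustration only; UTF-8 is the default.
    """
    return quote(text, safe="", encoding=encoding)


name = "café"

# Always-UTF-8 policy: the non-ASCII character becomes two escaped bytes.
utf8_form = escape_url_component(name)

# Output-encoding policy with a Latin-1 document: one escaped byte.
latin1_form = escape_url_component(name, "latin-1")

print(utf8_form)
print(latin1_form)

# A consumer that assumes UTF-8 (the default for unquote) recovers the
# original text only from the UTF-8 form.
print(unquote(utf8_form) == name)
print(unquote(latin1_form) == name)
```

The two policies yield different urls for the same node name, and only the
UTF-8 form round-trips through a decoder that assumes UTF-8.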

Here is another datapoint:
https://en.wikipedia.org/wiki/Internationalized_Resource_Identifier#Compatibility
--
        --Per Bothner
[email protected]   http://per.bothner.com/
