On 8/3/22 13:46, Patrice Dumas wrote:
> This is not what we do in general for html/xhtml. For epub we always emit utf8, as it is mandated by the standard, but for html/xhtml, we use, in the default case, the input encoding for the output encoding.
I think that is a mistake. It seems clear that in 2022 all publicly-visible html pages (i.e. on a public web server) should use utf8. It is also clear that any practical html-reading program is able to read utf8-encoded html files (assuming a correct charset declaration), regardless of the local character encoding, even for local file: urls or an internal web-server. Ergo, always emitting utf8 (with a charset declaration) is safer and very unlikely to lead to problems, while using a native or input-based encoding is fragile and dangerous.
> The conversion should not have already been done at that point, we still have character strings in internal perl unicode encoding. But that was not really my question, my question was more on whether we should use the output encoding to encode the string before doing the URI::Escape call, or always use UTF-8, even if the document encoding is not UTF-8.
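To make the two options concrete, here is a minimal sketch in Python (not the Perl code under discussion; `escape_for_url` is a hypothetical helper, and `urllib.parse.quote` stands in for the URI::Escape call). The same character string yields different urls depending on which encoding is applied before percent-escaping:

```python
from urllib.parse import quote


def escape_for_url(text: str, encoding: str = "utf-8") -> str:
    """Encode text to bytes in `encoding`, then percent-escape those bytes.

    Hypothetical helper for illustration only.
    """
    # safe="" escapes every reserved character, including "/".
    return quote(text.encode(encoding), safe="")


name = "café"
print(escape_for_url(name))             # caf%C3%A9  (UTF-8 bytes)
print(escape_for_url(name, "latin-1"))  # caf%E9     (ISO-8859-1 bytes)
```

A consumer that assumes utf8 will decode only the first form correctly, which is why the choice of encoding before escaping matters.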
The question is irrelevant: we should always emit utf8 in both urls and in the body of html/xhtml files. That should certainly be the default (regardless of native or input encoding) - and it is almost certainly a waste of time to support anything else. Here is another datapoint: https://en.wikipedia.org/wiki/Internationalized_Resource_Identifier#Compatibility
--
	--Per Bothner
[email protected]   http://per.bothner.com/
