On Wed, Aug 03, 2022 at 02:36:58PM -0700, Per Bothner wrote:
> On 8/3/22 13:46, Patrice Dumas wrote:
> > This is not what we do in general for html/xhtml.  For epub we always
> > emit utf8, as it is mandated by the standard, but for html/xhtml, we
> > use, in the default case, the input encoding for the output encoding.
>
> I think that is a mistake.
>
> It seems clear that in 2022 all publicly-visible html pages (i.e. on a
> public web server) should use utf8.
>
> It is also clear that a practical html-reading program is able to read
> utf8-encoded html files (assuming a correct charset declaration),
> regardless of the local character encoding, even for local file: urls
> or an internal web-server.
>
> Ergo, always emitting utf8 (with a charset declaration) is safer and
> very unlikely to lead to problems, while using a native or input-based
> encoding is fragile and dangerous.

I agree that UTF-8 is the way to go for the future, and the default
output encoding could be set to UTF-8 irrespective of the input encoding
for HTML, and even more so for XML-based formats.  I do not have a
specific opinion on that matter, and I defer to Gavin on it.

Also, my wild guess, although I haven't tested it, is that a browser
without any charset information should use the locale encoding for a
local file.

In any case, it does not mean that using another encoding is fragile or
dangerous.  There is a list of supported encodings in the Texinfo manual:

https://www.gnu.org/software/texinfo/manual/texinfo/html_node/_0040documentencoding.html

I think that we support them well, in a robust way, in texi2any.  If
that is not the case, it is a bug.  We always emit charset information,
too.

Also, this is quite off-topic.  We can discuss the default output
encoding for HTML, but it should not be in this thread.

> > The conversion should not have already been done at that point, we are
> > still character strings in internal perl unicode encoding.  But that was
> > not really my question, my question was more on whether we should use the
> > output encoding to encode string before doing the URI::Escape call, or
> > always use UTF-8, even if the document encoding is not UTF-8.
>
> The question is irrelevant: we should always emit utf8 in both urls and
> in the body of html/xhtml files.  That should certainly be the default
> (regardless of native or input encoding) - and it is almost certainly a
> waste of time to support anything else.

I think that we should support setting the output encoding explicitly to
a Texinfo supported encoding for a long time, even if UTF-8 becomes the
default output encoding for HTML.  I do not imagine dropping that feature
anytime soon.  This question will therefore be relevant for this setup
for a long time, too.

-- 
Pat
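
To make the percent-encoding question above concrete, here is a minimal
sketch of why the choice of encoding before escaping matters.  It is
written in Python rather than Perl and is not the texi2any code; the
helper name escape_for_href and the sample string are hypothetical and
serve only as an illustration.  The same character string yields
different escaped bytes depending on the encoding applied first.

```python
from urllib.parse import quote


def escape_for_href(text: str, encoding: str = "utf-8") -> str:
    """Encode text to bytes in the given encoding, then percent-encode
    every byte that is not an unreserved URI character.

    Hypothetical helper for illustration only.
    """
    return quote(text, safe="", encoding=encoding, errors="strict")


# A node name containing one non-ASCII character (e with acute accent).
name = "Caf\u00e9"

# Option 1: always encode to UTF-8 before escaping.
utf8_form = escape_for_href(name)

# Option 2: encode to the document's output encoding before escaping.
latin1_form = escape_for_href(name, "iso-8859-1")

print(utf8_form)    # Caf%C3%A9
print(latin1_form)  # Caf%E9
```

The two results are different URLs, so whichever option is chosen has to
be applied consistently to both the link targets and the names they
refer to.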
