> Date: Wed, 3 Aug 2022 14:36:58 -0700
> From: Per Bothner <[email protected]>
>
> On 8/3/22 13:46, Patrice Dumas wrote:
> > This is not what we do in general for html/xhtml.  For epub we always
> > emit utf8, as it is mandated by the standard, but for html/xhtml, we
> > use, in the default case, the input encoding for the output encoding.
>
> I think that is a mistake.
> It seems clear that in 2022 all publicly-visible html pages (i.e. on a
> public web server) should use utf8.
> It is also clear that a practical html-reading program is able to read
> utf8-encoded html files (assuming a correct charset declaration),
> regardless of the local character encoding, even for local file: urls
> or an internal web-server.
> Ergo, always emitting utf8 (with a charset declaration) is safer and
> very unlikely to lead to problems. while using a native or input-base
> encoding is fragile and dangerous.

Isn't the main issue here about encoding _file_names_, with the
encoding of HTML being secondary?  I mean the file names we produce
from Texinfo sources, for files that are part of the output from
texi2any processing.

Encoding file names in UTF-8 is not always a good idea.  At least on
MS-Windows, that is currently not supported; the program (in this
case, Perl and its extensions written in C) needs to either (a)
convert UTF-8 to UTF-16 and then call the "wide" APIs that accept
wchar_t strings, or (b) convert to the system codepage (which could
be lossy).  Otherwise, functions that call 'open', 'fopen' and the
like will fail or will produce garbled file names.

On other systems, if the locale's codeset is not UTF-8 (which is
indeed rare nowadays, but not non-existent), encoding file names in
UTF-8 will produce files whose names are unreadable by human users in
applications that manipulate file names.

So if we agree that the encoding of the file names we produce should
not always be UTF-8, the next question is: how to encode those names
in the produced Texinfo output when we need to reference such a file.
It is possible to use an encoding in the produced output that is
different from the actual encoding of file names on disk, but AFAIU
the issue at hand was about the former first.
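
To make the file-name concern concrete, here is a rough Python sketch.
It is only an illustration and not what texi2any does: the helper name
encode_file_name and the sample file names are made up, and the codeset
is passed in explicitly rather than queried from the system.

```python
from typing import Tuple


def encode_file_name(name: str, codeset: str) -> Tuple[bytes, bool]:
    """Encode a file name in the given codeset.

    Hypothetical helper: returns the encoded bytes plus a flag that
    tells whether the conversion was lossy (some characters could not
    be represented and were replaced with '?').
    """
    try:
        return name.encode(codeset), False
    except UnicodeEncodeError:
        return name.encode(codeset, errors="replace"), True


# A name that the target codeset can represent converts cleanly.
latin1_bytes, latin1_lossy = encode_file_name("café.html", "latin-1")

# The same name in UTF-8 uses different bytes on disk.
utf8_bytes, utf8_lossy = encode_file_name("café.html", "utf-8")

# Reading UTF-8 bytes as if they were in a non-UTF-8 codeset gives
# the garbled name a user would see in a file manager.
garbled = utf8_bytes.decode("latin-1")

# Converting to a codepage that lacks the characters is lossy.
cp_bytes, cp_lossy = encode_file_name("日本.html", "cp1252")
```

The lossy flag corresponds to option (b) above: a conversion to the
system codepage can succeed only by discarding characters, so a caller
would need to decide whether to accept the mangled name or fall back
to something else.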
