Re: Non-ASCII characters in @include search path

Eli Zaretskii Wed, 23 Feb 2022 11:52:20 -0800

> From: Gavin Smith <[email protected]>
> Date: Wed, 23 Feb 2022 19:31:52 +0000
> 
> Whatever we do, it should be concordant with TeX's filename handling.
> I imagine that TeX (except possibly on MS-Windows) would just use the
> bytes, so so should we.


AFAIK, TeX uses bytes everywhere.

> In any case the cases we are dealing with are very rare here, but I just
> don't see that the situation is very common where somebody works in
> a non-UTF-8 locale, has all their filenames in this encoding, and
> recodes any files they download from the Internet or extracted from a tar
> file into that encoding.

If the file names are non-ASCII, the _only_ reasonable way of
downloading them is to recode their names.  Otherwise, you will get
garbled names at best, and at worst (on MS-Windows) can have the file
names rejected by the OS, i.e. you will be unable to unpack the
downloaded archive or to save locally the fetched file.

> It seems much more likely to me that somebody would be using a
> non-UTF-8 locale for whatever reason, and would download Texinfo
> files with UTF-8 names without recoding the names, and still
> expect to be able to build them.

This might simply fail on MS-Windows, if the UTF-8 byte sequences
include bytes that don't exist in the locale's encoding (a.k.a. "ANSI
codepage").  It will definitely produce garbled file names, and might
also break makeinfo.
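
To illustrate the "bytes that don't exist" case, here is a minimal
sketch in Python.  The file name and the choice of codepage 1252 are
assumptions made up for the illustration, and Python's strict decoder
only stands in for the OS conversion, which may substitute a character
rather than report an error.

```python
# Hypothetical file name; "Ё" is U+0401, whose UTF-8 form is D0 81.
name = "Ёлка.texi"
utf8_bytes = name.encode("utf-8")


def decodable(raw, codepage):
    """Return True if every byte of RAW is defined in CODEPAGE."""
    try:
        raw.decode(codepage)
        return True
    except UnicodeDecodeError:
        return False


# Byte 0x81 is undefined in Python's cp1252 table, so strict decoding fails.
ok_in_cp1252 = decodable(utf8_bytes, "cp1252")

# KOI8-R assigns a character to every byte, so decoding "succeeds",
# but the result is not the intended name.
ok_in_koi8r = decodable(utf8_bytes, "koi8_r")

# With errors="replace", the undefined byte becomes U+FFFD and the rest
# of the name is garbled.
garbled = utf8_bytes.decode("cp1252", errors="replace")
```

Whether the real API call fails or silently yields a garbled name in
such a case depends on the codepage and on the conversion the OS
performs; the sketch only shows that both outcomes are possible.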

> E.g. - UTF-8 Texinfo file, processed under KOI-8 locale on Windows,
> accessing filenames named with UTF-16 filenames on Windows filesystem.
> Then the UTF-8 filenames would be encoded to KOI-8, and then some file
> access layer would convert the KOI-8 to UTF-16 and find the filenames.
> Is that how it works or am I way off?

Are you describing what we will do in makeinfo, or are you describing
how the current makeinfo, which doesn't re-encode file names, works?

If the latter, then Windows file-related APIs will assume that the
file names we pass to them (taken from the Texinfo source's @include
or @image directives) are KOI-8 encoded, and will attempt to convert
the UTF-8 byte sequences to UTF-16 as if they were KOI-8 encoded.  The
results will never be pretty, and if some byte doesn't exist in the
KOI-8 encoding, the conversion will yield a question mark '?' or a
space character; in the former case, the API call will likely fail
because '?' is not allowed in Windows file names.
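
The misinterpretation described above can be sketched in Python as
follows.  The file name is hypothetical, and Python's "koi8_r" codec
is used here as a stand-in for whatever the system codepage conversion
does; this is an illustration, not a description of makeinfo's code.

```python
# Hypothetical name as written in a UTF-8 encoded Texinfo source.
intended = "глава.texi"

# Without re-encoding, the UTF-8 bytes are passed to the OS unchanged.
raw = intended.encode("utf-8")

# The OS assumes the bytes are in the locale's encoding (KOI8-R here)
# and converts them to its internal form on that assumption.
misread = raw.decode("koi8_r")

# Each Cyrillic letter took two bytes in UTF-8, and each byte is now
# treated as one character, so the name the OS looks for is both longer
# and different from the intended one.
name_differs = misread != intended

# Re-encoding the name to the locale's encoding first would make the
# OS-side conversion arrive at the intended name.
recoded = intended.encode("koi8_r")
round_trip = recoded.decode("koi8_r")
```

Here `round_trip` equals `intended`, while `misread` does not, which
is the difference between re-encoding the file names and passing the
source bytes through untouched.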
