> From: Gavin Smith <[email protected]>
> Date: Sun, 20 Feb 2022 11:54:08 +0000
>
> Strings coming from the Texinfo source file have to be assumed to represent
> characters, not bytes, as the Texinfo source is read with a certain encoding.
> File names, however, are a sequence of bytes (on GNU/Linux at least; on
> MS-Windows it may be different). I believe it's this conflict
> that is responsible.

File names are not bytes, they are characters as well, at least in
most cases relevant to this discussion. That some filesystems are
agnostic to the characters in the bytestream that is the file name
doesn't change the basic fact that file names are created and viewed
by humans, and humans need to see characters there.

> I propose the following fix, which doesn't touch Perl's internal string
> representation directly:
>
> diff --git a/tp/Texinfo/Common.pm b/tp/Texinfo/Common.pm
> index 29dbf3c8c3..7babba016c 100644
> --- a/tp/Texinfo/Common.pm
> +++ b/tp/Texinfo/Common.pm
> @@ -1507,6 +1507,8 @@ sub locate_include_file($$)
>    my $text = shift;
>    my $file;
>
> +  utf8::encode($text);
> +
>    my $ignore_include_directories = 0;
>
>    my ($volume, $directories, $filename) = File::Spec->splitpath($text);
>
> This means that any non-ASCII characters in a filename in a Texinfo source
> file are sought in the filesystem as the corresponding UTF-8 sequences.

This will not work on Windows.

> A more thorough fix would obey @documentencoding and convert back to the
> original encoding, to retrieve the bytes that were present in the source
> file in case the file was not in UTF-8. I think it would be the most
> correct to always use the exact bytes that were in the source file as the
> name of the file (I assume that is what TeX would do).

This assumes that the file name is encoded the same as the Texinfo
source. But that assumption is only true on the system where the
Texinfo file was written, and even there it could be false.

The only thorough solution, IMO, is to assume the file names are
encoded in the filesystem as specified by the locale's codeset. That,
too, can be false, but at least in the absolute majority of use cases
it will be true. The only better solution is to let the user specify
the file-name encoding.
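
As a rough illustration of the locale-codeset approach described above,
here is a minimal Perl sketch. It is not code from the Texinfo tree: the
helper name file_name_to_bytes and its optional user-supplied encoding
argument are hypothetical, and the fallback to UTF-8 when no usable
codeset is found is an assumption made for the sake of the example. It
relies only on the core Encode and I18N::Langinfo modules, and it does
not attempt to handle MS-Windows.

```perl
use strict;
use warnings;
use Encode qw(encode find_encoding);

# Hypothetical helper, not part of texi2any: turn a character string
# taken from the Texinfo source into the bytes to look up on disk.
sub file_name_to_bytes {
    my ($name, $user_encoding) = @_;

    # Assumption: an explicit user setting wins; otherwise ask the
    # locale for its codeset.
    my $codeset = $user_encoding;
    if (!defined $codeset) {
        # I18N::Langinfo is not available on every platform, so load
        # it lazily and tolerate failure.
        $codeset = eval {
            require I18N::Langinfo;
            I18N::Langinfo::langinfo(I18N::Langinfo::CODESET());
        };
    }

    # Assumption: fall back to UTF-8 if no codeset was found or if
    # Encode does not know the name.
    $codeset = 'UTF-8'
        unless defined $codeset && defined find_encoding($codeset);

    # FB_CROAK makes encode() die if a character cannot be represented
    # in the codeset.  With a CHECK argument, encode() may modify its
    # input, so work on a copy.
    my $copy = $name;
    return encode($codeset, $copy, Encode::FB_CROAK);
}
```

With an explicit encoding, file_name_to_bytes("caf\x{e9}.texi", 'ISO-8859-1')
gives a name containing the single byte 0xE9, while passing 'UTF-8' gives
the two bytes 0xC3 0xA9 for the same character. Called with no second
argument, the result depends on the locale in effect.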
