On Sun, Feb 20, 2022 at 11:54:08AM +0000, Gavin Smith wrote:
> I found it was the last argument to File::Spec->catdir that led to the
> utf8 flag being on: $filename. This came from the argument to
> locate_include_file, which came from the Texinfo source file. The following
> also fixes it:
I do not think that the fact that it is utf8 is important; I believe that
is an internal design choice in perl. What matters is that the string is in
the internal perl unicode encoding.

> diff --git a/tp/Texinfo/Common.pm b/tp/Texinfo/Common.pm
> index 29dbf3c8c3..36be8c5b59 100644
> --- a/tp/Texinfo/Common.pm
> +++ b/tp/Texinfo/Common.pm
> @@ -1507,6 +1507,8 @@ sub locate_include_file($$)
>    my $text = shift;
>    my $file;
>
> +  utf8::downgrade($text);
> +
>    my $ignore_include_directories = 0;
>
>    my ($volume, $directories, $filename) = File::Spec->splitpath($text);
>
> This may be surprising as the non-ASCII characters were not in $text itself:
> $text was just "include.texi". The non-ASCII characters in the include path
> got to this function without the utf8 flag going on.

Again, I do not think that we should rely on the specific encoding of a
string. We should only track whether it is an internal perl unicode string
or bytes.

> Strings coming from the Texinfo source file have to be assumed to represent
> characters, not bytes, as the Texinfo source is read with a certain encoding.
> File names, however, are a sequence of bytes (on GNU/Linux at least; on
> MS-Windows it may be different). I believe it's this conflict
> that is responsible.

I agree, that's also my interpretation. It is the same on MS-Windows.

> I propose the following fix, which doesn't touch Perl's internal string
> representation directly:
>
> diff --git a/tp/Texinfo/Common.pm b/tp/Texinfo/Common.pm
> index 29dbf3c8c3..7babba016c 100644
> --- a/tp/Texinfo/Common.pm
> +++ b/tp/Texinfo/Common.pm
> @@ -1507,6 +1507,8 @@ sub locate_include_file($$)
>    my $text = shift;
>    my $file;
>
> +  utf8::encode($text);
> +
>    my $ignore_include_directories = 0;
>
>    my ($volume, $directories, $filename) = File::Spec->splitpath($text);
>
> This means that any non-ASCII characters in a filename in a Texinfo source
> file are sought in the filesystem as the corresponding UTF-8 sequences.

I think that the correct way to do that is to use
Encode::encode('utf-8', $text). Also, I think that it should be done as late
as possible, so it would be better to do it on $possible_file (a rough
sketch is at the end of this message).

> A more thorough fix would obey @documentencoding and convert back to the
> original encoding, to retrieve the bytes that were present in the source
> file in case the file was not in UTF-8. I think it would be the most
> correct to always use the exact bytes that were in the source file as the
> name of the file (I assume that is what TeX would do).

I do not think so, at least not on Linux, as on Linux file names are always
encoded in UTF-8, so encoding in UTF-8 seems to always be better. It also
matches the XS parser, which converts to UTF-8. This may be incorrect on
other platforms, such as MS-Windows or Mac, however.
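Here is the rough, untested sketch I mentioned above. The helper name and
the surrounding code are simplified for illustration, this is not the actual
locate_include_file code; the only point is that the file name stays as perl
characters while the path is assembled and is converted to bytes at the last
moment, on the string actually passed to the filesystem:

  use strict;
  use warnings;
  use Encode ();
  use File::Spec;

  # Look for $filename (a perl character string coming from the parsed
  # Texinfo source) in the include directories.  The conversion to UTF-8
  # bytes happens only on the full candidate path, right before the
  # filesystem tests.
  sub find_include_file {
    my ($filename, $include_directories) = @_;
    foreach my $dir (@$include_directories) {
      my $possible_file = File::Spec->catfile($dir, $filename);
      # convert from the internal perl unicode encoding to UTF-8 bytes
      my $possible_file_bytes = Encode::encode('utf-8', $possible_file);
      return $possible_file_bytes
        if -e $possible_file_bytes and -r $possible_file_bytes;
    }
    return undef;
  }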

-- 
Pat