On Sat, Feb 26, 2022 at 04:29:41PM +0000, Gavin Smith wrote:
> > * unless I missed something, string constants are character strings.
> > If they are to appear mostly in file names we need to encode them at
> > some point, but it does not seem easy to me to decide when, unless
> > we are sure that the string will only be considered as a byte
> > sequence from then on.
> 
> If string constants in the Perl source code are purely ASCII then there
> is no problem.  They can be used in error messages, inside the output
> files, or used to open files on the filesystem.
They cannot be used to open files on the filesystem if they are combined
with encoded strings: either everything has to be combined as character
strings and then encoded, or everything has to be encoded first and then
combined as byte sequences.

> For example, in HTML.pm, TOP_FILE is set as 'index.html'. This can
> be used in hyperlinks to that file as well as to create it.

I think that this is not enough.  It seems to me that if an ASCII string
is combined with non-ASCII encoded byte strings, the encoded string can
still be 'upgraded', which corrupts the encoding, because of some mixing
with a decoded string, even an ASCII one.  I have not really understood
when this happens and when it does not; I cannot really tell why there
are problems in some cases and not in others.  It is also not documented
very precisely, probably because the Perl developers do not want to be
tied to an implementation.  The most explicit statement I found is here,
and it is not that explicit:

https://perldoc.perl.org/perlunifaq#What-if-I-don't-decode?

and the next question, about not encoding, where there is:

"Because the internal format is often UTF-8, these bugs are hard to
spot, because UTF-8 is usually the encoding you wanted!  But don't be
lazy, and don't use the fact that Perl's internal format is UTF-8 to
your advantage.  Encode explicitly to avoid weird bugs, and to show to
maintenance programmers that you thought this through."

To follow that advice, we need to decode and encode everything, even if
it is ASCII.
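The upgrading problem can be reproduced in a few lines.  This is a
minimal sketch, not texi2any code; the strings are invented for
illustration:

```perl
# Minimal sketch of the mixing problem: combining a decoded character
# string with an encoded byte string upgrades the bytes as if they were
# Latin-1, so encoding the result double-encodes the non-ASCII part.
use strict;
use warnings;
use Encode qw(decode encode);

# "encodé" as encoded UTF-8 bytes (as it might arrive from @ARGV).
my $bytes = "encod\xc3\xa9";

# The same name as a decoded character string (é is one character).
my $chars = decode('UTF-8', $bytes);

# Mixing the two: the byte string is upgraded byte-by-byte, and the
# final encode() double-encodes its é.
my $mixed = encode('UTF-8', $chars . '/' . $bytes);
# $mixed is now "encod\xc3\xa9/encod\xc3\x83\xc2\xa9": the second é is
# broken, exactly the kind of weird bug perlunifaq warns about.

# Decoding everything and encoding once at the boundary avoids this.
my $safe = encode('UTF-8', $chars . '/' . $chars);
# $safe is "encod\xc3\xa9/encod\xc3\xa9", as intended.
```

Note that the ASCII-only separator '/' plays no role here; it is the
mere presence of one decoded operand that triggers the upgrade of the
byte-string operand.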
As an example, in the following there are only ASCII strings, except for
the -o encodé/ argument, which is not decoded, and the result is that
the é in encodé ends up incorrectly output:

$ cat test_smth.texi
\input texinfo
@setfilename test_smth.info
@top top
@node Top
@bye
$ ./texi2any.pl -o encodé/ test_smth.texi
$ ls -d encod*
encodé

It might be possible to fix this issue by finding all the places where
the string associated with the SUBDIR or OUTPUT customization variable
interacts with other strings, encoding all the strings it interacts
with, and also re-decoding them when needed for error messages or for
inclusion in output documents.  However, the other option, decoding
everything and encoding only when we need to interact with the world
outside the code, seems to me to be much simpler, to require much less
time and thought, and to be much less error-prone.

> > * many strings are used both in file names and in texts.  For example
> > the customization variable 'EXTENSION'.  Even strings that are almost
> > only used as bytes can appear in error messages, which means that we
> > need to keep the information somewhere on how to decode them.
> 
> It is no problem as long as the EXTENSION string is purely ASCII.

I do not think so.  I think that it needs to be encoded if it is mixed
with non-ASCII strings.  (Also, it could be set to something non-ASCII
as a customization, but that should be pretty rare.)

> > * many strings can come from documents, as character strings, or from
> > the command line, possibly kept encoded.  For example the document
> > file name can come from @setfilename or the command line (or a
> > customization variable).
> 
> This is a bigger problem as the filename could be non-ASCII, unlike
> the extension.
> 
> I will try to understand the code and run some tests after I install
> a non-UTF-8 locale.

You don't need a non-UTF-8 locale for the issue above, nor for the issue
that prompted me to look seriously at all this, which is
tests/formatting/list-of-tests non_ascii_test_epub.
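The "decode everything, encode at the boundary" option could look
roughly like the following for the -o argument.  This is a hedged
sketch: the helper names and the fixed 'UTF-8' codeset are assumptions
for illustration (the codeset would normally come from the locale, e.g.
via I18N::Langinfo's CODESET), not the actual texi2any option handling:

```perl
# Sketch of decoding command-line arguments on input and encoding file
# names on output.  decode_cmdline_args() and output_dir_bytes() are
# hypothetical helpers, not texi2any functions.
use strict;
use warnings;
use Encode qw(decode encode);

# Decode every argument once, as early as possible, so that the rest of
# the program only ever sees character strings.
sub decode_cmdline_args {
    my ($codeset, @args) = @_;
    return map { decode($codeset, $_) } @args;
}

# Encode a character-string directory name just before it is used in a
# filesystem call such as mkdir().
sub output_dir_bytes {
    my ($codeset, $dir) = @_;
    return encode($codeset, $dir);
}

# Typical use:
#   my @argv = decode_cmdline_args('UTF-8', @ARGV);
#   ... option parsing sets $outdir from @argv ...
#   mkdir output_dir_bytes('UTF-8', $outdir);
```

With this layout, the SUBDIR/OUTPUT value is a character string
everywhere inside the program, so it can be mixed freely with strings
from the document or with error message text, and only the final mkdir
sees bytes.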
Having an accented letter in the document name makes it very hard to
determine what should be encoded or decoded in init/epub3.pm and in the
upstream code, in particular in determine_files_and_directory() in
Texinfo/Convert/Converter.pm.  Although I previously thought it could be
solved in that function alone, it is not so simple: strings come from
everywhere in init/epub3.pm.

> > * it is much simpler to require customization variables from init
> > files to be character strings, which means that we need an API to
> > encode those we want to mix with bytes, and we cannot do this early,
> > so it means more complexity.
> >
> > For all those reasons, I really think that we should use character
> > strings almost everywhere and encode when needed, such that there is
> > no need to track down where a string comes from to be sure whether it
> > is encoded or not.  We already decode and encode in many places, as
> > we have file names used in error messages combined with character
> > strings, and character strings from Texinfo manuals that need to be
> > encoded.  The gain of avoiding decoding and encoding a few strings is
> > not worth, in my opinion, the complexity of having strings that
> > cannot be mixed.
> >
> > In some cases we can still decide to deal with encoded strings, but I
> > think that it should only be when we are sure that they will not ever
> > be mixed with decoded character strings.
> 
> I hope the complexity in dealing with filename encodings can be kept to
> a minimum.  Doing it the way you say might be simpler but we should
> check that a few use cases worked.  I want to see if any issues can be
> fixed with the existing approach.
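To make the comparison concrete: keeping some strings encoded means
every file name has to carry its byte form, its character form, and its
encoding, so that filesystem code and message code can each pick the
right one.  A hypothetical sketch of that bookkeeping (none of these
names exist in Converter.pm or epub3.pm):

```perl
# Hypothetical bookkeeping for the "keep some strings encoded" option:
# a file name record with both forms plus the encoding needed to
# re-decode it later for error messages or output documents.
use strict;
use warnings;
use Encode qw(decode encode);

sub make_file_name_info {
    my ($char_name, $encoding) = @_;
    return {
        name     => $char_name,                     # for messages, HTML output
        bytes    => encode($encoding, $char_name),  # for open()/mkdir()
        encoding => $encoding,                      # to re-decode if needed
    };
}

# Any code joining such a name with another string has to pick the right
# field; getting this wrong anywhere reintroduces the mixing bug, which
# is exactly the per-string tracking work described above.
sub join_in_message {
    my ($info, $text) = @_;
    return $info->{name} . ': ' . $text;            # character strings only
}
```

Every string flowing into epub3.pm would need to be wrapped this way,
which illustrates why following each string's origin became too much
work.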
If you can solve the two issues above with the existing approach, I
could reconsider my position.  But I tried to fix init/epub3.pm, and I
started to follow the origin of every string, and it became too much
work; in addition, for many strings it was hard to decide whether it
would be best to encode them or not, already in epub3.pm alone, without
considering something more complex like HTML.pm + Converter.pm.

-- 
Pat
