On Sat, Feb 26, 2022 at 04:29:41PM +0000, Gavin Smith wrote:
> > * unless I missed something, string constants are character strings.
> > If they are to appear mostly in file names we need to encode them at
> > some point, but it does not seem easy to me to decide when, unless
> > we are sure that the string will only be considered as a byte
> > sequence from then on.
> 
> If string constants in the Perl source code are purely ASCII then there
> is no problem.  They can be used in error messages, inside the output
> files, or used to open files on the filesystem.
They cannot be used to open files on the filesystem if they are combined
with encoded strings: either everything has to be combined as character
strings and then encoded, or everything has to be encoded first and then
combined as byte sequences.

> For example, in HTML.pm, TOP_FILE is set as 'index.html'. This can
> be used in hyperlinks to that file as well as to create it.

I think that this is not enough.  It seems to me that if an ASCII string
is combined with non-ASCII encoded byte strings, the encoded string can
still be 'upgraded', which corrupts the encoding, because of some mixing
with a decoded string, even an ASCII one.  I have not really understood
when this happens and when it does not; I cannot really tell why there
are problems in some cases and not in others.  It is also not documented
very precisely, probably because the Perl developers do not want to be
tied to an implementation.  The most explicit statement I found is here,
and it is not that explicit:

https://perldoc.perl.org/perlunifaq#What-if-I-don't-decode?

and the next question, about not encoding, where there is:

"Because the internal format is often UTF-8, these bugs are hard to
spot, because UTF-8 is usually the encoding you wanted!  But don't be
lazy, and don't use the fact that Perl's internal format is UTF-8 to
your advantage.  Encode explicitly to avoid weird bugs, and to show to
maintenance programmers that you thought this through."

To follow that advice, we need to decode and encode everything, even if
it is ASCII.
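The upgrading problem can be reproduced in a few lines.  This is a
minimal sketch, not texi2any code; the strings are invented for
illustration:

```perl
# Minimal sketch of the mixing problem: combining a decoded character
# string with an encoded byte string upgrades the bytes as if they were
# Latin-1, so encoding the result double-encodes the non-ASCII part.
use strict;
use warnings;
use Encode qw(decode encode);

# "encodé" as encoded UTF-8 bytes (as it might arrive from @ARGV).
my $bytes = "encod\xc3\xa9";

# The same name as a decoded character string (é is one character).
my $chars = decode('UTF-8', $bytes);

# Mixing the two: the byte string is upgraded byte-by-byte, and the
# final encode() double-encodes its é.
my $mixed = encode('UTF-8', $chars . '/' . $bytes);
# $mixed is now "encod\xc3\xa9/encod\xc3\x83\xc2\xa9": the second é is
# broken, exactly the kind of weird bug perlunifaq warns about.

# Decoding everything and encoding once at the boundary avoids this.
my $safe = encode('UTF-8', $chars . '/' . $chars);
# $safe is "encod\xc3\xa9/encod\xc3\xa9", as intended.
```

Note that the ASCII-only separator '/' plays no role here; it is the
mere presence of one decoded operand that triggers the upgrade of the
byte-string operand.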
As an example, in the following there are only ASCII strings, except for
the -o encodé/ argument, which is not decoded, and the result is that
the é in encodé ends up incorrectly output:

$ cat test_smth.texi
\input texinfo
@setfilename test_smth.info
@top top
@node Top
@bye
$ ./texi2any.pl -o encodé/ test_smth.texi
$ ls -d encod*
encodé

It might be possible to fix this issue by finding all the places where
the string associated with the SUBDIR or OUTPUT customization variable
interacts with other strings, encoding all the strings it interacts
with, and also re-decoding them when needed for error messages or for
inclusion in output documents.  However, the other option, decoding
everything and encoding only when we need to interact with the world
outside the code, seems to me to be much simpler, to require much less
time and thought, and to be much less error-prone.

> > * many strings are used both in file names and in texts.  For example
> > the customization variable 'EXTENSION'.  Even strings that are almost
> > only used as bytes can appear in error messages, which means that we
> > need to keep the information somewhere on how to decode them.
> 
> It is no problem as long as the EXTENSION string is purely ASCII.

I do not think so.  I think that it needs to be encoded if it is mixed
with non-ASCII strings.  (Also, it could be set to something non-ASCII
as a customization, but that should be pretty rare.)

> > * many strings can come from documents, as character strings, or from
> > the command line, possibly kept encoded.  For example the document
> > file name can come from @setfilename or the command line (or a
> > customization variable).
> 
> This is a bigger problem as the filename could be non-ASCII, unlike
> the extension.
> 
> I will try to understand the code and run some tests after I install
> a non-UTF-8 locale.

You don't need a non-UTF-8 locale for the issue above, nor for the issue
that prompted me to look seriously at all this, which is
tests/formatting/list-of-tests non_ascii_test_epub.
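The "decode everything, encode at the boundary" option could look
roughly like the following for the -o argument.  This is a hedged
sketch: the helper names and the fixed 'UTF-8' codeset are assumptions
for illustration (the codeset would normally come from the locale, e.g.
via I18N::Langinfo's CODESET), not the actual texi2any option handling:

```perl
# Sketch of decoding command-line arguments on input and encoding file
# names on output.  decode_cmdline_args() and output_dir_bytes() are
# hypothetical helpers, not texi2any functions.
use strict;
use warnings;
use Encode qw(decode encode);

# Decode every argument once, as early as possible, so that the rest of
# the program only ever sees character strings.
sub decode_cmdline_args {
    my ($codeset, @args) = @_;
    return map { decode($codeset, $_) } @args;
}

# Encode a character-string directory name just before it is used in a
# filesystem call such as mkdir().
sub output_dir_bytes {
    my ($codeset, $dir) = @_;
    return encode($codeset, $dir);
}

# Typical use:
#   my @argv = decode_cmdline_args('UTF-8', @ARGV);
#   ... option parsing sets $outdir from @argv ...
#   mkdir output_dir_bytes('UTF-8', $outdir);
```

With this layout, the SUBDIR/OUTPUT value is a character string
everywhere inside the program, so it can be mixed freely with strings
from the document or with error message text, and only the final mkdir
sees bytes.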
Having an accented letter in the document name makes it very hard to
determine what should be encoded or decoded in init/epub3.pm and in the
upstream code, in particular in determine_files_and_directory() in
Texinfo/Convert/Converter.pm.  Although I previously thought it could be
solved in that function alone, it is not so simple: strings come from
everywhere in init/epub3.pm.

> > * it is much simpler to require customization variables from init
> > files to be character strings, which means that we need an API to
> > encode those we want to mix with bytes, and we cannot do this early,
> > so it means more complexity.
> >
> > For all those reasons, I really think that we should use character
> > strings almost everywhere and encode when needed, such that there is
> > no need to track down where a string comes from to be sure whether it
> > is encoded or not.  We already decode and encode in many places, as
> > we have file names used in error messages combined with character
> > strings, and character strings from Texinfo manuals that need to be
> > encoded.  The gain of avoiding decoding and encoding a few strings is
> > not worth, in my opinion, the complexity of having strings that
> > cannot be mixed.
> >
> > In some cases we can still decide to deal with encoded strings, but I
> > think that it should only be when we are sure that they will not ever
> > be mixed with decoded character strings.
> 
> I hope the complexity in dealing with filename encodings can be kept to
> a minimum.  Doing it the way you say might be simpler but we should
> check that a few use cases worked.  I want to see if any issues can be
> fixed with the existing approach.
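To make the comparison concrete: keeping some strings encoded means
every file name has to carry its byte form, its character form, and its
encoding, so that filesystem code and message code can each pick the
right one.  A hypothetical sketch of that bookkeeping (none of these
names exist in Converter.pm or epub3.pm):

```perl
# Hypothetical bookkeeping for the "keep some strings encoded" option:
# a file name record with both forms plus the encoding needed to
# re-decode it later for error messages or output documents.
use strict;
use warnings;
use Encode qw(decode encode);

sub make_file_name_info {
    my ($char_name, $encoding) = @_;
    return {
        name     => $char_name,                     # for messages, HTML output
        bytes    => encode($encoding, $char_name),  # for open()/mkdir()
        encoding => $encoding,                      # to re-decode if needed
    };
}

# Any code joining such a name with another string has to pick the right
# field; getting this wrong anywhere reintroduces the mixing bug, which
# is exactly the per-string tracking work described above.
sub join_in_message {
    my ($info, $text) = @_;
    return $info->{name} . ': ' . $text;            # character strings only
}
```

Every string flowing into epub3.pm would need to be wrapped this way,
which illustrates why following each string's origin became too much
work.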
If you can solve the two issues above with the existing approach, I
could reconsider my position.  But I tried to fix init/epub3.pm, and I
started to follow the origin of every string, and it became too much
work; in addition, for many strings it was hard to decide whether it
would be best to encode them or not, already in epub3.pm alone, without
considering something more complex like HTML.pm + Converter.pm.

-- 
Pat
