On Sun, Feb 20, 2022 at 05:27:51PM +0000, Gavin Smith wrote:
> > For example, in the following the file name is output correctly as it is
> > not decoded, but the string from the Texinfo file is decoded but not
> > encoded and hence ends up incorrect in the message. Decoding everything
> > and then encoding the error messages should allow to mix strings from
> > different sources and different encodings.
> >
> > $ ./texi2any.pl testé.texi
> > testé.texi:8: warning: node `�sseul�' unreferenced
>
> Suppose the translation for the word "node" was non-ASCII. I'd expect
> the translation for that word to be encoded correctly in the output, even
> if the node name weren't.
>
> I haven't been able to test it yet but there is a translation in French:
>
> #: tp/Texinfo/Structuring.pm:429
> #, perl-format
> msgid "node `%s' unreferenced"
> msgstr "nœud « %s » non référencé"
>
> If the error message became something like
>
> "nœud « �sseul� » non référencé"

Which is the case:

testé.texi:8: warning: nœud « �sseul� » non référencé

> then encoding this to UTF-8 would break the parts which already were in
> UTF-8.

Indeed.

> The only way out would seem to be different use of the gettext functions.
>
> I don't see that there is an option in Locale::Messages or Locale::TextDomain
> to get "unencoded" output, that is in Perl's internal string format. The
> closest that could be done is to always output to UTF-8, possibly set the
> UTF-8 flag on the resulting string, and then convert this to the final
> message encoding at the end.

I think that there is another way, which is actually already used in
tp/Texinfo/Translations.pm: in gdt, we already decode all the translated
messages, using bind_textdomain_filter and Encode::decode(). We could do
the same for error message translations, as long as we know the locale
encoding.

> So my best idea at the moment for fixing the encoding of the error messages
> is:
> * When calling gettext and related functions, always demand UTF-8, and convert
>   this back into Perl's internal coding afterwards.

Why demand UTF-8? Any encoding should be OK.

> * Convert the messages at the time they are output.
>
> For example, if a node name is in EUC-JP, this would be converted (internally)
> into UTF-8 when the file is read.

Why not convert it to the internal Perl encoding?

> The node name would then be easily
> interpolable into a UTF-8 error message. If the user actually wanted error
> messages to be printed in EUC-JP, then the whole error message would be
> output at the end.
>
> As far as filename encoding goes, I suspect that use of filenames in messages
> is something that is limited in the source code so decoding of filenames
> may be something that can be limited.
>
> >
> > testé.texi:8: warning: node `' unreferenced
> >
> >
> > $ cat testé.texi
> > \input texinfo.tex
> >
> > @setfilename testé.info
> >
> > @node Top
> > @top Testé
> >
> > @node ésseulé
> >
> > @node Chapitré
> > @chapter Chapitré
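
To make the decode-everything-then-encode-at-output idea above concrete, here is a minimal sketch using only the core Encode module. It is not the texi2any code: decode_translation is a hypothetical helper standing in for the kind of filter that could be registered with bind_textdomain_filter, and the encodings (UTF-8 for the catalog and the output, ISO-8859-1 for the document) are assumptions chosen for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical filter (name and signature are assumptions): turn the
# bytes returned by gettext into a Perl character string.
sub decode_translation {
    my ($bytes, $encoding) = @_;
    return decode($encoding, $bytes);
}

# Bytes as they might arrive from two different sources.
# Translated format "nœud « %s » non référencé", as UTF-8 bytes:
my $msg_bytes = encode('UTF-8',
    "n\x{153}ud \x{ab} %s \x{bb} non r\x{e9}f\x{e9}renc\x{e9}");
# Node name "ésseulé" from a document assumed to be in ISO-8859-1:
my $node_bytes = encode('ISO-8859-1', "\x{e9}sseul\x{e9}");

# Decode each input with its own encoding; after that, the strings
# can be mixed freely in Perl's internal representation.
my $format  = decode_translation($msg_bytes, 'UTF-8');
my $node    = decode('ISO-8859-1', $node_bytes);
my $message = sprintf($format, $node);

# Encode exactly once, at output time, to the message encoding
# (assumed here to be UTF-8; it could be any locale encoding).
my $output_bytes = encode('UTF-8', $message);
print $output_bytes, "\n";
```

Because every input is decoded on entry, the encoding of the catalog does not matter, and the final encode() call is the only place that needs to know the locale encoding.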
