Re: Non-ASCII characters in @include search path

Patrice Dumas Sun, 20 Feb 2022 10:35:03 -0800

On Sun, Feb 20, 2022 at 05:27:51PM +0000, Gavin Smith wrote:
> > For example, in the following the file name is output correctly as it is
> > not decoded, but the string from the Texinfo file is decoded but not
> > encoded and hence ends up incorrect in the message.  Decoding everything
> > and then encoding the error messages should allow mixing strings from
> > different sources and different encodings.
> > 
> > $ ./texi2any.pl testé.texi
> > testé.texi:8: warning: node `�sseul�' unreferenced
> 
> Suppose the translation for the word "node" was non-ASCII.  I'd expect
> the translation for that word to be encoded correctly in the output, even
> if the node name weren't.
> 
> I haven't been able to test it yet but there is a translation in French:
> 
> #: tp/Texinfo/Structuring.pm:429
> #, perl-format
> msgid "node `%s' unreferenced"
> msgstr "nœud « %s » non référencé"
> 
> If the error message became something like
> 
> "nœud « �sseul� » non référencé"


Which is the case:
testé.texi:8: warning: nœud « �sseul� » non référencé

> then encoding this to UTF-8 would break the parts which already were in
> UTF-8.

Indeed.
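
To make both failure modes concrete, here is a small sketch in Python
rather than Perl. It is only an illustration, not texi2any code: the
variable names are made up, and it assumes that gettext hands back
UTF-8 bytes while the node name has already been decoded into a
character string.

```python
# Illustration only: a Python stand-in for what happens in Perl.
# Assumption: gettext returned the translation as UTF-8 bytes.
translated = "nœud « %s » non référencé".encode("utf-8")
# Assumption: the node name was decoded when the Texinfo file was read.
node_name = "ésseulé"

# Perl splices both together and sees each byte of the translation as one
# character; decoding as Latin-1 mimics that one-byte-per-character view.
mixed = translated.decode("latin-1") % node_name

# Case 1: no encoding on output (characters below 256 go out as single
# bytes). The translation is fine, the node name is not valid UTF-8.
not_encoded = mixed.encode("latin-1")

# Case 2: encode the whole message to UTF-8. The node name is now fine,
# but the translation has been encoded twice.
double_encoded = mixed.encode("utf-8")
```

Read back as UTF-8 with replacement characters, not_encoded gives
"nœud « �sseul� » non référencé", which is the output shown above.
In double_encoded the node name is correct but "nœud" no longer is.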

> The only way out would seem to be different use of the gettext functions.
> 
> I don't see that there is an option in Locale::Messages or Locale::TextDomain
> to get "unencoded" output, that is in Perl's internal string format.  The
> closest that could be done is to always output to UTF-8, possibly set the
> UTF-8 flag on the resulting string, and then convert this to the final message
> encoding at the end.

I think there is another way, one that is already used in gdt in
tp/Texinfo/Translations.pm: there we decode all the translated
messages, using bind_textdomain_filter and Encode::decode().  We could
do the same for error message translations, as long as we know the
locale encoding.
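
A rough sketch of that idea, again in Python for illustration only.
The function names decode_translation and format_warning are
hypothetical and do not exist in texi2any; decode_translation only
plays the role that a filter installed with bind_textdomain_filter
would play. The ISO-8859-15 locale is an assumption chosen to show
that the locale encoding does not have to be UTF-8.

```python
def decode_translation(raw: bytes, locale_encoding: str) -> str:
    """Hypothetical filter: turn the bytes from gettext into a string."""
    return raw.decode(locale_encoding)


def format_warning(raw_translation: bytes, locale_encoding: str,
                   node_name: str, output_encoding: str) -> bytes:
    """Build the message from decoded parts, then encode it exactly once."""
    template = decode_translation(raw_translation, locale_encoding)
    message = template % node_name          # both sides are decoded strings
    return message.encode(output_encoding)  # single encoding step at output


# Assumed setup: the catalog is served in ISO-8859-15, output is UTF-8.
raw = "nœud « %s » non référencé".encode("iso-8859-15")
out = format_warning(raw, "iso-8859-15", "ésseulé", "utf-8")
```

Since every part is decoded before interpolation, the encoding of the
translation and the encoding of the final message can differ.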

> So my best idea at the moment for fixing the encoding of the error messages is:
> * When calling gettext and related functions, always demand UTF-8, and convert
> this back into Perl's internal coding afterwards.

Why demand UTF-8?  Any encoding should be OK.

> * Convert the messages at the time they are output.
> 
> For example, if a node name is in EUC-JP, this would be converted (internally)
> into UTF-8 when the file is read.  

Why not convert it to Perl's internal encoding?
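
As an illustration of the difference (Python again, with a made-up
node name, not texi2any code): once the EUC-JP bytes are decoded, the
string is no longer tied to any encoding, and the whole message can be
encoded to whatever the user wants at output time.

```python
# Assumption: these are the bytes of a node name read from an EUC-JP file.
raw_node_name = "日本語".encode("euc_jp")

# Decode once on input; the result is an encoding-independent string.
node_name = raw_node_name.decode("euc_jp")

# Interpolate into an already decoded message template.
message = "node `%s' unreferenced" % node_name

# Encode the whole message once, in whichever encoding is requested.
as_euc_jp = message.encode("euc_jp")
as_utf8 = message.encode("utf-8")
```

Both byte strings decode back to the same message; only the final
encoding step differs.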

> The node name would then be easily
> interpolable into a UTF-8 error message.  If the user actually wanted error
> messages to be printed in EUC-JP, then the whole error message would be
> output at the end.
> 
> As far as filename encoding goes, I suspect that use of filenames in messages
> is something that is limited in the source code so decoding of filenames
> may be something that can be limited.
> 
> 
> 
> > testé.texi:8: warning: node `' unreferenced
> > 
> > 
> > $ cat testé.texi 
> > \input texinfo.tex
> > 
> > @setfilename testé.info
> > 
> > @node Top
> > @top Testé
> > 
> > @node ésseulé
> > 
> > @node Chapitré
> > @chapter Chapitré
