Re: Non-ASCII characters in @include search path

Patrice Dumas Fri, 25 Feb 2022 15:18:06 -0800

On Mon, Feb 21, 2022 at 08:46:56PM +0000, Gavin Smith wrote:
> On Sun, Feb 20, 2022 at 10:32:00PM +0100, Patrice Dumas wrote:
> > On Sun, Feb 20, 2022 at 05:27:51PM +0000, Gavin Smith wrote:
> > > If the error message became something like
> > > 
> > > "nœud « �sseul� » non référencé"
> > > 
> > > then encoding this to UTF-8 would break the parts which already were in
> > > UTF-8.
> > 
> > I just committed input decoding (command line, environment, translated
> > messages) and output messages encoding.  I left file names as is, but
> > prepared a customization variable for them.
> > 
> > Now the error message is:
> > 
> > testÃ©.texi:8: warning: nœud « ésseulé » non référencé
> 
> One way of fixing this would be to store the filename separately along with
> the rest of the error message, and prepend the filename when it is output.
> I can try to implement this.


I am reviewing the code to find where we mix file names that will be
used as bytes at some point with character strings, and such mixing is
very common.

* unless I missed something, string constants are character strings.  If
  they are to appear mostly in file names we need to encode them at some
  point, but it does not seem easy to me to decide when, except when we
  are sure that the string will only be treated as a byte sequence from
  then on.
* many strings can come either from documents, as character strings, or
  from the command line, possibly still encoded.  For example, the
  document file name can come from @setfilename or from the command line
  (or a customization variable).
* many strings are used both in file names and in texts.  For example
  the customization variable 'EXTENSION'.  Even strings that are almost
  only used as bytes can appear in error messages, which means that we
  need to keep the information somewhere on how to decode them.
* it is much simpler to require customization variables from init
  files to be character strings, which means that we need an API to
  encode those we want to mix with bytes, and since we cannot do this
  early, it means more complexity.
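
To make the mixing problem concrete, here is a minimal sketch.  It is
written in Python only because its bytes/str split makes the two kinds of
strings explicit; texi2any itself is Perl, and the function names
naive_message and careful_message are hypothetical, not taken from the
Texinfo sources.  The first function treats each byte of an encoded file
name as one character, which is, as far as I understand, what effectively
happens in Perl when a byte string is concatenated with a character
string.  The second decodes the file name with its actual encoding first.

```python
def naive_message(file_name_bytes: bytes, line: int, text: str) -> str:
    """Build a message by treating every byte of the file name as a character."""
    # Decoding as Latin-1 maps each byte to the code point of the same value,
    # so a UTF-8 encoded "é" (bytes 0xC3 0xA9) becomes the two characters "Ã©".
    file_name = file_name_bytes.decode("latin-1")
    return f"{file_name}:{line}: warning: {text}"


def careful_message(file_name_bytes: bytes, line: int, text: str,
                    file_name_encoding: str = "utf-8") -> str:
    """Decode the file name with its real encoding before mixing it with text."""
    file_name = file_name_bytes.decode(file_name_encoding)
    return f"{file_name}:{line}: warning: {text}"


# Example input: the file name arrives as bytes, the text is already characters.
raw_name = "testé.texi".encode("utf-8")
text = "nœud « ésseulé » non référencé"

garbled = naive_message(raw_name, 8, text)    # file name part is garbled
correct = careful_message(raw_name, 8, text)  # file name part is intact
```

With these inputs, garbled starts with the same broken file name as the
warning quoted above, while correct starts with testé.texi.  The catch is
that careful_message has to be told the encoding, which is exactly the
information that needs to be kept somewhere.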

For all those reasons, I really think that we should use character
strings almost everywhere and encode when needed, such that there is
no need to track down where a string comes from to be sure whether it
is encoded or not.  We already decode and encode in many places, since
file names used in error messages are combined with character strings,
and character strings from Texinfo manuals need to be encoded.  The gain
from not decoding and encoding a few strings does not, in my opinion,
outweigh the complexity of having strings that cannot be mixed.
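
A minimal sketch of that approach: decode once on input, use character
strings everywhere, and encode only at the point where bytes are
required.  Again this is Python used for illustration, not texi2any code.
The helper names and the two *_ENCODING variables are hypothetical
stand-ins for whatever customization variables end up being used.

```python
# Hypothetical customization variables; the real names are not given here.
INPUT_ENCODING = "utf-8"             # encoding of command-line arguments
OUTPUT_FILE_NAME_ENCODING = "utf-8"  # encoding used for file names on disk


def decode_input(raw: bytes, encoding: str = INPUT_ENCODING) -> str:
    """Decode bytes coming from outside (command line, environment) once."""
    return raw.decode(encoding)


def output_file_name(base_name: str, extension: str) -> str:
    """Combine character strings freely; no encoding is involved yet."""
    return f"{base_name}.{extension}"


def encode_file_name(name: str,
                     encoding: str = OUTPUT_FILE_NAME_ENCODING) -> bytes:
    """Encode only when the name is about to be used as bytes."""
    return name.encode(encoding)


# The base name may come from the command line (bytes, decoded on entry) ...
base_from_command_line = decode_input("testé".encode("utf-8"))
# ... or from the document, e.g. @setfilename (already a character string).
base_from_document = "testé"

# Both sources give the same character string, so they can be mixed safely.
name = output_file_name(base_from_command_line, "html")
# The same string serves for messages and, once encoded, for file access.
message = f"{name}: warning: example message"
file_name_bytes = encode_file_name(name, "iso-8859-1")
```

Nothing in the middle needs to know where a string came from; only
decode_input and encode_file_name deal with encodings.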

In some cases we can still decide to keep strings encoded, but I think
that should only happen if we are sure that they will never be mixed
with decoded character strings.

-- 
Pat
