On Mon, Feb 21, 2022 at 08:46:56PM +0000, Gavin Smith wrote:
> On Sun, Feb 20, 2022 at 10:32:00PM +0100, Patrice Dumas wrote:
> > On Sun, Feb 20, 2022 at 05:27:51PM +0000, Gavin Smith wrote:
> > > If the error message became something like
> > >
> > > "nœud « �sseul� » non référencé"
> > >
> > > then encoding this to UTF-8 would break the parts which already were in
> > > UTF-8.
> >
> > I just commited input decoding (command line, environment, translated
> > messages) and output messages encoding.  I left file names as is, but
> > prepared a customization variable for them.
> >
> > Now the error message is:
> >
> > testé.texi:8: warning: nœud « ésseulé » non référencé
>
> One way of fixing this would be to store the filename separately along with
> the rest of the error message, and prepend the filename when it is output.
> I can try to implement this.

I am reviewing the code to find where we mix file names that will be
used as bytes at some point and character strings, and it is very
common.

* unless I missed something, string constants are character strings.
  If they are to appear mostly in file names we need to encode them at
  some point, but it does not seem easy to me to decide when, except
  when we are sure that the string will only be considered as a byte
  sequence from then on.

* many strings can come from documents, as character strings, or from
  the command line, possibly kept encoded.  For example the document
  file name can come from @setfilename or the command line (or a
  customization variable).

* many strings are used both in file names and in texts.  For example
  the customization variable 'EXTENSION'.  Even strings that are almost
  only used as bytes can appear in error messages, which means that we
  need to keep the information somewhere on how to decode them.

* it is much simpler to require customization variables from init
  files to be character strings, which means that we need an API to
  encode those we want to mix with bytes, and we cannot do this early,
  so it means more complexity.

For all those reasons, I really think that we should use character
strings almost everywhere and encode when needed, such that there is
no need to track down where a string comes from to be sure whether it
is encoded or not.  We already decode and encode in many places, as we
have file names used in error messages combined with character
strings, and character strings from Texinfo manuals that need to be
encoded.  The gain from avoiding decoding and encoding a few strings
is not worth, in my opinion, the complexity of having strings that
cannot be mixed.

In some cases we can still decide to keep encoded strings, but I think
that it should only be if we are sure that they will never be mixed
with decoded character strings.

-- 
Pat
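
To illustrate the "character strings almost everywhere, encode when
needed" approach argued for above, here is a minimal sketch.  It is
written in Python rather than in the Perl used by texi2any, every
function name in it is hypothetical, and UTF-8 is assumed for both
input and output; it only shows the shape of the idea, not the actual
code.  The last part mimics, roughly, what happens when bytes that are
still encoded are treated as characters and then encoded again.

```python
# Hypothetical sketch: none of these names exist in texi2any.

def decode_input(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode bytes from the command line or environment once, on input."""
    return raw.decode(encoding)


def output_file_name(base: str, extension: str) -> str:
    """Character strings can be combined without tracking their origin."""
    return base + "." + extension


def format_warning(file_name: str, line: int, text: str) -> str:
    """Build a message from a file name and a (translated) text."""
    return f"{file_name}:{line}: warning: {text}"


def encode_for_output(string: str, encoding: str = "utf-8") -> bytes:
    """Encode only at the point where bytes are really needed."""
    return string.encode(encoding)


# Bytes as they could be received from the command line (assumed UTF-8).
raw_name = "testé.texi".encode("utf-8")
text = "nœud « ésseulé » non référencé"

# Decode early, work with character strings, encode late.
name = decode_input(raw_name)
message = format_warning(name, 8, text)
message_bytes = encode_for_output(message)
html_name_bytes = encode_for_output(output_file_name("testé", "html"))

# Failure mode: the file name is kept as bytes, each byte is taken as
# one character (Latin-1), and the mixed string is then encoded.
mixed = format_warning(raw_name.decode("latin-1"), 8, text)
mixed_bytes = encode_for_output(mixed)  # file name is now double encoded
```

With everything decoded on input, message_bytes is the expected UTF-8
byte sequence, while mixed_bytes differs from it because the file name
part has been encoded twice.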
