On Tue, Nov 17, 2020 at 12:01 PM David Malcolm via Gcc <gcc@gcc.gnu.org> wrote:
>
> As far as I can tell, GCC's diagnostic output on stderr is a mixture of
> bytes from various different places in our internal representation:
> - filenames
> - format strings from diagnostic messages (potentially translated via
>   .po files)
> - identifiers
> - quoted source code
> - fix-it hints
> - labels
>
> As noted in https://gcc.gnu.org/onlinedocs/cpp/Character-sets.html
> source files can be in any character set, specified by -finput-charset=,
> and libcpp converts that to the "source character set", Unicode,
> encoding it internally as UTF-8.  String and character constants are
> then converted to the execution character set (defaulting to
> UTF-8-encoded Unicode).  In many places we use identifier_to_locale to
> convert from the "internal encoding" to the locale character set,
> falling back to converting non-ASCII characters to UCNs.  I suspect
> that there are numerous places where we're not doing that, but ought
> to be.
>
> The only test coverage I could find for -finput-charset is
> gcc.dg/ucnid-16-utf8.c, which has a latin1-encoded source file, and
> verifies that a latin-1 encoded variable name becomes UTF-8 encoded in
> the resulting .s file.  I shudder to imagine a DejaGnu test for a
> source encoding that's not a superset of ASCII (e.g. UCS-4) - how
> would the dg directives be handled?  I wonder if DejaGnu can support
> tests in which the compiler's locale is overridden with environment
> variables (and thus having e.g. non-ASCII/non-UTF-8 output).
>
> What is the intended encoding of GCC's stderr?
I don't really have the context to comment much on this, since I just
know what I tried to figure out while adding the initial support for
UTF-8 identifiers, but I thought I would note a few things I have come
across that are relevant.

One thing is that you actually can't use -finput-charset with an
encoding that is not a superset of ASCII.  I was confused by this and
filed PR92987 about it
(https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92987), but Joseph
explained that this is expected.

Another tidbit: it's actually hard to test some of this with DejaGnu
because Tcl does some silent conversion without telling you.  In
particular, if it sees command output that looks like latin1, it will
convert it to UTF-8 behind the scenes before presenting it to you,
irrespective of the current locale.  I ran across that when
constructing the test cases for the patch attached to PR93067
(https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93067#c0) and had to find
an encoding that satisfied both properties (superset of ASCII, not
looking like latin1) to use for that test.

> In gcc_init_libintl we call:
>
> #if defined HAVE_LANGINFO_CODESET
>   locale_encoding = nl_langinfo (CODESET);
>   if (locale_encoding != NULL
>       && (!strcasecmp (locale_encoding, "utf-8")
>           || !strcasecmp (locale_encoding, "utf8")))
>     locale_utf8 = true;
> #endif
>
> so presumably stderr ought to be nl_langinfo (CODESET).
>
> We use the above to potentially use the UTF-8 encoding of U+2018 and
> U+2019 for open/close quotes, falling back to ASCII for these.
> As far as I can tell, we currently:
> - blithely accept and emit filenames as bytes (I don't think we make
>   any attempt to enforce that they're any particular encoding)
> - emit format strings in whatever encoding gettext gives us
> - emit identifiers as char * from IDENTIFIER_POINTER, calling
>   identifier_to_locale on them in many places, but I suspect we're
>   missing some
> - blithely emit quoted source code as raw bytes (this is PR
>   other/93067, which has an old patch attached; presumably the source
>   ought to be emitted to stderr in the locale encoding)

When I was first trying to understand this stuff for supporting UTF-8
identifiers, I discussed some of it with Joseph as well, here:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67224#c28

I agree there are two missing conversions: source should be converted
to UTF-8 when read for diagnostics (that's PR93067; by the way, I have
updated the patch there, so it's no longer old), and it should also be
converted to the current locale on output.  The second part is a bit
involved, though, because we presumably do not want to disturb the
alignment of labels and such.  So it would seem that when a given
character cannot be output, it should be converted to as many '?'
characters as needed to fill out its display width, or something along
those lines.  Presumably the most common, if not the only practical,
case is UTF-8 source being processed in an ASCII locale; currently the
user gets UTF-8 output anyway.  But it can be done in general too.

I could work on this if you like; it would probably be good to
finalize the PR93067 patch first.  Should I prepare it and post it
here?  I had left it on the PR for feedback because I wasn't sure
whether the approach I took was OK.
> - fix-it hints can contain identifiers as char * from
>   IDENTIFIER_POINTERs, which is likely UTF-8; I think I'm failing to
>   call identifier_to_locale on them
> - labels can contain type names, which are likely UTF-8, and I'm
>   probably failing to call identifier_to_locale on them
>
> So I think our current policy is:
> - we assume filenames are encoded in the locale encoding, and pass
>   them through as bytes with no encode/decode
> - we emit to stderr in the locale encoding (but there are likely bugs
>   where we don't re-encode from UTF-8 to the locale encoding)
>
> Does this sound correct?

I also wanted to ask: if we had a general system that always converts
diagnostic output to the current locale, would identifier_to_locale()
become unnecessary in most cases?  That could be a nice
simplification, as it's currently called in ~60 places.  It would be a
behavior change, though: right now you get hex escapes, and with this
change you would get question marks.  I don't think it's possible to
use hex escapes and also preserve the horizontal alignment, other than
with an approach like the one identifier_to_locale() takes, where the
diagnostic message is adjusted before the diagnostics infrastructure
ever sees it.

-Lewis

> My motivation here is the discussion in [1] and [2] of supporting
> Emacs via an alternative output format for machine-readable fix-it
> hints, which has made me realize that I didn't understand our current
> approach to encodings as well as I would like.
>
> Hope this is constructive
> Dave
>
> [1] https://debbugs.gnu.org/cgi/bugreport.cgi?bug=25987
> [2] https://gcc.gnu.org/pipermail/gcc-patches/2020-November/559105.html