On Tue, Nov 17, 2020 at 12:01 PM David Malcolm via Gcc <gcc@gcc.gnu.org> wrote:
>
> As far as I can tell, GCC's diagnostic output on stderr is a mixture of
> bytes from various different places in our internal representation:
> - filenames
> - format strings from diagnostic messages (potentially translated via
>   .po files)
> - identifiers
> - quoted source code
> - fix-it hints
> - labels
>
> As noted in https://gcc.gnu.org/onlinedocs/cpp/Character-sets.html
> source files can be in any character set, specified by -finput-charset=,
> and libcpp converts that to the "source character set", Unicode,
> encoding it internally as UTF-8.  String and character constants are
> then converted to the execution character set (defaulting to
> UTF-8-encoded Unicode).  In many places we use identifier_to_locale to
> convert from the "internal encoding" to the locale character set,
> falling back to converting non-ASCII characters to UCNs.  I suspect
> that there are numerous places where we're not doing that, but ought
> to be.
>
> The only test coverage I could find for -finput-charset is
> gcc.dg/ucnid-16-utf8.c, which has a latin1-encoded source file, and
> verifies that a latin-1 encoded variable name becomes UTF-8 encoded in
> the resulting .s file.  I shudder to imagine a DejaGnu test for a
> source encoding that's not a superset of ASCII (e.g. UCS-4) - how
> would the dg directives be handled?  I wonder if DejaGnu can support
> tests in which the compiler's locale is overridden with environment
> variables (and thus having e.g. non-ASCII/non-UTF-8 output).
>
> What is the intended encoding of GCC's stderr?
I don't really have the context to comment much on this, since I just
know what I tried to figure out while adding the initial support for
UTF-8 identifiers, but I thought I would note a few things I have come
across that are relevant.

One thing is that you actually can't use -finput-charset with an
encoding that is not a superset of ASCII.  I was confused by this and
filed PR92987 about it
(https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92987), but Joseph
explained that this is expected.

Another tidbit: it's actually hard to test some of this with DejaGnu
because Tcl does some silent conversion without telling you.  In
particular, if it sees command output that looks like latin1, it will
convert it to UTF-8 behind the scenes before presenting it to you,
irrespective of the current locale.  I ran across that when
constructing the test cases for the patch attached to PR93067
(https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93067#c0) and had to find
an encoding that satisfied both properties (superset of ASCII, not
looking like latin1) to use for that test.

> In gcc_init_libintl we call:
>
> #if defined HAVE_LANGINFO_CODESET
>   locale_encoding = nl_langinfo (CODESET);
>   if (locale_encoding != NULL
>       && (!strcasecmp (locale_encoding, "utf-8")
>           || !strcasecmp (locale_encoding, "utf8")))
>     locale_utf8 = true;
> #endif
>
> so presumably stderr ought to be nl_langinfo (CODESET).
>
> We use the above to potentially use the UTF-8 encoding of U+2018 and
> U+2019 for open/close quotes, falling back to ASCII for these.
> As far as I can tell, we currently:
> - blithely accept and emit filenames as bytes (I don't think we make
>   any attempt to enforce that they're any particular encoding)
> - emit format strings in whatever encoding gettext gives us
> - emit identifiers as char * from IDENTIFIER_POINTER, calling
>   identifier_to_locale on them in many places, but I suspect we're
>   missing some
> - blithely emit quoted source code as raw bytes (this is PR
>   other/93067, which has an old patch attached; presumably the source
>   ought to be emitted to stderr in the locale encoding)

When I was first trying to understand this stuff for supporting UTF-8
identifiers, I discussed some of it with Joseph as well, here:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67224#c28

I agree there are two missing conversions: source should be converted
to UTF-8 when read for diagnostics (that's PR93067; by the way, I have
updated the patch there, so it's no longer old), and it should also be
converted to the current locale on output.  The second part is a bit
involved, though, because we presumably do not want to disturb the
alignment of labels and such.  So it would seem that when a given
character cannot be output, it should be converted to as many '?'
characters as needed to fill out its display width, or something along
those lines.  Presumably the most common, if not the only practical,
case is UTF-8 source being processed in an ASCII locale; currently the
user gets UTF-8 output anyway.  But it can be done in general too.

I could work on this if you like; it would probably be good to
finalize the PR93067 patch first.  Should I prepare it and post it
here?  I had left it on the PR for feedback because I wasn't sure
whether the approach I took was OK.
> - fix-it hints can contain identifiers as char * from
>   IDENTIFIER_POINTERs, which is likely UTF-8; I think I'm failing to
>   call identifier_to_locale on them
> - labels can contain type names, which are likely UTF-8, and I'm
>   probably failing to call identifier_to_locale on them
>
> So I think our current policy is:
> - we assume filenames are encoded in the locale encoding, and pass
>   them through as bytes with no encode/decode
> - we emit to stderr in the locale encoding (but there are likely bugs
>   where we don't re-encode from UTF-8 to the locale encoding)
>
> Does this sound correct?

I also wanted to ask: if we had a general system that always converts
diagnostic output to the current locale, would identifier_to_locale()
become unnecessary in most cases?  That could be a nice
simplification, as it's currently called in ~60 places.  It would be a
behavior change, though: right now you get hex escapes, and with this
change you would get question marks.  I don't think it's possible to
use hex escapes and also preserve the horizontal alignment, other than
with an approach like the one identifier_to_locale() takes, where the
diagnostic message is adjusted before the diagnostics infrastructure
ever sees it.

-Lewis

> My motivation here is the discussion in [1] and [2] of supporting
> Emacs via an alternative output format for machine-readable fix-it
> hints, which has made me realize that I didn't understand our current
> approach to encodings as well as I would like.
>
> Hope this is constructive
> Dave
>
> [1] https://debbugs.gnu.org/cgi/bugreport.cgi?bug=25987
> [2] https://gcc.gnu.org/pipermail/gcc-patches/2020-November/559105.html