UTF-8 quotation marks in diagnostics

D. Hugh Redelmeier Wed, 21 Oct 2015 15:24:13 -0700

Several of us don't want UTF-8 quotation marks in diagnostics in our 
environment (Jove subshells).  We'd like a way to turn them off.  We don't 
think that they are a bad idea but they are bad in our environment.


<https://gcc.gnu.org/gcc-4.0/changes.html>

        English-language diagnostic messages will now use Unicode
        quotation marks in UTF-8 locales. (Non-English messages
        already used the quotes appropriate for the language in
        previous releases.) If your terminal does not support UTF-8
        but you are using a UTF-8 locale (such locales are the default
        on many GNU/Linux systems) then you should set LC_CTYPE=C in
        the environment to disable that locale. Programs that parse
        diagnostics and expect plain ASCII English-language messages
        should set LC_ALL=C. See Markus Kuhn's explanation of Unicode
        quotation marks for more information.

This suggests that LC_CTYPE=C would do what we want: go back to ` and
' instead of \342\200\230 and \342\200\231.
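
For reference, the octal escapes above are just the UTF-8 encodings of
U+2018 and U+2019. A minimal Python sketch to confirm that (the helper
name octal_escapes is made up for illustration):

```python
def octal_escapes(text):
    """Return the UTF-8 bytes of text as backslash-octal escapes."""
    return "".join("\\%03o" % b for b in text.encode("utf-8"))

print(octal_escapes("\u2018"))  # LEFT SINGLE QUOTATION MARK  -> \342\200\230
print(octal_escapes("\u2019"))  # RIGHT SINGLE QUOTATION MARK -> \342\200\231
```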

I find that a little confusing and scary.  I would expect that setting
LC_CTYPE=C would have the effect of changing the lexing done by the C
compiler.  For one thing, valid characters in strings would be
different.  This we don't want.

gcc(1) says:

        The LC_CTYPE environment variable specifies character
        classification.  GCC uses it to determine the character
        boundaries in a string; this is needed for some multibyte
        encodings that contain quote and escape characters that are
        otherwise interpreted as a string end or escape.

        The LC_MESSAGES environment variable specifies the language to
        use in diagnostic messages.


An experiment on my Fedora 20 system shows:

- LANG=en_CA.UTF-8 [correct]

- LC_CTYPE isn't set by default

- setting LC_CTYPE to C gets rid of the UTF-8 quotes in GCC diagnostics.
  That's surprising because the manpage doesn't say that it affects diagnostics.

- setting LC_MESSAGES to C DOES NOT get rid of the UTF-8 quotes in GCC
  diagnostics.
  That's surprising because the manpage does say that it affects diagnostics.
  I hope that it only affects compile-time diagnostics.
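
One possible reading of these results, assuming (nothing quoted here
confirms it) that GCC picks its quotation marks from the character
encoding of the LC_CTYPE category rather than from LC_MESSAGES: POSIX
resolves each locale category from LC_ALL first, then the category's
own variable, then LANG. The sketch below models only that lookup
order; effective_locale is a hypothetical helper, not part of any
library, and it says nothing about what GCC itself does.

```python
def effective_locale(category, env):
    """Resolve a locale category using the POSIX lookup order:
    LC_ALL, then the category variable, then LANG, else "C".
    Unset and empty variables are skipped."""
    for name in ("LC_ALL", category, "LANG"):
        value = env.get(name)
        if value:
            return value
    return "C"

base = {"LANG": "en_CA.UTF-8"}            # the Fedora 20 setup described above
with_messages = dict(base, LC_MESSAGES="C")
with_ctype = dict(base, LC_CTYPE="C")

# LC_MESSAGES=C leaves the character-encoding category untouched.
print(effective_locale("LC_CTYPE", with_messages))     # en_CA.UTF-8
print(effective_locale("LC_MESSAGES", with_messages))  # C
# LC_CTYPE=C changes the encoding category but not the message language.
print(effective_locale("LC_CTYPE", with_ctype))        # C
print(effective_locale("LC_MESSAGES", with_ctype))     # en_CA.UTF-8
```

Under that assumption the experiment is at least self-consistent: with
only LC_MESSAGES=C the encoding is still UTF-8, so the quotes stay;
with LC_CTYPE=C the encoding is no longer UTF-8, so they go away.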

That sure sounds like we should NOT set LC_CTYPE=C because of bad
side-effects: it changes how the program is lexed.  And the
documentation gives no basis for thinking that it would suppress those
UTF-8 quotes in messages (even though testing shows that this works).

That sure sounds like we should set LC_MESSAGES=C, but that doesn't work.

In our environment, our tool doesn't know that gcc is being invoked.
So the solution needs to be targeted.  That's why a solution like
GCC_COLORS would be good.  In fact, it could probably be hacked into
GCC_COLORS.

Man pages in section 1 that explicitly reference LC_CTYPE:
        enca
        enconv
        find
        gcc
        gnroff
        grep
        jove
        koi8rxterm
        less
        locale
        localedef
        nroff
        perl5004delta
        perl5160delta
        perl58delta
        perlfunc
        perllocale
        perltoc
        pico
        pilot
        sh
        systemd
        time
        tree
        uxterm
        xterm
So I feel uncomfortable setting it.

Man pages in section 1 that explicitly reference LC_MESSAGES:
        apropos
        aspell
        awk
        bash
        enca
        enconv
        find
        gawk
        gcc
        grep
        hunspell
        install-tl
        locale
        localectl
        localedef
        lynx
        man
        nmcli
        perllocale
        perltoc
        sh
        systemd
        systemd-firstboot
        time
        whatis
        xdg-desktop-icon
        xdg-desktop-menu
So setting this would hardly be safer.
