[Bug c/67224] UTF-8 support for identifier names in GCC

ejolson at unr dot edu Tue, 18 Aug 2015 15:55:08 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67224


--- Comment #18 from Eric <ejolson at unr dot edu> ---
Thanks, Joseph, for the clarification about the two different versions of iconv.
I was admittedly confused about this until moments ago.  Anyway, I just
discovered that libiconv doesn't support conversions to and from the IBM1047
EBCDIC character set, and this causes some of the regression tests to fail.
Coupled with the fact that the glibc version of iconv doesn't support the C99
encoding, this creates a little problem with my patch.

You mention a bigger problem which I had not thought about:  the C++ semantics
of raw strings.  Processing UCNs in C++ code apparently requires surprisingly
deep syntactic analysis.  Raw literals seem to appear in the gnu99 and gnu11
extensions to C as well.

Amusingly, if I understand the C++ specification correctly,

    www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3337.pdf

trigraphs are supposed to be interpreted before any other processing takes
place.  However, the simple code

    #include <stdio.h>
    int main(){
        char p1[]="??/u00E4";
        char p2[]=R"(??/u00E4)";
        char p3[]=R"(\u00E4)";
        printf("%s or %s or %s\n",p1,p2,p3);
        return 0;
    }

compiled with

    $ g++ -std=c++11 pp.c

produces output

    ä or ??/u00E4 or \u00E4

which illustrates that g++ does not process trigraphs inside raw string
literals.  Admittedly I'm looking at the draft standard, but I don't think this
is something which changed suddenly in the final draft.  Clearly, my patch
makes a further mess of raw string literals in gcc.  My first reaction is that
raw string literals were not well thought out, but I suppose bad standards are
sometimes better than no standards.  At any rate, there appears to be no easy
way of supporting both UTF-8 identifiers and raw string literals.

My plan for now is to take a break and keep my UTF-8 identifier support as a
one-line patch reliant on libiconv which breaks EBCDIC encodings and raw string
literals.
