[Bug c/67224] UTF-8 support for identifier names in GCC

ejolson at unr dot edu Tue, 18 Aug 2015 12:24:56 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67224


--- Comment #14 from Eric <ejolson at unr dot edu> ---
While there may not be current demand for gcc to accept UTF-8 identifiers, the
fact that clang and Visual Studio support this C99 feature means source code
using Greek and accented characters in variable names is likely to become more
prevalent over time.

I have done a little testing to check by default whether string literals can
contain arbitrary 8-bit data.  This is used, for example, in legacy code which
directly includes graphics characters from CP437.  The original preprocessor
specifies "UTF-8" as the default input character set and "UTF-8" as the
internal character set.  Then, if the internal and working character sets are
identical no translation is done and arbitrary 8-bit data is passed through
cleanly.  A slight modification to my patch needs to be made to retain the same
behavior.  In particular, the patch now specifies both the internal and default
input character sets to be "C99" so no translation is done by default.  The
improved patch also includes consideration of EBCDIC hosts.

As iconv was installed on every GNU/Linux system I've tried, I'm not sure what
is wrong with using the C99 mode present in newer releases.  This achieves
exactly the suggested result of converting all UTF-8 input to UCNs in the
preprocessor while directly allowing other potentially useful conversions. 
Perhaps the configure script should be modified to check for a compatibile
version of iconv and if one is not found resort to a manual conversion.

Testing is still underway.  After the standard regression tests are finished I
will create new tests utf8id-.* which will be versions of the uncid-.* tests
for native utf-8 files.  I will also include a new test for arbitrary 8-bit
string literals, to verify further compatibility.

[Bug c/67224] UTF-8 support for identifier names in GCC

Reply via email to