> That's simple, but how would you deal with the fact that
> Unicode has multiple representations of what people would usually
> regard as equivalent? To enable UTF-8 identifiers, that has
> to be taken care of by gcc and linker (if gcc doesn't do a
> compile-time normalization).

I'd say you wouldn't :) Just accept a null-terminated string of
non-"/" bytes for filenames, and accept any ALPHANUMERIC, "_" or
HIGH_ASCII (any byte with the high bit set) for identifiers. No
normalization, no processing, not even proper utf-8 validation. The
programmer may of course choose to use proper utf-8 and some
normalization form as a convention, but I see no need to enforce it
in the compiler.

> Is there anything mentioned about this in SUS?

I'm sorry, what is SUS?

> > Text strings and comments already work fine with utf-8. Just
> > identifiers dont. I think even a "use at your own risk" command
> > line switch, such as "--allow-high-ascii" would be a huge step
> > forward.
>
> Why would you use such a 'legacy-sounding' option name? I'd use
> '--allow-utf8-names'.

It is legacy-sounding because I would rather have it be the default.
It's more appropriate as well: the compiler wouldn't have to know
anything about utf-8 in this case, it just knows that there is a set
of bytes which don't cause any problems. This is, I think, a large
part of what utf-8 was originally designed for.

Normalization, imo, is more for UI/security issues, like DNS lookups,
etc. Besides, if you were to come across some source code with tons of
overlong utf-8 sequences or non-normalized glyphs, that would raise
some eyebrows at least. (No need to have gcc bend over backwards to
normalize the stuff.)

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
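
For illustration, the byte-level identifier rule described above could
be sketched in C roughly as follows. This is only a sketch: the names
is_ident_byte and is_identifier are made up for the example and are not
gcc internals, and rejecting a leading digit is an assumption carried
over from the usual C identifier rule rather than something stated in
the post.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: may byte c appear in an identifier?
 * Accepts ASCII letters, '_', digits (except in first position, which is
 * an assumption borrowed from C), and any byte with the high bit set.
 * No UTF-8 decoding, validation or normalization is attempted. */
bool is_ident_byte(unsigned char c, bool first)
{
    if (c >= 0x80)
        return true;                      /* "HIGH_ASCII": 0x80..0xFF */
    if (c == '_')
        return true;
    if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z'))
        return true;
    if (c >= '0' && c <= '9')
        return !first;
    return false;
}

/* Hypothetical helper: is the null-terminated string s a non-empty
 * identifier under the rule above? */
bool is_identifier(const char *s)
{
    assert(s != NULL);
    if (s[0] == '\0')
        return false;
    for (size_t i = 0; s[i] != '\0'; i++) {
        if (!is_ident_byte((unsigned char)s[i], i == 0))
            return false;
    }
    return true;
}
```

Note that a check like this deliberately lets malformed or overlong
sequences through (for example the two bytes 0xC0 0xAF), and that two
differently normalized spellings of the same visible name remain
distinct identifiers, since they are compared byte for byte.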
