On Fri, 2016-07-08 at 17:49 -0400, David Malcolm wrote:

[...]

> Also, this patch currently makes the assumption (in charset.c)
> that there's a 1:1 correspondence between bytes in the source
> character set and bytes in the execution character set.  This can
> be the case if both are, say, UTF-8, but might not hold in
> general.
>
> The source char set is UTF-8 or UTF-EBCDIC, and safe-ctype.c has:
>
>   # if HOST_CHARSET == HOST_CHARSET_EBCDIC
>    #error "FIXME: write tables for EBCDIC"
>
> so presumably we don't actually have any hosts that supports EBCDIC
> (do we?); as far as I can tell, we only currently support UTF-8
> as the source char set.
>
> Similarly, do we support any targets for which the execution
> character set is *not* UTF-8?

I brought this up in this thread on the gcc mailing list:
  "gcc/libcpp: non-UTF-8 source or execution encodings?"
    https://gcc.gnu.org/ml/gcc/2016-07/msg00091.html
and in particular:
    https://gcc.gnu.org/ml/gcc/2016-07/msg00106.html

It's possible to select the execution character set on the command
line for C-family frontends using:
  -fexec-charset=
  -fwide-exec-charset=
e.g. "-fexec-charset=IBM1047" will give one of the variants of EBCDIC.

Given that the internal interface already has a failure mode, I'm
thinking that a reasonable restriction is to only support locations
within string literals for the case where
source character set == execution character set, and hence we have
"convert_no_conversion" as the converter.  In that case the bytes of
the literal pass through unchanged, so the 1:1 byte correspondence
assumed in charset.c holds.

Does that sound sane?  (I can write test coverage for this.)

[...]
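
To make the proposed restriction concrete, here is a minimal,
self-contained sketch of the check.  It is not the patch itself: the
names convert_fn, sketch_no_conversion, sketch_doubling_conversion and
sketch_offset_in_literal are hypothetical stand-ins, and returning an
error string (NULL on success) is only an assumption about what the
existing failure mode could look like.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a libcpp-style converter callback.  */
typedef bool (*convert_fn) (const unsigned char *from, size_t len,
                            unsigned char *to, size_t to_size,
                            size_t *out_len);

/* Identity converter: source and execution charsets are the same,
   so every byte is copied through unchanged.  */
bool
sketch_no_conversion (const unsigned char *from, size_t len,
                      unsigned char *to, size_t to_size, size_t *out_len)
{
  if (len > to_size)
    return false;
  memcpy (to, from, len);
  *out_len = len;
  return true;
}

/* Toy non-identity converter: each source byte becomes two output
   bytes, so byte offsets no longer line up 1:1.  */
bool
sketch_doubling_conversion (const unsigned char *from, size_t len,
                            unsigned char *to, size_t to_size,
                            size_t *out_len)
{
  if (len > to_size / 2)
    return false;
  for (size_t i = 0; i < len; i++)
    {
      to[2 * i] = from[i];
      to[2 * i + 1] = from[i];
    }
  *out_len = 2 * len;
  return true;
}

/* Map a byte offset in the execution-charset string back to a byte
   offset in the source literal.  Returns NULL on success, or an error
   message if the location cannot be determined; *SRC_OFFSET is only
   written on success.  */
const char *
sketch_offset_in_literal (convert_fn converter, size_t exec_offset,
                          size_t literal_len, size_t *src_offset)
{
  assert (converter != NULL && src_offset != NULL);

  /* The proposed restriction: bail out unless the converter is the
     identity, since only then is the byte mapping known to be 1:1.  */
  if (converter != sketch_no_conversion)
    return "execution character set != source character set";

  if (exec_offset >= literal_len)
    return "offset is outside the string literal";

  *src_offset = exec_offset;
  return NULL;
}
```

With this shape, a caller that gets a non-NULL result would simply
fall back to not reporting a location inside the literal, which is
the behavior an "-fexec-charset=IBM1047" build would see.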