> I think that CPP should try to determine the encoding for each file
> and not use a single encoding for every file. It should look for
> a unicode header when it opens a file (original C source or any
> include), and if it doesn't find one, use the default: -finput-charset,
> LC_CTYPE, UTF-8, until it's done processing that file. Note that
> vim reads files saved with unicode headers without problem.
This is a desired feature, but one that no one has ever had time to implement. If you implement it, I can critique it until it is ready for inclusion. [Editors that put a BOM on files in UTF-8 are in error, but it is a common error, so it should be accepted gracefully. And, of course, the BOM is supposed to be there on a UTF-16 or UTF-32 file.]

Note that GCC should not be limited to looking for the Unicode "byte order mark". It should recognize and handle all other reasonable in-band annotations of the file encoding. Examples are Emacs' -*- marker in a comment on the first line and (rather more complicated) the "Local Variables:" marker near the end of the file; other editors have similar, but of course incompatible, conventions (I know Vim has one, but I don't know what it looks like).

It would also be good to take advantage of the fact that 95+% of C source files start with "/*", "//", "#i", or "#d" to distinguish ASCII from EBCDIC. (This is in fact necessary in order to have any hope of detecting and processing an editor's code page marker in an EBCDIC source file.)

You should have read and fully understood the long comment near the top of libcpp/charset.c, and the sections of the C standard that it refers to, before you attempt to code this.

It may be necessary to import GNU iconv into the source tree in order to gain reliable handling of non-Unicode encodings. This should not be hard, but it has to be run by the steering committee.

zw