rsmith added a comment. I'm overall pretty happy about how clean and non-invasive the changes required here are. But please make sure you don't change the encodings of `u8"..."` / `u"..."` / `U"..."` literals; those need to stay as UTF-8 / UTF-16 / UTF-32. Also, we should have a story for how the wide execution character set is controlled -- is it derived from the narrow execution character set, or can the two be changed independently, or ...?
We should use the original source form of the string literal when pretty-printing a `StringLiteral` or `CharacterLiteral`; there are a bunch of UTF-8 assumptions baked into `StmtPrinter` that will need revisiting. And we'll need to modify the handful of places that put the contents of `StringLiteral`s into diagnostics (`#warning`, `#error`, `static_assert`) and make them use a different `ConversionState`, since our assumption is that diagnostic output should be in UTF-8. ================ Comment at: clang/include/clang/Lex/LiteralTranslator.h:29-31 + static llvm::StringRef InternalCharset; + static llvm::StringRef SystemCharset; + static llvm::StringRef ExecCharset; ---------------- It is not acceptable to use global state for this per-compilation information; this will behave badly if multiple independent Clang compilations are performed by different threads in the same process, for example. ================ Comment at: clang/include/clang/Lex/LiteralTranslator.h:32 + static llvm::StringRef ExecCharset; + static llvm::StringMap<llvm::CharSetConverter> ExecCharsetTables; + ---------------- Similarly, use of a global cache here will require you guard it with a mutex. As an alternative, how about we move all this state to be per-instance state, and store an instance of `LiteralTranslator` on the `Preprocessor`? ================ Comment at: clang/lib/Lex/LiteralSupport.cpp:231-239 + if (Translate && Converter) { + // ResultChar is either UTF-8 or ASCII literal and can only be converted + // to EBCDIC on z/OS if the character can be represented in one byte. + if (ResultChar < 0x100) { + SmallString<8> ResultCharConv; + Converter->convert(StringRef((char *)&ResultChar), ResultCharConv); + void *Pointer = &ResultChar; ---------------- Is it correct, in general, to do character-at-a-time translation here, when processing a string literal? I would expect there to be some (stateful) target character sets where that's not correct. ================ Comment at: clang/lib/Lex/LiteralSupport.cpp:236 + SmallString<8> ResultCharConv; + Converter->convert(StringRef((char *)&ResultChar), ResultCharConv); + void *Pointer = &ResultChar; ---------------- Reinterpreting an `unsigned*` as a `char*` like this is not correct on big-endian, and is way too "clever" on little-endian. Please create an actual `char` object to hold the value and pass that in instead. ================ Comment at: clang/lib/Lex/LiteralSupport.cpp:238 + void *Pointer = &ResultChar; + memcpy(Pointer, ResultCharConv.data(), sizeof(unsigned)); + } ---------------- What should happen if the result doesn't fit into an `unsigned`? This also appears to be making problematic assumptions about the endianness of the host. If we really want to pack multiple bytes of encoded output into a single `unsigned` result value (which itself seems dubious), we should do so with an endianness that doesn't depend on the host. ================ Comment at: clang/lib/Lex/LiteralSupport.cpp:1321 + llvm::CharSetConverter *Converter = + StringLiteralParser::Translator.getCharConversionTable(TranslationState); + ---------------- Shouldn't this depend on the kind of literal? We should have no converter for UTF8/UTF16/UTF32 literals, should use the wide execution character set for `L...` literals, and the narrow execution character set otherwise. (It looks like this patch doesn't properly distinguish the narrow and wide execution character sets?) ================ Comment at: clang/test/Driver/cl-options.c:214 +// RUN: %clang_cl /execution-charset:iso8859-1 -### -- %s 2>&1 | FileCheck -check-prefix=execution-charset-iso8859-1 %s +// execution-charset-iso8859-1-NOT: invalid value 'iso8859-1' in '/execution-charset:iso8859-1' + ---------------- Checking for "don't produce exactly this one spelling of this one diagnostic" is not a useful test; if we started warning on this again, there's a good chance the warning would be spelled differently, so your test does not do a good job of determining whether the code under test is bad (it passes in most bad states as well as in the good state). `...-NOT: error` and `...-NOT: warning` would be a bit better, if this is worth testing. ================ Comment at: clang/test/Driver/clang_f_opts.c:213 +// RUN: %clang -### -S -fexec-charset=iso8859-1 -o /dev/null %s 2>&1 | FileCheck -check-prefix=CHECK-VALID-INPUT-CHARSET %s +// CHECK-VALID-INPUT-CHARSET-NOT: error: invalid value 'iso8859-1' in '-fexec-charset=iso8859-1' ---------------- Again, this is not a useful test. Repository: rG LLVM Github Monorepo CHANGES SINCE LAST ACTION https://reviews.llvm.org/D93031/new/ https://reviews.llvm.org/D93031 _______________________________________________ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits