rsmith added a comment.

I'm overall pretty happy about how clean and non-invasive the changes required 
here are. But please make sure you don't change the encodings of `u8"..."` / 
`u"..."` / `U"..."` literals; those need to stay as UTF-8 / UTF-16 / UTF-32. 
Also, we should have a story for how the wide execution character set is 
controlled -- is it derived from the narrow execution character set, or can the 
two be changed independently, or ...?

We should use the original source form of the string literal when 
pretty-printing a `StringLiteral` or `CharacterLiteral`; there are a bunch of 
UTF-8 assumptions baked into `StmtPrinter` that will need revisiting. And we'll 
need to modify the handful of places that put the contents of `StringLiteral`s 
into diagnostics (`#warning`, `#error`, `static_assert`) and make them use a 
different `ConversionState`, since our assumption is that diagnostic output 
should be in UTF-8.



================
Comment at: clang/include/clang/Lex/LiteralTranslator.h:29-31
+  static llvm::StringRef InternalCharset;
+  static llvm::StringRef SystemCharset;
+  static llvm::StringRef ExecCharset;
----------------
It is not acceptable to use global state for this per-compilation information; 
this will behave badly if multiple independent Clang compilations are performed 
by different threads in the same process, for example.


================
Comment at: clang/include/clang/Lex/LiteralTranslator.h:32
+  static llvm::StringRef ExecCharset;
+  static llvm::StringMap<llvm::CharSetConverter> ExecCharsetTables;
+
----------------
Similarly, use of a global cache here will require you guard it with a mutex.

As an alternative, how about we move all this state to be per-instance state, 
and store an instance of `LiteralTranslator` on the `Preprocessor`?


================
Comment at: clang/lib/Lex/LiteralSupport.cpp:231-239
+  if (Translate && Converter) {
+    // ResultChar is either UTF-8 or ASCII literal and can only be converted
+    // to EBCDIC on z/OS if the character can be represented in one byte.
+    if (ResultChar < 0x100) {
+      SmallString<8> ResultCharConv;
+      Converter->convert(StringRef((char *)&ResultChar), ResultCharConv);
+      void *Pointer = &ResultChar;
----------------
Is it correct, in general, to do character-at-a-time translation here, when 
processing a string literal? I would expect there to be some (stateful) target 
character sets where that's not correct.


================
Comment at: clang/lib/Lex/LiteralSupport.cpp:236
+      SmallString<8> ResultCharConv;
+      Converter->convert(StringRef((char *)&ResultChar), ResultCharConv);
+      void *Pointer = &ResultChar;
----------------
Reinterpreting an `unsigned*` as a `char*` like this is not correct on 
big-endian, and is way too "clever" on little-endian. Please create an actual 
`char` object to hold the value and pass that in instead.


================
Comment at: clang/lib/Lex/LiteralSupport.cpp:238
+      void *Pointer = &ResultChar;
+      memcpy(Pointer, ResultCharConv.data(), sizeof(unsigned));
+    }
----------------
What should happen if the result doesn't fit into an `unsigned`? This also 
appears to be making problematic assumptions about the endianness of the host. 
If we really want to pack multiple bytes of encoded output into a single 
`unsigned` result value (which itself seems dubious), we should do so with an 
endianness that doesn't depend on the host.


================
Comment at: clang/lib/Lex/LiteralSupport.cpp:1321
+  llvm::CharSetConverter *Converter =
+      StringLiteralParser::Translator.getCharConversionTable(TranslationState);
+
----------------
Shouldn't this depend on the kind of literal? We should have no converter for 
UTF8/UTF16/UTF32 literals, should use the wide execution character set for 
`L...` literals, and the narrow execution character set otherwise. (It looks 
like this patch doesn't properly distinguish the narrow and wide execution 
character sets?)


================
Comment at: clang/test/Driver/cl-options.c:214
+// RUN: %clang_cl /execution-charset:iso8859-1 -### -- %s 2>&1 | FileCheck 
-check-prefix=execution-charset-iso8859-1 %s
+// execution-charset-iso8859-1-NOT: invalid value 'iso8859-1' in 
'/execution-charset:iso8859-1'
+
----------------
Checking for "don't produce exactly this one spelling of this one diagnostic" 
is not a useful test; if we started warning on this again, there's a good 
chance the warning would be spelled differently, so your test does not do a 
good job of determining whether the code under test is bad (it passes in most 
bad states as well as in the good state). `...-NOT: error` and `...-NOT: 
warning` would be a bit better, if this is worth testing.


================
Comment at: clang/test/Driver/clang_f_opts.c:213
+// RUN: %clang -### -S -fexec-charset=iso8859-1 -o /dev/null %s 2>&1 | 
FileCheck -check-prefix=CHECK-VALID-INPUT-CHARSET %s
+// CHECK-VALID-INPUT-CHARSET-NOT: error: invalid value 'iso8859-1' in 
'-fexec-charset=iso8859-1'
 
----------------
Again, this is not a useful test.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D93031/new/

https://reviews.llvm.org/D93031

_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits

Reply via email to