[llvm-bugs] [Bug 41536] [clang-cl] incorrectly encodes ordinary string literals containing universal-character-names in UTF-8

via llvm-bugs Fri, 19 Apr 2019 11:45:05 -0700

https://bugs.llvm.org/show_bug.cgi?id=41536


Eli Friedman <efrie...@quicinc.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|RESOLVED                    |REOPENED
                 CC|                            |efrie...@quicinc.com
         Resolution|WONTFIX                     |---

--- Comment #2 from Eli Friedman <efrie...@quicinc.com> ---
There's a real issue here, I think.  Yes, "\U" escapes specify a Unicode
character, but the standard doesn't specify how that character is encoded in
an ordinary string literal (only u/U/u8 string literals have a fixed encoding).

Specifically, the issue here is that clang-cl has a different default from cl
for /execution-charset.
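
To make the difference concrete, here is a minimal sketch (not taken from the
bug report) that dumps the bytes of an ordinary string literal containing a
universal-character-name.  The helper names hex_bytes and
ordinary_literal_bytes are made up for illustration.  With a UTF-8 execution
charset, U+00E9 should be stored as the two bytes c3 a9; a compiler whose
execution charset is Windows-1252 would be expected to store the single byte
e9 instead.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Render a buffer as lowercase hex bytes separated by spaces.
std::string hex_bytes(const char* data, std::size_t len) {
    std::string out;
    char buf[4];
    for (std::size_t i = 0; i < len; ++i) {
        unsigned value = static_cast<unsigned char>(data[i]);
        std::snprintf(buf, sizeof buf, "%02x", value);
        if (i != 0) out += ' ';
        out += buf;
    }
    return out;
}

// U+00E9 (LATIN SMALL LETTER E WITH ACUTE) in an ordinary string literal.
// Which bytes end up here depends on the compiler's execution charset.
const char kOrdinary[] = "\u00E9";

// Bytes of kOrdinary, excluding the terminating NUL.
std::string ordinary_literal_bytes() {
    return hex_bytes(kOrdinary, sizeof kOrdinary - 1);
}
```

Printing ordinary_literal_bytes() from the same source file under each
compiler is one way to see which execution charset was actually applied.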

Clang currently does not support anything equivalent to the MSVC
/execution-charset flag.  It assumes the source and execution charsets are both
UTF-8 (as if the MSVC "/utf-8" flag were passed).  We mostly get away with this
at the moment because most source code is ASCII, and we have a hack to pass
through the raw bytes of string literals even if they aren't valid UTF-8.

It's not clear we would actually want to change the defaults here, but it seems
like a legitimate request to provide the option to specify /execution-charset
and /source-charset.

Implementing /execution-charset and /source-charset would probably be a
substantial project.  There isn't anything fundamentally tricky; for any
ASCII-compatible encoding, it's basically just a matter of translating string
literals and identifiers correctly.  (We generally don't need to translate
comments, and non-ASCII characters aren't legal anywhere else.)  But LLVM
currently doesn't have any support for translating from Unicode to non-Unicode
charsets, so adding it is likely to spark a complicated debate over how to
perform that translation.
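
As a rough illustration of what such a translation step involves, the sketch
below converts UTF-8 to ISO-8859-1, chosen only because its mapping is trivial
(code points up to U+00FF map to the byte of the same value).  This is not
LLVM code; decode_utf8 and utf8_to_latin1 are hypothetical names, and the
choice to fail on unrepresentable characters rather than substitute a
replacement is an assumption, and exactly the kind of policy question that
would need to be settled.  It requires C++17 for std::optional.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>

// Decode one UTF-8 sequence starting at s[i] and advance i past it.
// Returns nullopt on truncated or malformed input.  For brevity this sketch
// does not reject overlong forms or surrogate code points.
std::optional<std::uint32_t> decode_utf8(const std::string& s, std::size_t& i) {
    unsigned char b0 = static_cast<unsigned char>(s[i]);
    std::uint32_t cp = 0;
    std::size_t extra = 0;
    if (b0 < 0x80)                { cp = b0;        extra = 0; }
    else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; extra = 1; }
    else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; extra = 2; }
    else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; extra = 3; }
    else return std::nullopt;                        // stray continuation byte
    if (i + extra >= s.size()) return std::nullopt;  // truncated sequence
    for (std::size_t k = 1; k <= extra; ++k) {
        unsigned char b = static_cast<unsigned char>(s[i + k]);
        if ((b & 0xC0) != 0x80) return std::nullopt; // not a continuation byte
        cp = (cp << 6) | (b & 0x3F);
    }
    i += extra + 1;
    return cp;
}

// Translate UTF-8 text to ISO-8859-1.  Returns nullopt if the input is
// malformed or contains a code point above U+00FF.
std::optional<std::string> utf8_to_latin1(const std::string& utf8) {
    std::string out;
    for (std::size_t i = 0; i < utf8.size();) {
        std::optional<std::uint32_t> cp = decode_utf8(utf8, i);
        if (!cp || *cp > 0xFF) return std::nullopt;
        out += static_cast<char>(static_cast<unsigned char>(*cp));
    }
    return out;
}
```

A real implementation would need mapping tables for each supported charset
and a decision on whether unmappable characters are an error, a warning with
a substitute byte, or something else.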

See also bug 39864.

-- 
You are receiving this mail because:
You are on the CC list for the bug.

_______________________________________________
llvm-bugs mailing list
llvm-bugs@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-bugs
