[PATCH] D100346: [Clang] String Literal and Wide String Literal Encoding from the Preprocessor

ThePhD via Phabricator via cfe-commits Mon, 12 Apr 2021 15:05:01 -0700

ThePhD created this revision.
ThePhD added a reviewer: aaron.ballman.
ThePhD added a project: clang.
ThePhD requested review of this revision.
Herald added a subscriber: cfe-commits.


**//Short version://**

Please let us know the encoding that the compiler chooses for its 
implementation-defined, non-Unicode string literals so we can support users 
properly in cross-platform code.

**//Prior Art://**

A similar feature has already been patch-reviewed and merged into GCC trunk (I 
implemented it), ready to go out the door with GCC 11. It is compiler-specific, 
and that is intentional. It solved a user's bug report there.

I also put in a Feature Request for MSVC, which is likewise recommended to be 
compiler-specific. They are currently suffering the very interesting 
consequences of not handling it sooner: standard library developers have to 
come up with library-based workarounds (and pray) to determine the charset of 
their string literals for their std::format implementation, rather than 
relying on a compiler macro: https://github.com/microsoft/STL/pull/1824 | 
https://github.com/microsoft/STL/issues/1576 | 
https://developercommunity.visualstudio.com/content/idea/1160821/-compiler-feature-macro-for-narrow-literal-foo-enc.html

Study Group 16 (Unicode) of the C++ Standards Committee approved a paper, 
currently undergoing LEWG review, for determining the string literal and wide 
string literal encoding at both compile-time and runtime. This patch prepares 
for the compile-time portion of that detection, for which Corentin Jabot has 
already created a proof of concept for Clang, GCC and MSVC: 
https://wg21.link/p1885

I missed the 12 release, so I hope this makes it for 13.

**//Long version://**

C and C++ string literals, both "narrow"/"multibyte" (e.g. "foo") and "wide" 
(e.g. L"foo"), have an associated encoding defined by the implementation. 
Recently, a review kicked off for adding new "execution encodings" (i.e., 
string literal encodings) to Clang's preprocessor and, subsequently, its C and 
C++ frontends.

I left a comment there asking for this to be taken care of, but I'm certain it 
was drowned out by other contributions in both the -fexec-charset addition and 
the "Add support for iconv encodings and other things" patch reviews at:

iconv Literal Converter: https://reviews.llvm.org/D88741
-fexec-charset Enabling Patch: https://reviews.llvm.org/D93031

Whether or not this gets updated in the related (but not required) patches, 
this is necessary to successfully inform the end user on a Clang machine what 
the wide string literal and narrow string literal encodings are.
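
As a rough illustration of how user code might consume these macros, here is a 
sketch only. The `MYLIB_*` names and the `same_text` helper are made up for 
this example, `__GNUC_EXECUTION_CHARSET_NAME` and 
`__GNUC_WIDE_EXECUTION_CHARSET_NAME` are the GCC 11 counterparts, and the 
"unknown" fallback is an assumption for compilers that define neither:

```cpp
// Hypothetical wrapper macros; the MYLIB_* names are illustrative only.
#if defined(__clang_literal_encoding__)
#  define MYLIB_NARROW_LITERAL_ENCODING __clang_literal_encoding__
#  define MYLIB_WIDE_LITERAL_ENCODING __clang_wide_literal_encoding__
#elif defined(__GNUC_EXECUTION_CHARSET_NAME)
#  define MYLIB_NARROW_LITERAL_ENCODING __GNUC_EXECUTION_CHARSET_NAME
#  define MYLIB_WIDE_LITERAL_ENCODING __GNUC_WIDE_EXECUTION_CHARSET_NAME
#else
// Assumption: no compiler support, so the library must fall back to guessing.
#  define MYLIB_NARROW_LITERAL_ENCODING "unknown"
#  define MYLIB_WIDE_LITERAL_ENCODING "unknown"
#endif

// Compile-time comparison of two NUL-terminated strings (C++11 compatible).
constexpr bool same_text(const char* a, const char* b) {
  return *a == *b && (*a == '\0' || same_text(a + 1, b + 1));
}

// True only when the compiler reports UTF-8 for narrow string literals.
constexpr bool narrow_literals_are_utf8() {
  return same_text(MYLIB_NARROW_LITERAL_ENCODING, "UTF-8");
}
```

With this patch applied, `narrow_literals_are_utf8()` would be true on Clang, 
since the narrow macro is unconditionally defined as "UTF-8". A library could 
then select a transcoding path with `static_assert` or `if constexpr` instead 
of probing at runtime.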

We use the size of the wide character type (`wchar_t`) to inform our decision: 
a `wchar_t` of 32 bits or more is reported as UTF-32 and anything narrower as 
UTF-16, since Windows and old-style 32-bit IBM machines use UTF-16 while most 
Linux distributions use UTF-32. (This is not the case for IBM and other 
machines of specific make in China, Japan, and Korea, but I suspect Clang has 
not been ported to work on such machines.)
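
The width check described above amounts to roughly the following. This is 
written as a standalone function for illustration rather than the actual 
`InitPreprocessor.cpp` code, and the names `guess_wide_literal_encoding` and 
`host_wchar_bits` are hypothetical:

```cpp
#include <climits>

// Hypothetical helper mirroring the heuristic in this patch:
// 32 bits or wider is reported as UTF-32, anything narrower as UTF-16.
constexpr const char* guess_wide_literal_encoding(unsigned wchar_width_bits) {
  return wchar_width_bits >= 32 ? "UTF-32" : "UTF-16";
}

// Width of wchar_t on the platform compiling this snippet. The patch itself
// queries the *target's* wchar_t width via TargetInfo, not the host's.
constexpr unsigned host_wchar_bits = sizeof(wchar_t) * CHAR_BIT;

constexpr const char* host_wide_guess =
    guess_wide_literal_encoding(host_wchar_bits);
```

Note that this is only a heuristic: it says nothing about targets whose wide 
encoding is neither UTF-16 nor UTF-32, which is why the result may need to 
change if `-fwide-exec-charset=` is ever supported.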

Knowing the literal and execution encodings is also of great importance to the 
C++ Standard Committee in general, as they have work coming down the pipeline 
that has been generally approved by SG16 and favorably reviewed by LEWG that 
will make use of such functionality soon, as mentioned in the Prior Art section 
above: https://wg21.link/p1885

Please consider making the lives of everyone who cares about portable 
encodings easier, and please consider keeping these macros in sync with the 
work on `-fexec-charset` and `-fwide-exec-charset`.


Repository:
  rG LLVM Github Monorepo

https://reviews.llvm.org/D100346

Files:
  clang/lib/Frontend/InitPreprocessor.cpp
  clang/test/Preprocessor/init.c


Index: clang/test/Preprocessor/init.c
===================================================================
--- clang/test/Preprocessor/init.c
+++ clang/test/Preprocessor/init.c
@@ -119,6 +119,8 @@
 // COMMON:#define __clang_minor__ {{[0-9]+}}
 // COMMON:#define __clang_patchlevel__ {{[0-9]+}}
 // COMMON:#define __clang_version__ {{.*}}
+// COMMON:#define __clang_literal_encoding__ {{.*}}
+// COMMON:#define __clang_wide_literal_encoding__ {{.*}}
 // COMMON:#define __llvm__ 1
 //
 // RUN: %clang_cc1 -E -dM -triple=x86_64-pc-win32 < /dev/null | FileCheck -match-full-lines -check-prefix C-DEFAULT %s
Index: clang/lib/Frontend/InitPreprocessor.cpp
===================================================================
--- clang/lib/Frontend/InitPreprocessor.cpp
+++ clang/lib/Frontend/InitPreprocessor.cpp
@@ -778,6 +778,20 @@
     }
   }
 
+  // Macros to help identify the narrow and wide literal encodings.
+  // NOTE: clang currently ignores -fexec-charset=. If this changes,
+  // then these definitions may need to change.
+  Builder.defineMacro("__clang_literal_encoding__", "\"UTF-8\"");
+  if (TI.getTypeWidth(TI.getWCharType()) >= 32) {
+    // NOTE: 32-bit wchar_t signals UTF-32. This may change if 
+    // -fwide-exec-charset= is ever supported.
+    Builder.defineMacro("__clang_wide_literal_encoding__", "\"UTF-32\"");
+  } else {
+    // NOTE: Less-than 32-bit wchar_t generally means UTF-16 (e.g., Windows, 32-bit IBM).
+    // This may change if -fwide-exec-charset= is ever supported.
+    Builder.defineMacro("__clang_wide_literal_encoding__", "\"UTF-16\"");
+  }
+
   if (LangOpts.Optimize)
     Builder.defineMacro("__OPTIMIZE__");
   if (LangOpts.OptimizeSize)

_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
