On Fri, Mar 15, 2024 at 10:20:06AM -0700, Andi Kleen wrote:
> > One issue is the character set question.  The strings in inline asm right
> > now are untranslated, i.e. remain in SOURCE_CHARSET, usually UTF-8 (in
> > theory there is also UTF-EBCDIC support, but nobody knows if it actually
> > works), which is presumably what the assembler expects too.
> > Most of the string literals and character literals constexpr deals with
> > are in the execution character set though.  For static_assert we just assume
> > the user knows what he is doing when trying to emit non-ASCII characters in
> > the message when using say -fexec-charset=EBCDICUS , but should that be the
> > case for inline asm too?  Or should we try to translate those strings back
> > from execution character set to source character set?  Or require that it
> > actually constructs UTF-8 string literals and for the UTF-EBCDIC case
> > translate from UTF-8 to UTF-EBCDIC?  So the user constexpr functions then
> > would return u8"insn"; or construct from u8'i' etc. character literals...
>
> I think keeping it untranslated is fine for now. Any translation
> if really needed would be a separate feature.

I mean, unless you make extra effort, it is translated.
Even in your current version, try
  constexpr const char *foo () { return "nop"; }
and you'll see that it actually results in "\x95\x96\x97" with
-fexec-charset=EBCDICUS.
What is worse,
  constexpr const char *bar () { return "%0 %1"; }
results in "\x6c\xf0\x40\x6c\xf1", so the compiler will not be able to
find the % special characters in there etc.
The parsing of the string literal in asm definitions uses translate=false
to avoid the translations.

As the static_assert paper says, for static_assert it isn't that big a deal:
the program is already UB if it diagnoses a static assertion failure, and in
the worst case it prints garbage if one plays with -fexec-charset=.
But for inline asm it would fail to compile...

So, the extension really should be well defined vs. the character set.
Either it should be characters in the execution charset, and the FE would
need to ask libcpp to translate them back, or it would need to be declared to
be e.g. in UTF-8 regardless of the charset (like u8'x' or u8"abc" literals
are; but then shouldn't the _M_data in that case return a pointer to char8_t
instead?), or something else?

	Jakub
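
For illustration, here is a minimal standalone sketch of why the translated bytes break the operand scanning. It does not use GCC or libcpp internals: `to_ebcdic_us`, `bytes`, `translate` and `count_byte` are hypothetical helpers, and the mapping is a hand-written assumption covering only the handful of characters from the two examples above, not a full EBCDIC-US table. It assumes an ASCII-compatible host character set and C++14 or later.

```cpp
#include <cstddef>

// Hypothetical helper: maps only the characters used in the examples above
// to their EBCDIC-US code points.  Not a complete conversion table.
constexpr unsigned char to_ebcdic_us (char c)
{
  switch (c)
    {
    case ' ': return 0x40;
    case '%': return 0x6c;
    case '0': return 0xf0;
    case '1': return 0xf1;
    case 'n': return 0x95;
    case 'o': return 0x96;
    case 'p': return 0x97;
    default:  return 0x6f;  // EBCDIC '?' for anything outside this sketch
    }
}

// Fixed-size byte buffer usable in constant expressions.
template <std::size_t N>
struct bytes { unsigned char data[N]; };

// Mimics what an execution-charset translation would do to a literal;
// the terminating NUL is left as 0.
template <std::size_t N>
constexpr bytes<N> translate (const char (&s)[N])
{
  bytes<N> out{};
  for (std::size_t i = 0; i + 1 < N; ++i)
    out.data[i] = to_ebcdic_us (s[i]);
  return out;
}

// Counts occurrences of one byte value, excluding the terminator.
template <std::size_t N>
constexpr std::size_t count_byte (const bytes<N> &b, unsigned char needle)
{
  std::size_t n = 0;
  for (std::size_t i = 0; i + 1 < N; ++i)
    if (b.data[i] == needle)
      ++n;
  return n;
}

// "nop" becomes "\x95\x96\x97".
static_assert (translate ("nop").data[0] == 0x95, "");
static_assert (translate ("nop").data[1] == 0x96, "");
static_assert (translate ("nop").data[2] == 0x97, "");

// A scanner looking for the source-charset '%' finds nothing in "%0 %1",
// while the EBCDIC '%' (0x6c) is there twice.
static_assert (count_byte (translate ("%0 %1"), '%') == 0, "");
static_assert (count_byte (translate ("%0 %1"), 0x6c) == 2, "");
```

The last two assertions are the point: code that looks for a source-charset `%` in the translated string finds none, so either the string has to be translated back before it is scanned, or it has to be defined as UTF-8 from the start.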