On Fri, Oct 27, 2023 at 07:05:34PM -0400, Jason Merrill wrote:
> > --- gcc/testsuite/g++.dg/cpp26/literals1.C.jj 2023-08-25 17:23:06.662878355 +0200
> > +++ gcc/testsuite/g++.dg/cpp26/literals1.C 2023-08-25 17:37:03.085132304 +0200
> > @@ -0,0 +1,65 @@
> > +// C++26 P1854R4 - Making non-encodable string literals ill-formed
> > +// { dg-do compile { target c++11 } }
> > +// { dg-require-effective-target int32 }
> > +// { dg-options "-pedantic-errors -finput-charset=UTF-8 -fexec-charset=UTF-8" }
> > +
> > +int d = '😁'; // { dg-error "character too large for character literal type" }
> ...
> > +char16_t m = u'😁'; // { dg-error "character constant too long for its type" }
>
> Why are these different diagnostics? Why doesn't the first line already hit
> the existing diagnostic that the second gets?
>
> Both could be clearer that the problem is that the single source character
> can't be encoded as a single execution character.

The first diagnostic is the one newly added in the patch, which takes
precedence over the existing diagnostic (and wouldn't actually trigger
without the patch).
Sure, I could make the new diagnostic more specific, but all I generally
know is that (str2.len / nbwc) c-chars are encodable in str.len execution
character set code units.

So, would you like 2 different messages, one for str2.len / nbwc == 1
"single character not encodable in a single execution character code unit"
and otherwise
"%d characters need %d execution character code units"
or
"at least one character not encodable in a single execution character code unit"
or something different?

Everything else (i.e. the u8 case in narrow_str_to_charconst and the L, u
and U cases in wide_str_to_charconst) is already covered by the existing
diagnostic, which has the "character constant too long for its type"
wording and covers, for both C and C++, both the case where there is more
than one c-char in the literal (allowed in the L case for < C++23) and the
case where one c-char encodes into more than one code unit (but this time
it isn't the execution character set, but the UTF-8 character set for u8,
the wide execution character set for L, UTF-16 for u and UTF-32 for U).
Plus the same "character constant too long for its type" diagnostic is
emitted if a normal narrow literal has several c-chars, all encodable as
single execution character code units, but more than can fit into int.

So, do you want to change just the new diagnostic (and what is your
preferred wording), or use the old diagnostic's wording also for the new
one, or do you want to change the preexisting diagnostic as well and e.g.
differentiate there between the single c-char cases which need more than
one code unit and use different wording for more than one c-char?

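To make the c-char vs. code unit distinction concrete, here is a rough
standalone sketch. It is not the libcpp code: count_c_chars and
pick_diagnostic are hypothetical names, the execution character set is
assumed to be UTF-8 (as with -fexec-charset=UTF-8 in the test), and the
c-chars are counted by scanning the UTF-8 bytes rather than via
str2.len / nbwc.

```cpp
// Hypothetical illustration only; none of these names exist in libcpp.
#include <cstddef>
#include <string>

// Count the c-chars (code points) in a UTF-8 string by counting every byte
// that is not a continuation byte (10xxxxxx).  Assumes valid UTF-8 input.
std::size_t count_c_chars (const std::string &utf8)
{
  std::size_t n = 0;
  for (unsigned char c : utf8)
    if ((c & 0xC0) != 0x80)
      ++n;
  return n;
}

// Pick one of the wordings floated above for the body of a narrow
// character literal.  Returns an empty string when every c-char fits in
// a single code unit, i.e. when the new diagnostic would not apply.
std::string pick_diagnostic (const std::string &utf8)
{
  std::size_t code_units = utf8.size ();
  std::size_t c_chars = count_c_chars (utf8);
  if (code_units == c_chars)
    return "";
  if (c_chars == 1)
    return "single character not encodable in a single execution "
           "character code unit";
  return std::to_string (c_chars) + " characters need "
         + std::to_string (code_units)
         + " execution character code units";
}
```

With this, the body of '😁' (the bytes "\xF0\x9F\x98\x81") is 1 c-char in
4 code units and gets the "single character" wording, while a two c-char
body such as "a\xC3\xA9" gets the "%d characters need %d" wording.
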
Note, if we differentiate between those, we'd need to count how many
c-chars we have even for the u8, L, u and U cases if we see more than one
code unit, similarly to how the patch does that (and the follow-up patch
tweaks it too).

	Jakub