cor3ntin added inline comments.
================ Comment at: clang/test/SemaCXX/static-assert-cxx26.cpp:304 +static_assert('\u{9}' == (char)1, ""); // expected-error {{failed}} \ + // expected-note {{evaluates to ''\t' (0x09, 9) == '<U+0001>' (0x01, 1)'}} +static_assert((char8_t)-128 == (char8_t)-123, ""); // expected-error {{failed}} \ ---------------- tahonermann wrote: > cor3ntin wrote: > > tahonermann wrote: > > > cor3ntin wrote: > > > > tahonermann wrote: > > > > > Is the expected note up to date? I don't see code that would generate > > > > > the `<U+0001>` output. Am I just missing it? Since U+0001 is a valid, > > > > > though non-printable, character, I would expect more `'\u0001'`. > > > > See elsewhere in the discussion. this formating is pre existing and > > > > managed at the DiagnosticEngine level (pushEscapedString). the reason > > > > it's not `\u0001` is 1/ to avoid reusing c++ syntactic elements for > > > > something that comes from diagnostics and is not represented as an > > > > escaped sequence in source 2/ `\u00011` is unreadable, and > > > > `\U000000001` is also not helpful :) > > > > > > > Thanks for the explanation. I'm not sure that I agree with the rationale > > > for (1) though. We're already putting the value in single quotes and > > > representing some values with escapes in many of these cases when the > > > value isn't produced by an escape sequence (or even a character/string > > > literal); why exclude `\uXXXX`? I agree with the rationale for (2); we > > > could use `'\u{1}'` in that case. > > FYI afaik the notation in clang predates the existence of \u{} by a few > > years, and follow Unicode notation > > (https://unicode.org/mail-arch/unicode-ml/y2005-m11/0060.html). > > Oldest instance seems to be > > https://github.com/llvm/llvm-project/commit/77091b167fd959e1ee0c4dad4ec44de43b6c95db > > - i followed suite when reworking the generic escaping mechanism all > > string fed to diagnostics go through. > > > > I don't care about changing the syntax, but i do hope we are consistent. > > Ultimately what we are trying to do is to designate a unicode codepoint and > > whether we do it through C++ syntax or not probably does not matter much as > > long as it's clear, delimited and consistent! > I think the substitution of `<U+XXXX>` by the diagnostic engine itself is > perfectly fine and good; particularly when it has no context to suggest a > different presentation. In this particular case, where the character is being > presented using C++ syntax as a character literal, I would prefer that C++ > syntax be used consistently. > > From an implementation standpoint, I'm suggesting that > `WriteCharValueForDiagnostic()` be modified such that, if > `escapeCStyle<EscapeChar::Single>()` returns an empty string, that the > character be presented in `'\u{XXXX}'` form if the character is one that > would otherwise be substituted by the diagnostic engine (e.g., if > `isPrintable()` is false). Note that this would be restricted to `char` > values <= 0x7F; larger values could still be passed through as invalid code > units that the diagnostic engine would then render as, e.g., `'<FC>'`. We should take a decision before forcing the author to do further change as to avoid going in circle. As a user `\u{XXXX}` vs `<U+XXXX>` makes no difference in terms of the amount of information i receive. I'm really not fan of duplicating code, spreading the logic in multiple places and having multiple ways to render a an invalid `char`. But I'm also concerned about spending too much time on `char` literals in `static_assert`. It might not be a common enough use case to warrant that much scrutiny :) So I would be happy to go with _any_ direction ================ Comment at: clang/test/SemaCXX/static-assert.cpp:287 + static_assert((char16_t)L'ゆ' == L"C̵̭̯̠̎͌ͅť̺"[1], ""); // expected-error {{failed}} \ + // expected-note {{evaluates to ''ゆ' (0x3086) == '̵' (0x335)'}} + static_assert(L"\/"[1] == u'\xFFFD', ""); // expected-error {{failed}} \ ---------------- hubert.reinterpretcast wrote: > tahonermann wrote: > > cor3ntin wrote: > > > hubert.reinterpretcast wrote: > > > > hubert.reinterpretcast wrote: > > > > > cor3ntin wrote: > > > > > > hubert.reinterpretcast wrote: > > > > > > > The C++23 escaped string formatting facility would not generate a > > > > > > > trailing combining character like this. I recommend following > > > > > > > suit. > > > > > > > > > > > > > > Info on U+0335: > > > > > > > https://util.unicode.org/UnicodeJsps/character.jsp?a=0335 > > > > > > > > > > > > > This is way outside the scope of the patch. The diagnostic output > > > > > > facility has no understanding of combining characters or graphemes > > > > > > and do not attempt to match std::print. It probably would be an > > > > > > improvement but this patch is not trying to modify how all > > > > > > diagnostics are printed. (all of that logic is in Diagnostic.cpp) > > > > > This patch is pushing the envelope of what appears in diagnostics. > > > > > One can also argue that someone writing > > > > > ``` > > > > > static_assert(false, "\u0301"); > > > > > ``` > > > > > gets what they deserve, but that case does not have a big problem > > > > > anyway (because the provided message text appears after `: `). > > > > > > > > > > This patch increases the exposure of the diagnostic output facility > > > > > to input that it does not handle well. I disagree that it is outside > > > > > the scope of this patch to insist that it does not generate such > > > > > inputs to the diagnostic output facility (even if a possible solution > > > > > is to modify the diagnostic output facility first). > > > > @cor3ntin, do you have status quo examples for how grapheme-extending > > > > characters that are not already "problematic" in their original context > > > > are emitted in diagnostics in contexts where they are? > > > Are you looking for that sort of examples? https://godbolt.org/z/c79xWr7Me > > > That shows that clang has no understanding of graphemes > > gcc and MSVC get that case "right" (probably by accident). > > https://godbolt.org/z/Tjd6xnEon > I was more looking for cases where the output grapheme includes elements that > were part of the fixed message text (gobbling quotes, etc.). Also, this patch > is in the wrong shape for handling this concern in the diagnostic printer > because the delimiting of the replacement text happens in this patch. Could you craft a message that becomes a graphene after substitution by the engine? Maybe? You would have to try very hard and `static_assert` diagnostics are not of the right shape. > This patch increases the exposure of the diagnostic output facility to input > that it does not handle well. I disagree that it is outside the scope of this > patch to insist that it does not generate such inputs to the diagnostic > output facility (even if a possible solution is to modify the diagnostic > output facility first). I still don't see it. None of the output produce in this patch are even close to what i could be problematic. ie this patch is only ever producing ASCII or single codepoints that gets escaped when they are not printable CHANGES SINCE LAST ACTION https://reviews.llvm.org/D155610/new/ https://reviews.llvm.org/D155610 _______________________________________________ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits