[PATCH] D155610: [Clang][Sema] Fix display of characters on static assertion failure

Corentin Jabot via Phabricator via cfe-commits Sat, 16 Sep 2023 04:09:06 -0700

cor3ntin added inline comments.


================
Comment at: clang/test/SemaCXX/static-assert-cxx26.cpp:304
+static_assert('\u{9}' == (char)1, ""); // expected-error {{failed}} \
+                                       // expected-note {{evaluates to ''\t' 
(0x09, 9) == '<U+0001>' (0x01, 1)'}}
+static_assert((char8_t)-128 == (char8_t)-123, ""); // expected-error 
{{failed}} \
----------------
tahonermann wrote:
> cor3ntin wrote:
> > tahonermann wrote:
> > > cor3ntin wrote:
> > > > tahonermann wrote:
> > > > > Is the expected note up to date? I don't see code that would generate 
> > > > > the `<U+0001>` output. Am I just missing it? Since U+0001 is a valid, 
> > > > > though non-printable, character, I would expect more `'\u0001'`.
> > > > See elsewhere in the discussion. this formating is pre existing and 
> > > > managed at the DiagnosticEngine level (pushEscapedString). the reason 
> > > > it's not `\u0001` is 1/ to avoid  reusing c++ syntactic elements for 
> > > > something that comes from diagnostics and is not represented as an 
> > > > escaped sequence in source 2/ `\u00011` is unreadable, and 
> > > > `\U000000001` is also not helpful :)
> > > > 
> > > Thanks for the explanation. I'm not sure that I agree with the rationale 
> > > for (1) though. We're already putting the value in single quotes and 
> > > representing some values with escapes in many of these cases when the 
> > > value isn't produced by an escape sequence (or even a character/string 
> > > literal); why exclude `\uXXXX`? I agree with the rationale for (2); we 
> > > could use `'\u{1}'` in that case.
> > FYI afaik the notation in clang predates the existence of \u{} by a few 
> > years, and follow Unicode notation 
> > (https://unicode.org/mail-arch/unicode-ml/y2005-m11/0060.html).
> > Oldest instance seems to be 
> > https://github.com/llvm/llvm-project/commit/77091b167fd959e1ee0c4dad4ec44de43b6c95db
> >  - i followed suite when reworking the generic escaping mechanism all 
> > string fed to diagnostics go through.
> > 
> > I don't care about changing the syntax, but i do hope we are consistent. 
> > Ultimately what we are trying to do is to designate a unicode codepoint and 
> > whether we do it through C++ syntax or not probably does not matter much as 
> > long as it's clear, delimited and consistent!
> I think the substitution of `<U+XXXX>` by the diagnostic engine itself is 
> perfectly fine and good; particularly when it has no context to suggest a 
> different presentation. In this particular case, where the character is being 
> presented using C++ syntax as a character literal, I would prefer that C++ 
> syntax be used consistently.
> 
> From an implementation standpoint, I'm suggesting that 
> `WriteCharValueForDiagnostic()` be modified such that, if 
> `escapeCStyle<EscapeChar::Single>()` returns an empty string, that the 
> character be presented in `'\u{XXXX}'` form if the character is one that 
> would otherwise be substituted by the diagnostic engine (e.g., if 
> `isPrintable()` is false). Note that this would be restricted to `char` 
> values <= 0x7F; larger values could still be passed through as invalid code 
> units that the diagnostic engine would then render as, e.g., `'<FC>'`.
We should take a decision before forcing the author to do further change as to 
avoid going in circle.
As a user `\u{XXXX}` vs `<U+XXXX>` makes no difference in terms of the amount 
of information i receive.

I'm really not fan of duplicating code, spreading the logic in multiple places 
and having multiple ways to render a an invalid `char`. But I'm also concerned 
about spending too much time on `char` literals in `static_assert`. It might 
not be a common enough use case to warrant that much scrutiny :)

So I would be happy to go with _any_ direction


================
Comment at: clang/test/SemaCXX/static-assert.cpp:287
+  static_assert((char16_t)L'ゆ' == L"C̵̭̯̠̎͌ͅť̺"[1], ""); // expected-error 
{{failed}} \
+                                                  // expected-note {{evaluates 
to ''ゆ' (0x3086) == '̵' (0x335)'}}
+  static_assert(L"＼／"[1] == u'\xFFFD', ""); // expected-error {{failed}} \
----------------
hubert.reinterpretcast wrote:
> tahonermann wrote:
> > cor3ntin wrote:
> > > hubert.reinterpretcast wrote:
> > > > hubert.reinterpretcast wrote:
> > > > > cor3ntin wrote:
> > > > > > hubert.reinterpretcast wrote:
> > > > > > > The C++23 escaped string formatting facility would not generate a 
> > > > > > > trailing combining character like this. I recommend following 
> > > > > > > suit.
> > > > > > > 
> > > > > > > Info on U+0335: 
> > > > > > > https://util.unicode.org/UnicodeJsps/character.jsp?a=0335
> > > > > > > 
> > > > > > This is way outside the scope of the patch. The diagnostic output 
> > > > > > facility has no understanding of combining characters or graphemes 
> > > > > > and do not attempt to match std::print. It probably would be an 
> > > > > > improvement but this patch is not trying to modify how all 
> > > > > > diagnostics are printed. (all of that logic is in Diagnostic.cpp)
> > > > > This patch is pushing the envelope of what appears in diagnostics. 
> > > > > One can also argue that someone writing
> > > > > ```
> > > > > static_assert(false, "\u0301");
> > > > > ```
> > > > > gets what they deserve, but that case does not have a big problem 
> > > > > anyway (because the provided message text appears after `: `).
> > > > > 
> > > > > This patch increases the exposure of the diagnostic output facility 
> > > > > to input that it does not handle well. I disagree that it is outside 
> > > > > the scope of this patch to insist that it does not generate such 
> > > > > inputs to the diagnostic output facility (even if a possible solution 
> > > > > is to modify the diagnostic output facility first).
> > > > @cor3ntin, do you have status quo examples for how grapheme-extending 
> > > > characters that are not already "problematic" in their original context 
> > > > are emitted in diagnostics in contexts where they are?
> > > Are you looking for that sort of examples? https://godbolt.org/z/c79xWr7Me
> > > That shows that clang has no understanding of graphemes
> > gcc and MSVC get that case "right" (probably by accident). 
> > https://godbolt.org/z/Tjd6xnEon
> I was more looking for cases where the output grapheme includes elements that 
> were part of the fixed message text (gobbling quotes, etc.). Also, this patch 
> is in the wrong shape for handling this concern in the diagnostic printer 
> because the delimiting of the replacement text happens in this patch.
Could you craft a message that becomes a graphene after substitution by the 
engine? Maybe? You would have to try very hard and `static_assert` diagnostics 
are not of the right shape.

> This patch increases the exposure of the diagnostic output facility to input 
> that it does not handle well. I disagree that it is outside the scope of this 
> patch to insist that it does not generate such inputs to the diagnostic 
> output facility (even if a possible solution is to modify the diagnostic 
> output facility first).

I still don't see it. None of the output produce in this patch are even close 
to what i could be problematic. ie this patch is only ever producing ASCII or 
single codepoints that gets escaped when they are not printable


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D155610/new/

https://reviews.llvm.org/D155610

_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits

[PATCH] D155610: [Clang][Sema] Fix display of characters on static assertion failure

Reply via email to