[PATCH] D128059: [Clang] Add a warning on invalid UTF-8 in comments.

Aaron Ballman via Phabricator via cfe-commits Wed, 22 Jun 2022 12:06:37 -0700

aaron.ballman added inline comments.


================
Comment at: clang/lib/Lex/Lexer.cpp:2717
+        __m128i Bytes =
+            _mm_loadu_si128(reinterpret_cast<const __m128i *>(CurPtr));
+        int Mask = _mm_movemask_epi8(Bytes);
----------------
cor3ntin wrote:
> This crashes when using `_mm_load_si128` which suprises me because `CurPtr` 
> is supposedly aligned on a 16 bytes boundary here. Any idea?
Wait, did you verify that `CurPtr` really is on a 16-byte boundary, or are you 
thinking it should be on such a boundary? (I don't see any alignment markings 
on the parameter, so I'd assume it's aligned as any other pointer.)


================
Comment at: clang/lib/Lex/Lexer.cpp:2405-2406
     // Skip over characters in the fast loop.
-    while (C != 0 &&                // Potentially EOF.
-           C != '\n' && C != '\r')  // Newline or DOS-style newline.
+    // Warn on invalid UTF-8 if the corresponding warning is enabled, emitting 
a
+    // diagnostic only once per sequence that cannot be decoded.
+    while ((!WarnOnInvalidUtf8 || isASCII(C)) && C != 0 && // Potentially EOF.
----------------
tahonermann wrote:
> cor3ntin wrote:
> > aaron.ballman wrote:
> > > tahonermann wrote:
> > > > I think it would be helpful to include a link to [[ 
> > > > http://unicode.org/review/pr-121.html | Unicode PR issue 121 ]] here 
> > > > and note that the diagnostic follows policy option 1. Likewise for 
> > > > handling of '//' comment styles below. Alternatively, provide the link 
> > > > and a note in the release notes.
> > > +1 -- I hadn't realized there were other options. However, ultimately I 
> > > think this should be committed when WG21 has adopted the paper and then 
> > > it would be even better to have a comment citing which stable name and 
> > > paragraph number the code is implementing from P2295.
> > I don't think this applies as we are not inserting replacement characters, 
> > but I'm all for adding a comment.
> > I don't know if we should reference the c++ wording though, as I don;t 
> > think there is value in not applying that behavior in all language modes.
> Perhaps phrase it something like "diagnostics are issued consistent with the 
> replacement character insertion strategy of Unicode PR-121 policy option 1". 
> The goal is to make it clear that the diagnostic strategy is not out of thin 
> air; that it follows Unicode endorsed behavior.
Tom covered why I would prefer to see a comment, but as for the standards 
reference -- we already put in references to only one language for behavior we 
treat as an extension, so this would be more of the same. That ends up being 
helpful (especially if there's an explicit comment about it being an extension 
in other languages) because it tells the reader how the extension is roughly 
expected to behave as well.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D128059/new/

https://reviews.llvm.org/D128059

_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits

[PATCH] D128059: [Clang] Add a warning on invalid UTF-8 in comments.

Reply via email to