LiteralSupport.cpp

Seth Cantrell Sun, 22 Jan 2012 06:15:37 -0800

Okay, this isn't a patch, I just want to know if this is the direction you're looking for. The included changes will result in an error message like:

tmp.cpp:3:10: error: illegal sequence in character literal
char a = '<f1>';
^

It doesn't do anything to try to make the highlighted ranges look correct for the modified source line. This will cause a problem similar to the one that already exists with multi-byte UTF-8 encodings, where carets and ranges do not appear correct after these sequences.

Also, I think it would be better if these <XX> strings could be displayed with a reversed color scheme, just like vi, if the console supports it. Is there any possibility of doing that?

Anyway, here are the changes I am contemplating:

diff --git a/lib/Frontend/TextDiagnostic.cpp b/lib/Frontend/TextDiagnostic.cpp
index d2b8660..464b81b 100644
--- a/lib/Frontend/TextDiagnostic.cpp
+++ b/lib/Frontend/TextDiagnostic.cpp
@@ -10,6 +10,7 @@
#include "clang/Frontend/TextDiagnostic.h"
#include "clang/Basic/FileManager.h"
#include "clang/Basic/SourceManager.h"
+#include "clang/Basic/ConvertUTF.h"
#include "clang/Frontend/DiagnosticOptions.h"
#include "clang/Lex/Lexer.h"
#include "llvm/Support/MemoryBuffer.h"
@@ -17,6 +18,10 @@
#include "llvm/Support/ErrorHandling.h"
#include "llvm/ADT/SmallString.h"
#include <algorithm>
+
+#include <sstream>
+#include <iomanip>
+
using namespace clang;

static const enum raw_ostream::Colors noteColor =
@@ -592,6 +597,36 @@ void TextDiagnostic::emitSnippetAndCaret(
// Copy the line of code into an std::string for ease of manipulation.
std::string SourceLine(LineStart, LineEnd);

+
+ static const char trailingBytesForUTF8[256] = {
+ 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
+ 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
+ 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
+ 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
+ 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
+ 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
+ 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
+ 2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2, 3,3,3,3,3,3,3,3,4,4,4,4,5,5,5,5
+ };
+
+ unsigned char const *i = reinterpret_cast<unsigned char const *>(LineStart);
+ unsigned char const *end = reinterpret_cast<unsigned char const *>(LineEnd);
+ std::string SourceLineMod;
+ while(i!=end) {
+ if(isLegalUTF8Sequence(i,end)) {
+ std::copy(i,i+trailingBytesForUTF8[*i]+1,back_inserter(SourceLineMod));
+ i+=trailingBytesForUTF8[*i]+1;
+ } else {
+ std::stringstream ss;
+ ss.fill('0');
+ ss << std::hex << '<' << std::setw(2) << (unsigned int)*i << '>';
+ SourceLineMod.append(ss.str());
+ ++i;
+ }
+ }
+
+ SourceLine = SourceLineMod;
+
// Create a line for the caret that is filled with spaces that is the same
// length as the line of source code.
std::string CaretLine(LineEnd-LineStart, ' ');

On Jan 19, 2012, at 10:36 PM, Seth Cantrell wrote:

Here's a patch that improves that error message. A patch that improves the display of encoding errors will take me longer to get to.

On Jan 18, 2012, at 8:09 PM, Eli Friedman wrote:

On Wed, Jan 18, 2012 at 4:44 PM, Seth Cantrell <seth.cantr...@gmail.com> wrote:

On Jan 18, 2012, at 5:49 PM, Eli Friedman wrote:

On Wed, Jan 18, 2012 at 4:27 AM, Seth Cantrell <seth.cantr...@gmail.com> wrote:
+ while (begin!=end) {
+    // Is this a span of non-escape characters?
+    if (begin[0] != '\\') {
+      char const *start = begin;
+      do {
+        ++begin;
+      } while (begin != end && *begin != '\\');
+
+      uint32_t *tmp_begin = buffer_begin;
+      ConversionResult res =
+      ConvertUTF8toUTF32(reinterpret_cast<UTF8 const **>(&start),
+                         reinterpret_cast<UTF8 const *>(begin),
+                         &buffer_begin,buffer_end,strictConversion);
+      if (res!=conversionOK) {
+        PP.Diag(Loc, diag::err_bad_character_encoding);

This error message can lead to rather uninformative complaints which
look like the following:

fribidi_char_sets_cp1256.c:214:9: error:
    illegal sequence in character literal
return '?';
      ^

Any ideas for how we could improve this diagnostic?

-Eli

I suppose a marginally better message could be 'illegal character encoding in character literal'.

Yes, that would be a bit better.

It'd also be good if the actual bytes could be highlighted. Something like vi's method of displaying illegal encodings using reversed colors would work to display them on the command line. Also adding a range to highlight the exact issue inside the literal. We'd need a way to calculate the locations for bytes inside the literal (there's a method there that looks like it works only for purely ascii strings). The console display for such ranges would need to be smarter about displaying ranges for lines that include multi-byte characters and also know about whatever method is chosen to show illegal bytes.

fribidi_char_sets_cp1256.c:214:9: error:
    illegal character encoding in character literal
return '123<F1>';
      ^   ~~~~

Displaying illegal bytes vi-style would be a big improvement. Hacking
up TextDiagnostic::emitSnippetAndCaret to do that should be
straightforward, given a function to figure out whether a byte is
illegal.

-Eli
<0002-improve-error-message-for-file-encoding-errors.patch>

_______________________________________________
cfe-commits mailing list
cfe-commits@cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/cfe-commits

Re: [cfe-commits] r148389 - in /cfe/trunk: include/clang/Basic/DiagnosticLexKinds.td lib/Lex/LiteralSupport.cpp

Reply via email to