[PATCH] D61559: Fix the crash when formatting unsupported encodings

2019-05-08 Thread Owen Pan via Phabricator via cfe-commits
This revision was automatically updated to reflect the committed changes.
Closed by commit rL360257: [clang-format] Fix the crash when formatting 
unsupported encodings (authored by owenpan, committed by ).
Herald added a project: LLVM.
Herald added a subscriber: llvm-commits.

Changed prior to commit:
  https://reviews.llvm.org/D61559?vs=198652&id=198655#toc

Repository:
  rL LLVM

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D61559/new/

https://reviews.llvm.org/D61559

Files:
  cfe/trunk/tools/clang-format/ClangFormat.cpp


Index: cfe/trunk/tools/clang-format/ClangFormat.cpp
===
--- cfe/trunk/tools/clang-format/ClangFormat.cpp
+++ cfe/trunk/tools/clang-format/ClangFormat.cpp
@@ -257,6 +257,36 @@
   std::unique_ptr<llvm::MemoryBuffer> Code = std::move(CodeOrErr.get());
   if (Code->getBufferSize() == 0)
     return false; // Empty files are formatted correctly.
+
+  // Check to see if the buffer has a UTF Byte Order Mark (BOM).
+  // We only support UTF-8 with and without a BOM right now.  See
+  // https://en.wikipedia.org/wiki/Byte_order_mark#Byte_order_marks_by_encoding
+  // for more information.
+  StringRef BufStr = Code->getBuffer();
+  const char *InvalidBOM = llvm::StringSwitch<const char *>(BufStr)
+    .StartsWith(llvm::StringLiteral::withInnerNUL("\x00\x00\xFE\xFF"),
+      "UTF-32 (BE)")
+    .StartsWith(llvm::StringLiteral::withInnerNUL("\xFF\xFE\x00\x00"),
+      "UTF-32 (LE)")
+    .StartsWith("\xFE\xFF", "UTF-16 (BE)")
+    .StartsWith("\xFF\xFE", "UTF-16 (LE)")
+    .StartsWith("\x2B\x2F\x76", "UTF-7")
+    .StartsWith("\xF7\x64\x4C", "UTF-1")
+    .StartsWith("\xDD\x73\x66\x73", "UTF-EBCDIC")
+    .StartsWith("\x0E\xFE\xFF", "SCSU")
+    .StartsWith("\xFB\xEE\x28", "BOCU-1")
+    .StartsWith("\x84\x31\x95\x33", "GB-18030")
+    .Default(nullptr);
+
+  if (InvalidBOM) {
+    errs() << "error: encoding with unsupported byte order mark \""
+           << InvalidBOM << "\" detected";
+    if (FileName != "-")
+      errs() << " in file '" << FileName << "'";
+    errs() << ".\n";
+    return true;
+  }
+
   std::vector<tooling::Range> Ranges;
   if (fillRanges(Code.get(), Ranges))
     return true;


___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D61559: Fix the crash when formatting unsupported encodings

2019-05-08 Thread Owen Pan via Phabricator via cfe-commits
owenpan marked an inline comment as done.
owenpan added inline comments.



Comment at: clang/tools/clang-format/ClangFormat.cpp:273
+.StartsWith("\xFF\xFE", "UTF-16 (LE)")
+.StartsWith("\x2B\x2F\x76", "UTF-7")
+.StartsWith("\xF7\x64\x4C", "UTF-1")

sammccall wrote:
> Seems unlikely we'll ever see any of these other than UTF{16,32}.
> I'd suggest dropping them, but up to you.
I will keep the rare BOM cases to keep it in sync with D61628.


Repository:
  rC Clang

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D61559/new/

https://reviews.llvm.org/D61559





[PATCH] D61559: Fix the crash when formatting unsupported encodings

2019-05-08 Thread Owen Pan via Phabricator via cfe-commits
owenpan updated this revision to Diff 198652.
owenpan added a comment.

Fixed the typo for SCSU.


Repository:
  rC Clang

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D61559/new/

https://reviews.llvm.org/D61559

Files:
  clang/tools/clang-format/ClangFormat.cpp


Index: clang/tools/clang-format/ClangFormat.cpp
===
--- clang/tools/clang-format/ClangFormat.cpp
+++ clang/tools/clang-format/ClangFormat.cpp
@@ -257,6 +257,36 @@
   std::unique_ptr<llvm::MemoryBuffer> Code = std::move(CodeOrErr.get());
   if (Code->getBufferSize() == 0)
     return false; // Empty files are formatted correctly.
+
+  // Check to see if the buffer has a UTF Byte Order Mark (BOM).
+  // We only support UTF-8 with and without a BOM right now.  See
+  // https://en.wikipedia.org/wiki/Byte_order_mark#Byte_order_marks_by_encoding
+  // for more information.
+  StringRef BufStr = Code->getBuffer();
+  const char *InvalidBOM = llvm::StringSwitch<const char *>(BufStr)
+    .StartsWith(llvm::StringLiteral::withInnerNUL("\x00\x00\xFE\xFF"),
+      "UTF-32 (BE)")
+    .StartsWith(llvm::StringLiteral::withInnerNUL("\xFF\xFE\x00\x00"),
+      "UTF-32 (LE)")
+    .StartsWith("\xFE\xFF", "UTF-16 (BE)")
+    .StartsWith("\xFF\xFE", "UTF-16 (LE)")
+    .StartsWith("\x2B\x2F\x76", "UTF-7")
+    .StartsWith("\xF7\x64\x4C", "UTF-1")
+    .StartsWith("\xDD\x73\x66\x73", "UTF-EBCDIC")
+    .StartsWith("\x0E\xFE\xFF", "SCSU")
+    .StartsWith("\xFB\xEE\x28", "BOCU-1")
+    .StartsWith("\x84\x31\x95\x33", "GB-18030")
+    .Default(nullptr);
+
+  if (InvalidBOM) {
+    errs() << "error: encoding with unsupported byte order mark \""
+           << InvalidBOM << "\" detected";
+    if (FileName != "-")
+      errs() << " in file '" << FileName << "'";
+    errs() << ".\n";
+    return true;
+  }
+
   std::vector<tooling::Range> Ranges;
   if (fillRanges(Code.get(), Ranges))
     return true;




[PATCH] D61559: Fix the crash when formatting unsupported encodings

2019-05-07 Thread Owen Pan via Phabricator via cfe-commits
owenpan added a comment.

I copied the code from `clang/lib/Basic/SourceManager.cpp`. See D61628.

I will update this patch to correct the typo in SCSU or remove it along with 
the other rare BOMs. How do you add test cases for this kind of fix that emits 
error messages?


Repository:
  rC Clang

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D61559/new/

https://reviews.llvm.org/D61559





[PATCH] D61559: Fix the crash when formatting unsupported encodings

2019-05-06 Thread Sam McCall via Phabricator via cfe-commits
sammccall added a comment.

You might want to add a test for one of these?


Repository:
  rC Clang

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D61559/new/

https://reviews.llvm.org/D61559





[PATCH] D61559: Fix the crash when formatting unsupported encodings

2019-05-06 Thread Sam McCall via Phabricator via cfe-commits
sammccall accepted this revision.
sammccall added inline comments.
This revision is now accepted and ready to land.



Comment at: clang/tools/clang-format/ClangFormat.cpp:273
+.StartsWith("\xFF\xFE", "UTF-16 (LE)")
+.StartsWith("\x2B\x2F\x76", "UTF-7")
+.StartsWith("\xF7\x64\x4C", "UTF-1")

Seems unlikely we'll ever see any of these other than UTF{16,32}.
I'd suggest dropping them, but up to you.



Comment at: clang/tools/clang-format/ClangFormat.cpp:276
+.StartsWith("\xDD\x73\x66\x73", "UTF-EBCDIC")
+.StartsWith("\x0E\xFE\xFF", "SDSU")
+.StartsWith("\xFB\xEE\x28", "BOCU-1")

SCSU?


Repository:
  rC Clang

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D61559/new/

https://reviews.llvm.org/D61559





[PATCH] D61559: Fix the crash when formatting unsupported encodings

2019-05-05 Thread Owen Pan via Phabricator via cfe-commits
owenpan updated this revision to Diff 198205.
owenpan added a comment.

Moved "UTF-32 (LE)" to before "UTF-16 (LE)" in `llvm::StringSwitch` so that the 
former BOM wouldn't be misnamed as the latter.
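
The ordering issue can be reproduced outside LLVM. Below is a minimal sketch, assuming a first-match-wins prefix table similar in spirit to `llvm::StringSwitch::StartsWith`. The names `BomEntry`, `firstMatch`, `utf16First`, and `utf32First` are hypothetical and not part of the patch or of LLVM. Because the UTF-32 (LE) BOM `FF FE 00 00` begins with the UTF-16 (LE) BOM `FF FE`, the shorter prefix must be tested second.

```cpp
#include <cstddef>
#include <string_view>

using namespace std::string_view_literals;

// Hypothetical table entry: a BOM prefix and the encoding name it maps to.
struct BomEntry {
  std::string_view Prefix;
  const char *Name;
};

// Returns the name of the first entry whose prefix Buf starts with,
// or nullptr if no entry matches (first match wins).
template <std::size_t N>
const char *firstMatch(std::string_view Buf, const BomEntry (&Table)[N]) {
  for (const BomEntry &E : Table)
    if (Buf.substr(0, E.Prefix.size()) == E.Prefix)
      return E.Name;
  return nullptr;
}

// Order used before the update: the UTF-16 (LE) prefix shadows UTF-32 (LE).
const char *utf16First(std::string_view Buf) {
  static const BomEntry Table[] = {
      {"\xFF\xFE"sv, "UTF-16 (LE)"},
      // The "sv" literal keeps the embedded NUL bytes in the prefix.
      {"\xFF\xFE\x00\x00"sv, "UTF-32 (LE)"},
  };
  return firstMatch(Buf, Table);
}

// Order used after the update: the longer prefix is tested first.
const char *utf32First(std::string_view Buf) {
  static const BomEntry Table[] = {
      {"\xFF\xFE\x00\x00"sv, "UTF-32 (LE)"},
      {"\xFF\xFE"sv, "UTF-16 (LE)"},
  };
  return firstMatch(Buf, Table);
}
```

Given a buffer that starts with `FF FE 00 00`, `utf16First` reports "UTF-16 (LE)" while `utf32First` reports "UTF-32 (LE)". A buffer that starts with `FF FE` followed by a non-NUL byte is reported as "UTF-16 (LE)" by both.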


Repository:
  rC Clang

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D61559/new/

https://reviews.llvm.org/D61559

Files:
  clang/tools/clang-format/ClangFormat.cpp


Index: clang/tools/clang-format/ClangFormat.cpp
===
--- clang/tools/clang-format/ClangFormat.cpp
+++ clang/tools/clang-format/ClangFormat.cpp
@@ -257,6 +257,36 @@
   std::unique_ptr<llvm::MemoryBuffer> Code = std::move(CodeOrErr.get());
   if (Code->getBufferSize() == 0)
     return false; // Empty files are formatted correctly.
+
+  // Check to see if the buffer has a UTF Byte Order Mark (BOM).
+  // We only support UTF-8 with and without a BOM right now.  See
+  // https://en.wikipedia.org/wiki/Byte_order_mark#Byte_order_marks_by_encoding
+  // for more information.
+  StringRef BufStr = Code->getBuffer();
+  const char *InvalidBOM = llvm::StringSwitch<const char *>(BufStr)
+    .StartsWith(llvm::StringLiteral::withInnerNUL("\x00\x00\xFE\xFF"),
+      "UTF-32 (BE)")
+    .StartsWith(llvm::StringLiteral::withInnerNUL("\xFF\xFE\x00\x00"),
+      "UTF-32 (LE)")
+    .StartsWith("\xFE\xFF", "UTF-16 (BE)")
+    .StartsWith("\xFF\xFE", "UTF-16 (LE)")
+    .StartsWith("\x2B\x2F\x76", "UTF-7")
+    .StartsWith("\xF7\x64\x4C", "UTF-1")
+    .StartsWith("\xDD\x73\x66\x73", "UTF-EBCDIC")
+    .StartsWith("\x0E\xFE\xFF", "SDSU")
+    .StartsWith("\xFB\xEE\x28", "BOCU-1")
+    .StartsWith("\x84\x31\x95\x33", "GB-18030")
+    .Default(nullptr);
+
+  if (InvalidBOM) {
+    errs() << "error: encoding with unsupported byte order mark \""
+           << InvalidBOM << "\" detected";
+    if (FileName != "-")
+      errs() << " in file '" << FileName << "'";
+    errs() << ".\n";
+    return true;
+  }
+
   std::vector<tooling::Range> Ranges;
   if (fillRanges(Code.get(), Ranges))
     return true;




[PATCH] D61559: Fix the crash when formatting unsupported encodings

2019-05-04 Thread Owen Pan via Phabricator via cfe-commits
owenpan created this revision.
owenpan added reviewers: klimek, djasper, sammccall, MyDeveloperDay, krasimir.
owenpan added a project: clang.
Herald added a subscriber: cfe-commits.

See PR33946.


Repository:
  rC Clang

https://reviews.llvm.org/D61559

Files:
  clang/tools/clang-format/ClangFormat.cpp


Index: clang/tools/clang-format/ClangFormat.cpp
===
--- clang/tools/clang-format/ClangFormat.cpp
+++ clang/tools/clang-format/ClangFormat.cpp
@@ -257,6 +257,35 @@
   std::unique_ptr<llvm::MemoryBuffer> Code = std::move(CodeOrErr.get());
   if (Code->getBufferSize() == 0)
     return false; // Empty files are formatted correctly.
+
+  // Check to see if the buffer has a UTF Byte Order Mark (BOM).
+  // We only support UTF-8 with and without a BOM right now.  See
+  // http://en.wikipedia.org/wiki/Byte_order_mark for more information.
+  StringRef BufStr = Code->getBuffer();
+  const char *InvalidBOM = llvm::StringSwitch<const char *>(BufStr)
+    .StartsWith("\xFE\xFF", "UTF-16 (BE)")
+    .StartsWith("\xFF\xFE", "UTF-16 (LE)")
+    .StartsWith(llvm::StringLiteral::withInnerNUL("\x00\x00\xFE\xFF"),
+      "UTF-32 (BE)")
+    .StartsWith(llvm::StringLiteral::withInnerNUL("\xFF\xFE\x00\x00"),
+      "UTF-32 (LE)")
+    .StartsWith("\x2B\x2F\x76", "UTF-7")
+    .StartsWith("\xF7\x64\x4C", "UTF-1")
+    .StartsWith("\xDD\x73\x66\x73", "UTF-EBCDIC")
+    .StartsWith("\x0E\xFE\xFF", "SDSU")
+    .StartsWith("\xFB\xEE\x28", "BOCU-1")
+    .StartsWith("\x84\x31\x95\x33", "GB-18030")
+    .Default(nullptr);
+
+  if (InvalidBOM) {
+    errs() << "error: encoding with unsupported byte order mark \""
+           << InvalidBOM << "\" detected";
+    if (FileName != "-")
+      errs() << " in file '" << FileName << "'";
+    errs() << ".\n";
+    return true;
+  }
+
   std::vector<tooling::Range> Ranges;
   if (fillRanges(Code.get(), Ranges))
     return true;

