[clang] reland: [clang][ScanDeps] Canonicalize -D and -U flags (PR #82568)

2024-02-23 Thread Michael Spencer via cfe-commits

https://github.com/Bigcheese closed 
https://github.com/llvm/llvm-project/pull/82568
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[clang] reland: [clang][ScanDeps] Canonicalize -D and -U flags (PR #82568)

2024-02-23 Thread Michael Spencer via cfe-commits

Bigcheese wrote:

CI failure was a preexisting trailing whitespace issue.

https://github.com/llvm/llvm-project/pull/82568
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[clang] reland: [clang][ScanDeps] Canonicalize -D and -U flags (PR #82568)

2024-02-23 Thread Michael Spencer via cfe-commits

https://github.com/Bigcheese updated 
https://github.com/llvm/llvm-project/pull/82568

>From a690c96562dea29a760390644d78a01a263993ca Mon Sep 17 00:00:00 2001
From: Michael Spencer 
Date: Fri, 16 Feb 2024 22:05:25 -0800
Subject: [PATCH] [clang][ScanDeps] Canonicalize -D and -U flags

Canonicalize `-D` and `-U` flags by sorting them and only keeping the
last instance of a given name.

This optimization will only fire if all `-D` and `-U` flags start with
a simple identifier that we can guarantee a simple analysis of can
determine if two flags refer to the same identifier or not. See the
comment on `getSimpleMacroName()` for details of what the issues are.
---
 .../DependencyScanningService.h   |  5 +-
 .../DependencyScanningWorker.cpp  | 74 
 .../optimize-canonicalize-macros.m| 87 +++
 clang/tools/clang-scan-deps/ClangScanDeps.cpp |  1 +
 4 files changed, 166 insertions(+), 1 deletion(-)
 create mode 100644 clang/test/ClangScanDeps/optimize-canonicalize-macros.m

diff --git 
a/clang/include/clang/Tooling/DependencyScanning/DependencyScanningService.h 
b/clang/include/clang/Tooling/DependencyScanning/DependencyScanningService.h
index 4f9867262a275c..557f0e547ab4a8 100644
--- a/clang/include/clang/Tooling/DependencyScanning/DependencyScanningService.h
+++ b/clang/include/clang/Tooling/DependencyScanning/DependencyScanningService.h
@@ -60,7 +60,10 @@ enum class ScanningOptimizations {
   /// Remove unused -ivfsoverlay arguments.
   VFS = 4,
 
-  DSS_LAST_BITMASK_ENUM(VFS),
+  /// Canonicalize -D and -U options.
+  Macros = 8,
+
+  DSS_LAST_BITMASK_ENUM(Macros),
   Default = All
 };
 
diff --git a/clang/lib/Tooling/DependencyScanning/DependencyScanningWorker.cpp 
b/clang/lib/Tooling/DependencyScanning/DependencyScanningWorker.cpp
index 3cf3ad8a4e4907..7477b930188b4f 100644
--- a/clang/lib/Tooling/DependencyScanning/DependencyScanningWorker.cpp
+++ b/clang/lib/Tooling/DependencyScanning/DependencyScanningWorker.cpp
@@ -179,6 +179,78 @@ static void sanitizeDiagOpts(DiagnosticOptions &DiagOpts) {
   DiagOpts.IgnoreWarnings = true;
 }
 
+// Clang implements -D and -U by splatting text into a predefines buffer. This
+// allows constructs such as `-DFඞ=3 "-D F\u{0D9E} 4 3 2”` to be accepted and
+// define the same macro, or adding C++ style comments before the macro name.
+//
+// This function checks that the first non-space characters in the macro
+// obviously form an identifier that can be uniqued on without lexing. Failing
+// to do this could lead to changing the final definition of a macro.
+//
+// We could set up a preprocessor and actually lex the name, but that's very
+// heavyweight for a situation that will almost never happen in practice.
+static std::optional getSimpleMacroName(StringRef Macro) {
+  StringRef Name = Macro.split("=").first.ltrim(" \t");
+  std::size_t I = 0;
+
+  auto FinishName = [&]() -> std::optional {
+StringRef SimpleName = Name.slice(0, I);
+if (SimpleName.empty())
+  return std::nullopt;
+return SimpleName;
+  };
+
+  for (; I != Name.size(); ++I) {
+switch (Name[I]) {
+case '(': // Start of macro parameter list
+case ' ': // End of macro name
+case '\t':
+  return FinishName();
+case '_':
+  continue;
+default:
+  if (llvm::isAlnum(Name[I]))
+continue;
+  return std::nullopt;
+}
+  }
+  return FinishName();
+}
+
+static void canonicalizeDefines(PreprocessorOptions &PPOpts) {
+  using MacroOpt = std::pair;
+  std::vector SimpleNames;
+  SimpleNames.reserve(PPOpts.Macros.size());
+  std::size_t Index = 0;
+  for (const auto &M : PPOpts.Macros) {
+auto SName = getSimpleMacroName(M.first);
+// Skip optimizing if we can't guarantee we can preserve relative order.
+if (!SName)
+  return;
+SimpleNames.emplace_back(*SName, Index);
+++Index;
+  }
+
+  llvm::stable_sort(SimpleNames, [](const MacroOpt &A, const MacroOpt &B) {
+return A.first < B.first;
+  });
+  // Keep the last instance of each macro name by going in reverse
+  auto NewEnd = std::unique(
+  SimpleNames.rbegin(), SimpleNames.rend(),
+  [](const MacroOpt &A, const MacroOpt &B) { return A.first == B.first; });
+  SimpleNames.erase(SimpleNames.begin(), NewEnd.base());
+
+  // Apply permutation.
+  decltype(PPOpts.Macros) NewMacros;
+  NewMacros.reserve(SimpleNames.size());
+  for (std::size_t I = 0, E = SimpleNames.size(); I != E; ++I) {
+std::size_t OriginalIndex = SimpleNames[I].second;
+// We still emit undefines here as they may be undefining a predefined 
macro
+NewMacros.push_back(std::move(PPOpts.Macros[OriginalIndex]));
+  }
+  std::swap(PPOpts.Macros, NewMacros);
+}
+
 /// A clang tool that runs the preprocessor in a mode that's optimized for
 /// dependency scanning for the given compiler invocation.
 class DependencyScanningAction : public tooling::ToolAction {
@@ -203,6 +275,8 @@ class DependencyScanningAction : public tooling::ToolAc

[clang] reland: [clang][ScanDeps] Canonicalize -D and -U flags (PR #82568)

2024-02-23 Thread Michael Spencer via cfe-commits

https://github.com/Bigcheese updated 
https://github.com/llvm/llvm-project/pull/82568

>From eb622c20b8d84afabdbb82543c1f9e4889639735 Mon Sep 17 00:00:00 2001
From: Michael Spencer 
Date: Fri, 16 Feb 2024 22:05:25 -0800
Subject: [PATCH] [clang][ScanDeps] Canonicalize -D and -U flags

Canonicalize `-D` and `-U` flags by sorting them and only keeping the
last instance of a given name.

This optimization will only fire if all `-D` and `-U` flags start with
a simple identifier that we can guarantee a simple analysis of can
determine if two flags refer to the same identifier or not. See the
comment on `getSimpleMacroName()` for details of what the issues are.
---
 .../DependencyScanningService.h   |  5 +-
 .../DependencyScanningWorker.cpp  | 74 
 .../optimize-canonicalize-macros.m| 87 +++
 clang/tools/clang-scan-deps/ClangScanDeps.cpp |  1 +
 4 files changed, 166 insertions(+), 1 deletion(-)
 create mode 100644 clang/test/ClangScanDeps/optimize-canonicalize-macros.m

diff --git 
a/clang/include/clang/Tooling/DependencyScanning/DependencyScanningService.h 
b/clang/include/clang/Tooling/DependencyScanning/DependencyScanningService.h
index 4f9867262a275c..557f0e547ab4a8 100644
--- a/clang/include/clang/Tooling/DependencyScanning/DependencyScanningService.h
+++ b/clang/include/clang/Tooling/DependencyScanning/DependencyScanningService.h
@@ -60,7 +60,10 @@ enum class ScanningOptimizations {
   /// Remove unused -ivfsoverlay arguments.
   VFS = 4,
 
-  DSS_LAST_BITMASK_ENUM(VFS),
+  /// Canonicalize -D and -U options.
+  Macros = 8,
+
+  DSS_LAST_BITMASK_ENUM(Macros),
   Default = All
 };
 
diff --git a/clang/lib/Tooling/DependencyScanning/DependencyScanningWorker.cpp 
b/clang/lib/Tooling/DependencyScanning/DependencyScanningWorker.cpp
index 3cf3ad8a4e4907..7477b930188b4f 100644
--- a/clang/lib/Tooling/DependencyScanning/DependencyScanningWorker.cpp
+++ b/clang/lib/Tooling/DependencyScanning/DependencyScanningWorker.cpp
@@ -179,6 +179,78 @@ static void sanitizeDiagOpts(DiagnosticOptions &DiagOpts) {
   DiagOpts.IgnoreWarnings = true;
 }
 
+// Clang implements -D and -U by splatting text into a predefines buffer. This
+// allows constructs such as `-DFඞ=3 "-D F\u{0D9E} 4 3 2”` to be accepted and
+// define the same macro, or adding C++ style comments before the macro name.
+//
+// This function checks that the first non-space characters in the macro
+// obviously form an identifier that can be uniqued on without lexing. Failing
+// to do this could lead to changing the final definition of a macro.
+//
+// We could set up a preprocessor and actually lex the name, but that's very
+// heavyweight for a situation that will almost never happen in practice.
+static std::optional getSimpleMacroName(StringRef Macro) {
+  StringRef Name = Macro.split("=").first.ltrim(" \t");
+  std::size_t I = 0;
+
+  auto FinishName = [&]() -> std::optional {
+StringRef SimpleName = Name.slice(0, I);
+if (SimpleName.empty())
+  return std::nullopt;
+return SimpleName;
+  };
+
+  for (; I != Name.size(); ++I) {
+switch (Name[I]) {
+case '(': // Start of macro parameter list
+case ' ': // End of macro name
+case '\t':
+  return FinishName();
+case '_':
+  continue;
+default:
+  if (llvm::isAlnum(Name[I]))
+continue;
+  return std::nullopt;
+}
+  }
+  return FinishName();
+}
+
+static void canonicalizeDefines(PreprocessorOptions &PPOpts) {
+  using MacroOpt = std::pair;
+  std::vector SimpleNames;
+  SimpleNames.reserve(PPOpts.Macros.size());
+  std::size_t Index = 0;
+  for (const auto &M : PPOpts.Macros) {
+auto SName = getSimpleMacroName(M.first);
+// Skip optimizing if we can't guarantee we can preserve relative order.
+if (!SName)
+  return;
+SimpleNames.emplace_back(*SName, Index);
+++Index;
+  }
+
+  llvm::stable_sort(SimpleNames, [](const MacroOpt &A, const MacroOpt &B) {
+return A.first < B.first;
+  });
+  // Keep the last instance of each macro name by going in reverse
+  auto NewEnd = std::unique(
+  SimpleNames.rbegin(), SimpleNames.rend(),
+  [](const MacroOpt &A, const MacroOpt &B) { return A.first == B.first; });
+  SimpleNames.erase(SimpleNames.begin(), NewEnd.base());
+
+  // Apply permutation.
+  decltype(PPOpts.Macros) NewMacros;
+  NewMacros.reserve(SimpleNames.size());
+  for (std::size_t I = 0, E = SimpleNames.size(); I != E; ++I) {
+std::size_t OriginalIndex = SimpleNames[I].second;
+// We still emit undefines here as they may be undefining a predefined 
macro
+NewMacros.push_back(std::move(PPOpts.Macros[OriginalIndex]));
+  }
+  std::swap(PPOpts.Macros, NewMacros);
+}
+
 /// A clang tool that runs the preprocessor in a mode that's optimized for
 /// dependency scanning for the given compiler invocation.
 class DependencyScanningAction : public tooling::ToolAction {
@@ -203,6 +275,8 @@ class DependencyScanningAction : public tooling::ToolAc

[clang] reland: [clang][ScanDeps] Canonicalize -D and -U flags (PR #82568)

2024-02-22 Thread Michael Spencer via cfe-commits

Bigcheese wrote:

Try double quotes.

https://github.com/llvm/llvm-project/pull/82568
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[clang] reland: [clang][ScanDeps] Canonicalize -D and -U flags (PR #82568)

2024-02-22 Thread Michael Spencer via cfe-commits

https://github.com/Bigcheese updated 
https://github.com/llvm/llvm-project/pull/82568

>From 9759145f34306f1832b1deff0ca1b5e41d2ad89d Mon Sep 17 00:00:00 2001
From: Michael Spencer 
Date: Fri, 16 Feb 2024 22:05:25 -0800
Subject: [PATCH] [clang][ScanDeps] Canonicalize -D and -U flags

Canonicalize `-D` and `-U` flags by sorting them and only keeping the
last instance of a given name.

This optimization will only fire if all `-D` and `-U` flags start with
a simple identifier that we can guarantee a simple analysis of can
determine if two flags refer to the same identifier or not. See the
comment on `getSimpleMacroName()` for details of what the issues are.
---
 .../DependencyScanningService.h   |  5 +-
 .../DependencyScanningWorker.cpp  | 74 
 .../optimize-canonicalize-macros.m| 87 +++
 clang/tools/clang-scan-deps/ClangScanDeps.cpp |  1 +
 4 files changed, 166 insertions(+), 1 deletion(-)
 create mode 100644 clang/test/ClangScanDeps/optimize-canonicalize-macros.m

diff --git 
a/clang/include/clang/Tooling/DependencyScanning/DependencyScanningService.h 
b/clang/include/clang/Tooling/DependencyScanning/DependencyScanningService.h
index 4f9867262a275c..557f0e547ab4a8 100644
--- a/clang/include/clang/Tooling/DependencyScanning/DependencyScanningService.h
+++ b/clang/include/clang/Tooling/DependencyScanning/DependencyScanningService.h
@@ -60,7 +60,10 @@ enum class ScanningOptimizations {
   /// Remove unused -ivfsoverlay arguments.
   VFS = 4,
 
-  DSS_LAST_BITMASK_ENUM(VFS),
+  /// Canonicalize -D and -U options.
+  Macros = 8,
+
+  DSS_LAST_BITMASK_ENUM(Macros),
   Default = All
 };
 
diff --git a/clang/lib/Tooling/DependencyScanning/DependencyScanningWorker.cpp 
b/clang/lib/Tooling/DependencyScanning/DependencyScanningWorker.cpp
index 3cf3ad8a4e4907..7477b930188b4f 100644
--- a/clang/lib/Tooling/DependencyScanning/DependencyScanningWorker.cpp
+++ b/clang/lib/Tooling/DependencyScanning/DependencyScanningWorker.cpp
@@ -179,6 +179,78 @@ static void sanitizeDiagOpts(DiagnosticOptions &DiagOpts) {
   DiagOpts.IgnoreWarnings = true;
 }
 
+// Clang implements -D and -U by splatting text into a predefines buffer. This
+// allows constructs such as `-DFඞ=3 "-D F\u{0D9E} 4 3 2”` to be accepted and
+// define the same macro, or adding C++ style comments before the macro name.
+//
+// This function checks that the first non-space characters in the macro
+// obviously form an identifier that can be uniqued on without lexing. Failing
+// to do this could lead to changing the final definition of a macro.
+//
+// We could set up a preprocessor and actually lex the name, but that's very
+// heavyweight for a situation that will almost never happen in practice.
+static std::optional getSimpleMacroName(StringRef Macro) {
+  StringRef Name = Macro.split("=").first.ltrim(" \t");
+  std::size_t I = 0;
+
+  auto FinishName = [&]() -> std::optional {
+StringRef SimpleName = Name.slice(0, I);
+if (SimpleName.empty())
+  return std::nullopt;
+return SimpleName;
+  };
+
+  for (; I != Name.size(); ++I) {
+switch (Name[I]) {
+case '(': // Start of macro parameter list
+case ' ': // End of macro name
+case '\t':
+  return FinishName();
+case '_':
+  continue;
+default:
+  if (llvm::isAlnum(Name[I]))
+continue;
+  return std::nullopt;
+}
+  }
+  return FinishName();
+}
+
+static void canonicalizeDefines(PreprocessorOptions &PPOpts) {
+  using MacroOpt = std::pair;
+  std::vector SimpleNames;
+  SimpleNames.reserve(PPOpts.Macros.size());
+  std::size_t Index = 0;
+  for (const auto &M : PPOpts.Macros) {
+auto SName = getSimpleMacroName(M.first);
+// Skip optimizing if we can't guarantee we can preserve relative order.
+if (!SName)
+  return;
+SimpleNames.emplace_back(*SName, Index);
+++Index;
+  }
+
+  llvm::stable_sort(SimpleNames, [](const MacroOpt &A, const MacroOpt &B) {
+return A.first < B.first;
+  });
+  // Keep the last instance of each macro name by going in reverse
+  auto NewEnd = std::unique(
+  SimpleNames.rbegin(), SimpleNames.rend(),
+  [](const MacroOpt &A, const MacroOpt &B) { return A.first == B.first; });
+  SimpleNames.erase(SimpleNames.begin(), NewEnd.base());
+
+  // Apply permutation.
+  decltype(PPOpts.Macros) NewMacros;
+  NewMacros.reserve(SimpleNames.size());
+  for (std::size_t I = 0, E = SimpleNames.size(); I != E; ++I) {
+std::size_t OriginalIndex = SimpleNames[I].second;
+// We still emit undefines here as they may be undefining a predefined 
macro
+NewMacros.push_back(std::move(PPOpts.Macros[OriginalIndex]));
+  }
+  std::swap(PPOpts.Macros, NewMacros);
+}
+
 /// A clang tool that runs the preprocessor in a mode that's optimized for
 /// dependency scanning for the given compiler invocation.
 class DependencyScanningAction : public tooling::ToolAction {
@@ -203,6 +275,8 @@ class DependencyScanningAction : public tooling::ToolAc

[clang] reland: [clang][ScanDeps] Canonicalize -D and -U flags (PR #82568)

2024-02-21 Thread Michael Spencer via cfe-commits

Bigcheese wrote:

Windows didn't like the quoted argument, now let's see if Linux is happy with 
an unquoted argument.

https://github.com/llvm/llvm-project/pull/82568
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[clang] reland: [clang][ScanDeps] Canonicalize -D and -U flags (PR #82568)

2024-02-21 Thread Michael Spencer via cfe-commits

https://github.com/Bigcheese updated 
https://github.com/llvm/llvm-project/pull/82568

>From d8bfbdeedbf0a3bdd2db25e7dd389d6f223091a3 Mon Sep 17 00:00:00 2001
From: Michael Spencer 
Date: Fri, 16 Feb 2024 22:05:25 -0800
Subject: [PATCH] [clang][ScanDeps] Canonicalize -D and -U flags

Canonicalize `-D` and `-U` flags by sorting them and only keeping the
last instance of a given name.

This optimization will only fire if all `-D` and `-U` flags start with
a simple identifier that we can guarantee a simple analysis of can
determine if two flags refer to the same identifier or not. See the
comment on `getSimpleMacroName()` for details of what the issues are.
---
 .../DependencyScanningService.h   |  5 +-
 .../DependencyScanningWorker.cpp  | 74 
 .../optimize-canonicalize-macros.m| 87 +++
 clang/tools/clang-scan-deps/ClangScanDeps.cpp |  1 +
 4 files changed, 166 insertions(+), 1 deletion(-)
 create mode 100644 clang/test/ClangScanDeps/optimize-canonicalize-macros.m

diff --git 
a/clang/include/clang/Tooling/DependencyScanning/DependencyScanningService.h 
b/clang/include/clang/Tooling/DependencyScanning/DependencyScanningService.h
index 4f9867262a275c..557f0e547ab4a8 100644
--- a/clang/include/clang/Tooling/DependencyScanning/DependencyScanningService.h
+++ b/clang/include/clang/Tooling/DependencyScanning/DependencyScanningService.h
@@ -60,7 +60,10 @@ enum class ScanningOptimizations {
   /// Remove unused -ivfsoverlay arguments.
   VFS = 4,
 
-  DSS_LAST_BITMASK_ENUM(VFS),
+  /// Canonicalize -D and -U options.
+  Macros = 8,
+
+  DSS_LAST_BITMASK_ENUM(Macros),
   Default = All
 };
 
diff --git a/clang/lib/Tooling/DependencyScanning/DependencyScanningWorker.cpp 
b/clang/lib/Tooling/DependencyScanning/DependencyScanningWorker.cpp
index 3cf3ad8a4e4907..7477b930188b4f 100644
--- a/clang/lib/Tooling/DependencyScanning/DependencyScanningWorker.cpp
+++ b/clang/lib/Tooling/DependencyScanning/DependencyScanningWorker.cpp
@@ -179,6 +179,78 @@ static void sanitizeDiagOpts(DiagnosticOptions &DiagOpts) {
   DiagOpts.IgnoreWarnings = true;
 }
 
+// Clang implements -D and -U by splatting text into a predefines buffer. This
+// allows constructs such as `-DFඞ=3 "-D F\u{0D9E} 4 3 2”` to be accepted and
+// define the same macro, or adding C++ style comments before the macro name.
+//
+// This function checks that the first non-space characters in the macro
+// obviously form an identifier that can be uniqued on without lexing. Failing
+// to do this could lead to changing the final definition of a macro.
+//
+// We could set up a preprocessor and actually lex the name, but that's very
+// heavyweight for a situation that will almost never happen in practice.
+static std::optional getSimpleMacroName(StringRef Macro) {
+  StringRef Name = Macro.split("=").first.ltrim(" \t");
+  std::size_t I = 0;
+
+  auto FinishName = [&]() -> std::optional {
+StringRef SimpleName = Name.slice(0, I);
+if (SimpleName.empty())
+  return std::nullopt;
+return SimpleName;
+  };
+
+  for (; I != Name.size(); ++I) {
+switch (Name[I]) {
+case '(': // Start of macro parameter list
+case ' ': // End of macro name
+case '\t':
+  return FinishName();
+case '_':
+  continue;
+default:
+  if (llvm::isAlnum(Name[I]))
+continue;
+  return std::nullopt;
+}
+  }
+  return FinishName();
+}
+
+static void canonicalizeDefines(PreprocessorOptions &PPOpts) {
+  using MacroOpt = std::pair;
+  std::vector SimpleNames;
+  SimpleNames.reserve(PPOpts.Macros.size());
+  std::size_t Index = 0;
+  for (const auto &M : PPOpts.Macros) {
+auto SName = getSimpleMacroName(M.first);
+// Skip optimizing if we can't guarantee we can preserve relative order.
+if (!SName)
+  return;
+SimpleNames.emplace_back(*SName, Index);
+++Index;
+  }
+
+  llvm::stable_sort(SimpleNames, [](const MacroOpt &A, const MacroOpt &B) {
+return A.first < B.first;
+  });
+  // Keep the last instance of each macro name by going in reverse
+  auto NewEnd = std::unique(
+  SimpleNames.rbegin(), SimpleNames.rend(),
+  [](const MacroOpt &A, const MacroOpt &B) { return A.first == B.first; });
+  SimpleNames.erase(SimpleNames.begin(), NewEnd.base());
+
+  // Apply permutation.
+  decltype(PPOpts.Macros) NewMacros;
+  NewMacros.reserve(SimpleNames.size());
+  for (std::size_t I = 0, E = SimpleNames.size(); I != E; ++I) {
+std::size_t OriginalIndex = SimpleNames[I].second;
+// We still emit undefines here as they may be undefining a predefined 
macro
+NewMacros.push_back(std::move(PPOpts.Macros[OriginalIndex]));
+  }
+  std::swap(PPOpts.Macros, NewMacros);
+}
+
 /// A clang tool that runs the preprocessor in a mode that's optimized for
 /// dependency scanning for the given compiler invocation.
 class DependencyScanningAction : public tooling::ToolAction {
@@ -203,6 +275,8 @@ class DependencyScanningAction : public tooling::ToolAc

[clang] reland: [clang][ScanDeps] Canonicalize -D and -U flags (PR #82568)

2024-02-21 Thread via cfe-commits

llvmbot wrote:




@llvm/pr-subscribers-clang

Author: Michael Spencer (Bigcheese)


Changes

Canonicalize `-D` and `-U` flags by sorting them and only keeping the last 
instance of a given name.

This optimization will only fire if all `-D` and `-U` flags start with a simple 
identifier that we can guarantee a simple analysis of can determine if two 
flags refer to the same identifier or not. See the comment on 
`getSimpleMacroName()` for details of what the issues are.

Previous version of this had issues with sed differences between macOS, Linux, 
and Windows. This test doesn't check paths, so just don't run sed.
Other tests should use `sed -E 's:?:/:g'` to get portable behavior.

---
Full diff: https://github.com/llvm/llvm-project/pull/82568.diff


4 Files Affected:

- (modified) 
clang/include/clang/Tooling/DependencyScanning/DependencyScanningService.h 
(+4-1) 
- (modified) clang/lib/Tooling/DependencyScanning/DependencyScanningWorker.cpp 
(+74) 
- (added) clang/test/ClangScanDeps/optimize-canonicalize-macros.m (+87) 
- (modified) clang/tools/clang-scan-deps/ClangScanDeps.cpp (+1) 


``diff
diff --git 
a/clang/include/clang/Tooling/DependencyScanning/DependencyScanningService.h 
b/clang/include/clang/Tooling/DependencyScanning/DependencyScanningService.h
index 4f9867262a275c..557f0e547ab4a8 100644
--- a/clang/include/clang/Tooling/DependencyScanning/DependencyScanningService.h
+++ b/clang/include/clang/Tooling/DependencyScanning/DependencyScanningService.h
@@ -60,7 +60,10 @@ enum class ScanningOptimizations {
   /// Remove unused -ivfsoverlay arguments.
   VFS = 4,
 
-  DSS_LAST_BITMASK_ENUM(VFS),
+  /// Canonicalize -D and -U options.
+  Macros = 8,
+
+  DSS_LAST_BITMASK_ENUM(Macros),
   Default = All
 };
 
diff --git a/clang/lib/Tooling/DependencyScanning/DependencyScanningWorker.cpp 
b/clang/lib/Tooling/DependencyScanning/DependencyScanningWorker.cpp
index 3cf3ad8a4e4907..7477b930188b4f 100644
--- a/clang/lib/Tooling/DependencyScanning/DependencyScanningWorker.cpp
+++ b/clang/lib/Tooling/DependencyScanning/DependencyScanningWorker.cpp
@@ -179,6 +179,78 @@ static void sanitizeDiagOpts(DiagnosticOptions &DiagOpts) {
   DiagOpts.IgnoreWarnings = true;
 }
 
+// Clang implements -D and -U by splatting text into a predefines buffer. This
+// allows constructs such as `-DFඞ=3 "-D F\u{0D9E} 4 3 2”` to be accepted and
+// define the same macro, or adding C++ style comments before the macro name.
+//
+// This function checks that the first non-space characters in the macro
+// obviously form an identifier that can be uniqued on without lexing. Failing
+// to do this could lead to changing the final definition of a macro.
+//
+// We could set up a preprocessor and actually lex the name, but that's very
+// heavyweight for a situation that will almost never happen in practice.
+static std::optional getSimpleMacroName(StringRef Macro) {
+  StringRef Name = Macro.split("=").first.ltrim(" \t");
+  std::size_t I = 0;
+
+  auto FinishName = [&]() -> std::optional {
+StringRef SimpleName = Name.slice(0, I);
+if (SimpleName.empty())
+  return std::nullopt;
+return SimpleName;
+  };
+
+  for (; I != Name.size(); ++I) {
+switch (Name[I]) {
+case '(': // Start of macro parameter list
+case ' ': // End of macro name
+case '\t':
+  return FinishName();
+case '_':
+  continue;
+default:
+  if (llvm::isAlnum(Name[I]))
+continue;
+  return std::nullopt;
+}
+  }
+  return FinishName();
+}
+
+static void canonicalizeDefines(PreprocessorOptions &PPOpts) {
+  using MacroOpt = std::pair;
+  std::vector SimpleNames;
+  SimpleNames.reserve(PPOpts.Macros.size());
+  std::size_t Index = 0;
+  for (const auto &M : PPOpts.Macros) {
+auto SName = getSimpleMacroName(M.first);
+// Skip optimizing if we can't guarantee we can preserve relative order.
+if (!SName)
+  return;
+SimpleNames.emplace_back(*SName, Index);
+++Index;
+  }
+
+  llvm::stable_sort(SimpleNames, [](const MacroOpt &A, const MacroOpt &B) {
+return A.first < B.first;
+  });
+  // Keep the last instance of each macro name by going in reverse
+  auto NewEnd = std::unique(
+  SimpleNames.rbegin(), SimpleNames.rend(),
+  [](const MacroOpt &A, const MacroOpt &B) { return A.first == B.first; });
+  SimpleNames.erase(SimpleNames.begin(), NewEnd.base());
+
+  // Apply permutation.
+  decltype(PPOpts.Macros) NewMacros;
+  NewMacros.reserve(SimpleNames.size());
+  for (std::size_t I = 0, E = SimpleNames.size(); I != E; ++I) {
+std::size_t OriginalIndex = SimpleNames[I].second;
+// We still emit undefines here as they may be undefining a predefined 
macro
+NewMacros.push_back(std::move(PPOpts.Macros[OriginalIndex]));
+  }
+  std::swap(PPOpts.Macros, NewMacros);
+}
+
 /// A clang tool that runs the preprocessor in a mode that's optimized for
 /// dependency scanning for the given compiler invocation.
 class DependencyScanningAction : public tooling::ToolAct

[clang] reland: [clang][ScanDeps] Canonicalize -D and -U flags (PR #82568)

2024-02-21 Thread Michael Spencer via cfe-commits

https://github.com/Bigcheese created 
https://github.com/llvm/llvm-project/pull/82568

Canonicalize `-D` and `-U` flags by sorting them and only keeping the last 
instance of a given name.

This optimization will only fire if all `-D` and `-U` flags start with a simple 
identifier that we can guarantee a simple analysis of can determine if two 
flags refer to the same identifier or not. See the comment on 
`getSimpleMacroName()` for details of what the issues are.

Previous version of this had issues with sed differences between macOS, Linux, 
and Windows. This test doesn't check paths, so just don't run sed.
Other tests should use `sed -E 's:?:/:g'` to get portable behavior.

>From 21300748d867e321fc16ba51aea4d0330d8d27a6 Mon Sep 17 00:00:00 2001
From: Michael Spencer 
Date: Fri, 16 Feb 2024 22:05:25 -0800
Subject: [PATCH] [clang][ScanDeps] Canonicalize -D and -U flags

Canonicalize `-D` and `-U` flags by sorting them and only keeping the
last instance of a given name.

This optimization will only fire if all `-D` and `-U` flags start with
a simple identifier that we can guarantee a simple analysis of can
determine if two flags refer to the same identifier or not. See the
comment on `getSimpleMacroName()` for details of what the issues are.
---
 .../DependencyScanningService.h   |  5 +-
 .../DependencyScanningWorker.cpp  | 74 
 .../optimize-canonicalize-macros.m| 87 +++
 clang/tools/clang-scan-deps/ClangScanDeps.cpp |  1 +
 4 files changed, 166 insertions(+), 1 deletion(-)
 create mode 100644 clang/test/ClangScanDeps/optimize-canonicalize-macros.m

diff --git 
a/clang/include/clang/Tooling/DependencyScanning/DependencyScanningService.h 
b/clang/include/clang/Tooling/DependencyScanning/DependencyScanningService.h
index 4f9867262a275c..557f0e547ab4a8 100644
--- a/clang/include/clang/Tooling/DependencyScanning/DependencyScanningService.h
+++ b/clang/include/clang/Tooling/DependencyScanning/DependencyScanningService.h
@@ -60,7 +60,10 @@ enum class ScanningOptimizations {
   /// Remove unused -ivfsoverlay arguments.
   VFS = 4,
 
-  DSS_LAST_BITMASK_ENUM(VFS),
+  /// Canonicalize -D and -U options.
+  Macros = 8,
+
+  DSS_LAST_BITMASK_ENUM(Macros),
   Default = All
 };
 
diff --git a/clang/lib/Tooling/DependencyScanning/DependencyScanningWorker.cpp 
b/clang/lib/Tooling/DependencyScanning/DependencyScanningWorker.cpp
index 3cf3ad8a4e4907..7477b930188b4f 100644
--- a/clang/lib/Tooling/DependencyScanning/DependencyScanningWorker.cpp
+++ b/clang/lib/Tooling/DependencyScanning/DependencyScanningWorker.cpp
@@ -179,6 +179,78 @@ static void sanitizeDiagOpts(DiagnosticOptions &DiagOpts) {
   DiagOpts.IgnoreWarnings = true;
 }
 
+// Clang implements -D and -U by splatting text into a predefines buffer. This
+// allows constructs such as `-DFඞ=3 "-D F\u{0D9E} 4 3 2”` to be accepted and
+// define the same macro, or adding C++ style comments before the macro name.
+//
+// This function checks that the first non-space characters in the macro
+// obviously form an identifier that can be uniqued on without lexing. Failing
+// to do this could lead to changing the final definition of a macro.
+//
+// We could set up a preprocessor and actually lex the name, but that's very
+// heavyweight for a situation that will almost never happen in practice.
+static std::optional getSimpleMacroName(StringRef Macro) {
+  StringRef Name = Macro.split("=").first.ltrim(" \t");
+  std::size_t I = 0;
+
+  auto FinishName = [&]() -> std::optional {
+StringRef SimpleName = Name.slice(0, I);
+if (SimpleName.empty())
+  return std::nullopt;
+return SimpleName;
+  };
+
+  for (; I != Name.size(); ++I) {
+switch (Name[I]) {
+case '(': // Start of macro parameter list
+case ' ': // End of macro name
+case '\t':
+  return FinishName();
+case '_':
+  continue;
+default:
+  if (llvm::isAlnum(Name[I]))
+continue;
+  return std::nullopt;
+}
+  }
+  return FinishName();
+}
+
+static void canonicalizeDefines(PreprocessorOptions &PPOpts) {
+  using MacroOpt = std::pair;
+  std::vector SimpleNames;
+  SimpleNames.reserve(PPOpts.Macros.size());
+  std::size_t Index = 0;
+  for (const auto &M : PPOpts.Macros) {
+auto SName = getSimpleMacroName(M.first);
+// Skip optimizing if we can't guarantee we can preserve relative order.
+if (!SName)
+  return;
+SimpleNames.emplace_back(*SName, Index);
+++Index;
+  }
+
+  llvm::stable_sort(SimpleNames, [](const MacroOpt &A, const MacroOpt &B) {
+return A.first < B.first;
+  });
+  // Keep the last instance of each macro name by going in reverse
+  auto NewEnd = std::unique(
+  SimpleNames.rbegin(), SimpleNames.rend(),
+  [](const MacroOpt &A, const MacroOpt &B) { return A.first == B.first; });
+  SimpleNames.erase(SimpleNames.begin(), NewEnd.base());
+
+  // Apply permutation.
+  decltype(PPOpts.Macros) NewMacros;
+  NewMacros.reserve(SimpleNames.si