[PATCH] D112915: [clang][modules] Track number of includes per submodule

2021-11-17 Thread Duncan P. N. Exon Smith via Phabricator via cfe-commits
dexonsmith added inline comments.



Comment at: clang/include/clang/Serialization/ModuleFile.h:399
+  /// include information.
+  llvm::StringMap> SubmoduleIncludedFiles;
+

jansvoboda11 wrote:
> dexonsmith wrote:
> > Each StringMapEntry is going to have a pretty big allocation here, for a 
> > 512B vector. Given that it doesn't need to be after creation, it'd be more 
> > efficient to use this pattern:
> > ```
> > lang=c++
> > llvm::StringMap> SubmoduleIncludedFiles;
> > SpecificBumpPtrAlloc SubmoduleIncludedFilesAlloc;
> > 
> > // later
> > MutableArrayRef 
> > Files(SubmoduleIncludedFiles.Allocate(Record.size()), Record.size());
> > llvm::copy(Record, Files.begin());
> > SubmoduleIncludedFiles[Key] = Files;
> > ```
> > 
> > That said, I feel like this should just be:
> > ```
> > lang=c++
> > DenseMap SubmoduleIncludedFiles;
> > ```
> > The key can point at the name of the submodule, which must already exist 
> > somewhere without needing a StringMap to create a new copy of it. The value 
> > is a lazily deserialized blob.
> I switched to `StringRef` value in the latest revision. Unfortunately, had to 
> use `std::string` as the key instead of `StringRef`, since 
> `getFullModuleName` constructs the string on heap. That forced me to use 
> `std::map`, too. I'll explore using something else entirely as the key.
Oh, if the key isn't being kept alive elsewhere, you can/should use 
`StringMap`, which is strictly better than `std::map` (except for non-deterministic iteration order). I suggested 
otherwise because I assumed the key already existed.

Also, if the map isn't being deleted from (much), might as well 
bump-ptr-allocate it:
```
lang=c++
BumpPtrAllocator SubmoduleIncludedFilesAlloc;
StringMap SubmoduleIncludedFiles; // iniitalized 
with teh Alloc above.
```

> I'll explore using something else entirely as the key.

Not necessarily important; just if the key was already (e.g., stored in its 
`Module` or something) you might as well use it. Unless, maybe if the `Module` 
is known to be constructed already, you could use its address?



Comment at: clang/include/clang/Serialization/ModuleFile.h:399
+  /// include information.
+  llvm::StringMap> SubmoduleIncludedFiles;
+

dexonsmith wrote:
> jansvoboda11 wrote:
> > dexonsmith wrote:
> > > Each StringMapEntry is going to have a pretty big allocation here, for a 
> > > 512B vector. Given that it doesn't need to be after creation, it'd be 
> > > more efficient to use this pattern:
> > > ```
> > > lang=c++
> > > llvm::StringMap> SubmoduleIncludedFiles;
> > > SpecificBumpPtrAlloc SubmoduleIncludedFilesAlloc;
> > > 
> > > // later
> > > MutableArrayRef 
> > > Files(SubmoduleIncludedFiles.Allocate(Record.size()), Record.size());
> > > llvm::copy(Record, Files.begin());
> > > SubmoduleIncludedFiles[Key] = Files;
> > > ```
> > > 
> > > That said, I feel like this should just be:
> > > ```
> > > lang=c++
> > > DenseMap SubmoduleIncludedFiles;
> > > ```
> > > The key can point at the name of the submodule, which must already exist 
> > > somewhere without needing a StringMap to create a new copy of it. The 
> > > value is a lazily deserialized blob.
> > I switched to `StringRef` value in the latest revision. Unfortunately, had 
> > to use `std::string` as the key instead of `StringRef`, since 
> > `getFullModuleName` constructs the string on heap. That forced me to use 
> > `std::map`, too. I'll explore using something else entirely as the key.
> Oh, if the key isn't being kept alive elsewhere, you can/should use 
> `StringMap`, which is strictly better than `std::map StringRef>` (except for non-deterministic iteration order). I suggested 
> otherwise because I assumed the key already existed.
> 
> Also, if the map isn't being deleted from (much), might as well 
> bump-ptr-allocate it:
> ```
> lang=c++
> BumpPtrAllocator SubmoduleIncludedFilesAlloc;
> StringMap SubmoduleIncludedFiles; // iniitalized 
> with teh Alloc above.
> ```
> 
> > I'll explore using something else entirely as the key.
> 
> Not necessarily important; just if the key was already (e.g., stored in its 
> `Module` or something) you might as well use it. Unless, maybe if the 
> `Module` is known to be constructed already, you could use its address?
> I switched to `StringRef` value in the latest revision. Unfortunately, had to 
> use `std::string` as the key instead of `StringRef`, since 
> `getFullModuleName` constructs the string on heap. That forced me to use 
> `std::map`, too. I'll explore using something else entirely as the key.




Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D112915/new/

https://reviews.llvm.org/D112915

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D112915: [clang][modules] Track number of includes per submodule

2021-11-17 Thread Jan Svoboda via Phabricator via cfe-commits
jansvoboda11 marked 4 inline comments as done.
jansvoboda11 added a comment.

In D112915#3136492 , @vsapsai wrote:

> Didn't go in-depth for serialization/deserialization. When we add tracking 
> `isImport` on per-submodule basis, do you think AST reading/writing would 
> change significantly?

I think moving `isImport` into `Preprocessor` can be done in a similar way to 
how we handle the number of includes (or rather "has been included/imported" 
behavior), so we should be able to reuse some parts of this patch there.

One question that remains to be answered is whether to keep two separate 
`DenseSet` instances, or merge them into something like:

  struct IncludeInfo {
bool IsImportedOrIncluded; // currently Preprocessor::IncludedFiles
bool IsImported;   // currently HeaderFileInfo::isImport
  }
  llvm::DenseMap IncludedFilesInfo;

Thanks a lot for the feedback, I appreciate it.




Comment at: clang/include/clang/Lex/ExternalPreprocessorSource.h:45
+  /// Return the files directly included in the given (sub)module.
+  virtual const llvm::DenseMap *
+  getIncludedFiles(Module *M) = 0;

vsapsai wrote:
> I think it is better for understanding and more convenient to use some 
> `using` instead of duplicating `llvm::DenseMap` 
> in multiple places.
Yeah, spelling the type out everywhere is a bit unwieldy. It's a bit better 
with `llvm::DenseSet` in the latest revision, but still not 
pretty. I wanted to avoid pulling `Preprocessor.`h everywhere, but will 
probably reconsider doing that.



Comment at: clang/include/clang/Lex/Preprocessor.h:775-781
+  struct SubmoduleIncludeState {
+/// For each included file, we track the number of includes.
+llvm::DenseMap IncludedFiles;
+
+/// The set of modules that are visible within the submodule.
+VisibleModuleSet VisibleModules;
+  };

vsapsai wrote:
> I think the interplay between `CurSubmoduleIncludeState`, `IncludedFiles`, 
> and `CurSubmoduleState` is pretty complicated. Recently I've realized that it 
> can be beneficial to distinguish include tracking for the purpose of 
> serializing per submodule and for the purpose of deciding if should enter a 
> file. In D114051 I've tried to illustrate this approach. There are some tests 
> failing but hope the main idea is still understandable.
> 
> One of my big concerns is tracking `VisibleModules` in multiple places. 
> D114051 demonstrates one of the ways to deal with it but I think it is more 
> important for you to know the problem I was trying to solve, not the solution 
> I came up with.
Thanks a ton, this must've taken quite a bit of time to put together. I agree 
your approach is much simpler. I'll investigate how it behaves on larger 
projects, but probably will end up adopting it in this patch.

I have done some minor tweaks locally to fix the failing PCH tests, 
`Modules/import-textual-noguard.mm` doesn't make much sense to me, I think I'll 
end up updating that test.



Comment at: clang/lib/Lex/Preprocessor.cpp:1336
+  [&](Module *M) {
+const auto *Includes = getLocalSubmoduleIncludes(M);
+if (!Includes)

vsapsai wrote:
> If I drop checking `getLocalSubmoduleIncludes`, no tests are failing. But it 
> seems like this call is required. How can we test it?
I think this should kick in when importing a submodule from the same module. 
I'll try to come up with a test case.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D112915/new/

https://reviews.llvm.org/D112915

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D112915: [clang][modules] Track number of includes per submodule

2021-11-17 Thread Jan Svoboda via Phabricator via cfe-commits
jansvoboda11 marked 5 inline comments as done.
jansvoboda11 added a comment.

In D112915#3122340 , @dexonsmith 
wrote:

> Implementation looks a lot cleaner!
>
> I'd still like to drop NumIncludes first if possible because I think it'll be 
> easier to reason about without this extra layer of complexity. Also, that'd 
> mitigate the potential regression in `.pcm` size.

Done in D114096 . Thanks for the feedback.




Comment at: clang/include/clang/Lex/HeaderSearch.h:133-139
-
-  /// Determine whether this is a non-default header file info, e.g.,
-  /// it corresponds to an actual header we've included or tried to include.
-  bool isNonDefault() const {
-return isImport || isPragmaOnce || NumIncludes || ControllingMacro ||
-  ControllingMacroID;
-  }

dexonsmith wrote:
> Looks like this is already dead code? If so, please separate out and commit 
> ahead of time (e.g., now).
Done in D114092.



Comment at: clang/include/clang/Serialization/ModuleFile.h:399
+  /// include information.
+  llvm::StringMap> SubmoduleIncludedFiles;
+

dexonsmith wrote:
> Each StringMapEntry is going to have a pretty big allocation here, for a 512B 
> vector. Given that it doesn't need to be after creation, it'd be more 
> efficient to use this pattern:
> ```
> lang=c++
> llvm::StringMap> SubmoduleIncludedFiles;
> SpecificBumpPtrAlloc SubmoduleIncludedFilesAlloc;
> 
> // later
> MutableArrayRef 
> Files(SubmoduleIncludedFiles.Allocate(Record.size()), Record.size());
> llvm::copy(Record, Files.begin());
> SubmoduleIncludedFiles[Key] = Files;
> ```
> 
> That said, I feel like this should just be:
> ```
> lang=c++
> DenseMap SubmoduleIncludedFiles;
> ```
> The key can point at the name of the submodule, which must already exist 
> somewhere without needing a StringMap to create a new copy of it. The value 
> is a lazily deserialized blob.
I switched to `StringRef` value in the latest revision. Unfortunately, had to 
use `std::string` as the key instead of `StringRef`, since `getFullModuleName` 
constructs the string on heap. That forced me to use `std::map`, too. I'll 
explore using something else entirely as the key.



Comment at: clang/lib/Lex/Preprocessor.cpp:1323
+auto  = SubmoduleIncludeStates[M].NumIncludes;
+for (unsigned UID = 0; UID <= ModNumIncludes.maxUID(); ++UID) {
+  CurSubmoduleIncludeState->NumIncludes[UID] += ModNumIncludes[UID];

dexonsmith wrote:
> jansvoboda11 wrote:
> > dexonsmith wrote:
> > > jansvoboda11 wrote:
> > > > Iterating over all FileEntries is probably not very efficient, as 
> > > > Volodymyr mentioned. Thinking about how to make this more efficient...
> > > My suggestion above to drop FileEntryMap in favour of a simple DenseMap 
> > > would help a bit, just iterating through the files actually included by 
> > > the submodules.
> > > 
> > > Further, I wonder if "num-includes"/file/submodule (`unsigned`) is 
> > > actually useful, vs. "was-included"/file/submodule (`bool`). The only 
> > > observer I see is `HeaderSearch::PrintStats()` but maybe I missed 
> > > something?
> > > 
> > > If I'm right and we can switch to `bool`, then NumIncludes becomes a 
> > > `DenseSet IncludedFiles` (or `DenseSet` for UIDs). 
> > > (BTW, I also wondered if you could rotate the map, using File as the 
> > > outer key, and then use bitsets for the sbumodules, but I doubt this is 
> > > better, since most files are only included by a few submodules, right?)
> > > 
> > > Then you can just do a set union here. Also simplifies bitcode 
> > > serialization.
> > > 
> > > (If a `bool`/set is sufficient, then I'd suggest landing that 
> > > first/separately, before adding the per-submodule granularity in this 
> > > patch.)
> > For each file, we need to have three distinct states: not included at all, 
> > included exactly once (`firstTimeLexingFile`), included more than once. 
> > This means we can't use a single `DenseSet`.
> > 
> > But we could use a `DenseMap`, where "not included at all" can 
> > be expressed as being absent from the map, exactly once as having `true` in 
> > the map and more than once as having `false` in the map. Alternatively, we 
> > could use two `DenseSet` instances to encode the same, but I don't think 
> > having two lookups per file to determine stuff is efficient.
> > 
> > I can look into this in a follow-up patch.
> Seems like a DenseSet could still be used by having HeaderInfo pass back the 
> WasInserted bit from the insertion to the preprocessor, and threading it 
> through to Preprocessor::HandleEndOfFile (the only caller of 
> FirstTimeLexingFile):
> ```
> lang=c++
> bool IsFirst = Set.insert(Key).second;
> ```
> The threading doesn't seem too hard. Looking at main:
> - Preprocessor::HandleHeaderIncludeOrImport calls 
> HeaderInfo::ShouldEnterIncludeFile. This does the 

[PATCH] D112915: [clang][modules] Track number of includes per submodule

2021-11-17 Thread Jan Svoboda via Phabricator via cfe-commits
jansvoboda11 updated this revision to Diff 387952.
jansvoboda11 added a comment.

Rebase on top of extracted patches.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D112915/new/

https://reviews.llvm.org/D112915

Files:
  clang/include/clang/Lex/ExternalPreprocessorSource.h
  clang/include/clang/Lex/Preprocessor.h
  clang/include/clang/Serialization/ASTBitCodes.h
  clang/include/clang/Serialization/ASTReader.h
  clang/include/clang/Serialization/ASTWriter.h
  clang/include/clang/Serialization/ModuleFile.h
  clang/lib/Basic/Module.cpp
  clang/lib/Lex/PPLexerChange.cpp
  clang/lib/Lex/Preprocessor.cpp
  clang/lib/Serialization/ASTReader.cpp
  clang/lib/Serialization/ASTWriter.cpp
  clang/test/Modules/import-submodule-visibility.c

Index: clang/test/Modules/import-submodule-visibility.c
===
--- /dev/null
+++ clang/test/Modules/import-submodule-visibility.c
@@ -0,0 +1,99 @@
+// This test checks that imports of headers that appeared in a different submodule than
+// what is imported by the current TU don't affect the compilation.
+
+// RUN: rm -rf %t
+// RUN: split-file %s %t
+
+//--- A.framework/Headers/A.h
+#include "Textual.h"
+//--- A.framework/Modules/module.modulemap
+framework module A { header "A.h" }
+
+//--- B.framework/Headers/B1.h
+#include "Textual.h"
+//--- B.framework/Headers/B2.h
+//--- B.framework/Modules/module.modulemap
+framework module B {
+  module B1 { header "B1.h" }
+  module B2 { header "B2.h" }
+}
+
+//--- C/C.h
+#include "Textual.h"
+//--- C/module.modulemap
+module C { header "C.h" }
+
+//--- D/D1.h
+#include "Textual.h"
+//--- D/D2.h
+//--- D/module.modulemap
+module D {
+  module D1 { header "D1.h" }
+  module D2 { header "D2.h" }
+}
+
+//--- E/E1.h
+#include "E2.h"
+//--- E/E2.h
+#include "Textual.h"
+//--- E/module.modulemap
+module E {
+  module E1 { header "E1.h" }
+  module E2 { header "E2.h" }
+}
+
+//--- Textual.h
+#define MACRO_TEXTUAL 1
+
+//--- test.c
+
+#ifdef A
+//
+#endif
+
+#ifdef B
+#import 
+#endif
+
+#ifdef C
+//
+#endif
+
+#ifdef D
+#import "D/D2.h"
+#endif
+
+#ifdef E
+#import "E/E1.h"
+#endif
+
+#import "Textual.h"
+
+static int x = MACRO_TEXTUAL;
+
+// Specifying the PCM file on the command line (without actually importing "A") should not
+// prevent "Textual.h" to be included in the TU.
+//
+// RUN: %clang_cc1 -fmodules -I %t -emit-module %t/A.framework/Modules/module.modulemap -fmodule-name=A -o %t/A.pcm
+// RUN: %clang_cc1 -fmodules -I %t -fsyntax-only %t/test.c -DA -fmodule-file=%t/A.pcm
+
+// Specifying the PCM file on the command line and importing "B2" in the source does not
+// prevent "Textual.h" to be included in the TU.
+//
+// RUN: %clang_cc1 -fmodules -I %t -emit-module %t/B.framework/Modules/module.modulemap -fmodule-name=B -o %t/B.pcm
+// RUN: %clang_cc1 -fmodules -I %t -fsyntax-only %t/test.c -DB -iframework %t -fmodule-file=%t/B.pcm
+
+// Module-only version of the test with framework A.
+//
+// RUN: %clang_cc1 -fmodules -I %t -emit-module %t/C/module.modulemap -fmodule-name=C -o %t/C.pcm
+// RUN: %clang_cc1 -fmodules -I %t -fsyntax-only %t/test.c -DC -fmodule-file=%t/C.pcm
+
+// Module-only version of the test with framework B.
+//
+// RUN: %clang_cc1 -fmodules -I %t -emit-module %t/D/module.modulemap -fmodule-name=D -o %t/D.pcm
+// RUN: %clang_cc1 -fmodules -I %t -fsyntax-only %t/test.c -DD -fmodule-file=%t/D.pcm
+
+// Transitively imported, but not exported.
+//
+// RUN: %clang_cc1 -fmodules -I %t -emit-module %t/E/module.modulemap -fmodule-name=E -o %t/E.pcm
+// RUN: %clang_cc1 -fmodules -I %t -fsyntax-only %t/test.c -DE -fmodule-file=%t/E.pcm
Index: clang/lib/Serialization/ASTWriter.cpp
===
--- clang/lib/Serialization/ASTWriter.cpp
+++ clang/lib/Serialization/ASTWriter.cpp
@@ -2170,11 +2170,12 @@
   return false;
 }
 
-void ASTWriter::writeIncludedFiles(raw_ostream , const Preprocessor ) {
+void ASTWriter::writeIncludedFiles(
+raw_ostream , const llvm::DenseSet ) {
   using namespace llvm::support;
 
   std::vector IncludedInputFiles;
-  for (const auto  : PP.getIncludedFiles()) {
+  for (const auto  : Files) {
 auto InputFileIt = InputFileIDs.find(File);
 if (InputFileIt == InputFileIDs.end())
   continue;
@@ -2406,7 +2407,7 @@
 
 SmallString<2048> Buffer;
 raw_svector_ostream Out(Buffer);
-writeIncludedFiles(Out, PP);
+writeIncludedFiles(Out, PP.getNullSubmoduleIncludes());
 RecordData::value_type Record[] = {PP_INCLUDED_FILES};
 Stream.EmitRecordWithBlob(IncludedFilesAbbrev, Record, Buffer.data(),
   Buffer.size());
@@ -2666,6 +2667,11 @@
   Abbrev->Add(BitCodeAbbrevOp(BitCodeAbbrevOp::Blob));// Macro name
   unsigned ExportAsAbbrev = Stream.EmitAbbrev(std::move(Abbrev));
 
+  Abbrev = std::make_shared();
+  Abbrev->Add(BitCodeAbbrevOp(SUBMODULE_INCLUDED_FILES));
+  

[PATCH] D112915: [clang][modules] Track number of includes per submodule

2021-11-16 Thread Volodymyr Sapsai via Phabricator via cfe-commits
vsapsai added a comment.

Didn't go in-depth for serialization/deserialization. When we add tracking 
`isImport` on per-submodule basis, do you think AST reading/writing would 
change significantly?




Comment at: clang/include/clang/Lex/ExternalPreprocessorSource.h:45
+  /// Return the files directly included in the given (sub)module.
+  virtual const llvm::DenseMap *
+  getIncludedFiles(Module *M) = 0;

I think it is better for understanding and more convenient to use some `using` 
instead of duplicating `llvm::DenseMap` in 
multiple places.



Comment at: clang/include/clang/Lex/Preprocessor.h:775-781
+  struct SubmoduleIncludeState {
+/// For each included file, we track the number of includes.
+llvm::DenseMap IncludedFiles;
+
+/// The set of modules that are visible within the submodule.
+VisibleModuleSet VisibleModules;
+  };

I think the interplay between `CurSubmoduleIncludeState`, `IncludedFiles`, and 
`CurSubmoduleState` is pretty complicated. Recently I've realized that it can 
be beneficial to distinguish include tracking for the purpose of serializing 
per submodule and for the purpose of deciding if should enter a file. In 
D114051 I've tried to illustrate this approach. There are some tests failing 
but hope the main idea is still understandable.

One of my big concerns is tracking `VisibleModules` in multiple places. D114051 
demonstrates one of the ways to deal with it but I think it is more important 
for you to know the problem I was trying to solve, not the solution I came up 
with.



Comment at: clang/lib/Basic/Module.cpp:653
 ImportLocs[ID] = Loc;
-Vis(M);
+Vis(V.M);
 

Was meaning to make this fix for a long time but couldn't test it. Thanks for 
finally fixing it!



Comment at: clang/lib/Lex/Preprocessor.cpp:1336
+  [&](Module *M) {
+const auto *Includes = getLocalSubmoduleIncludes(M);
+if (!Includes)

If I drop checking `getLocalSubmoduleIncludes`, no tests are failing. But it 
seems like this call is required. How can we test it?


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D112915/new/

https://reviews.llvm.org/D112915

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D112915: [clang][modules] Track number of includes per submodule

2021-11-10 Thread Volodymyr Sapsai via Phabricator via cfe-commits
vsapsai added a comment.

As we've discussed earlier, tracking `isImport` shouldn't be done per .pcm and 
here is the test case 
https://gist.github.com/vsapsai/a2d2bd19c54c24540495fd9b262106aa I'm not sure 
it is worth adding the second `#include` as the test fails just with one.

Overall, the change seems more complicated than it has to be. I need to check 
it carefully to see what can be simplified. And I need to check in debugger how 
and when AST reading is triggered. Looks like not all of that is lazy but I 
need to check the compiled code, not my guesses.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D112915/new/

https://reviews.llvm.org/D112915

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D112915: [clang][modules] Track number of includes per submodule

2021-11-10 Thread Duncan P. N. Exon Smith via Phabricator via cfe-commits
dexonsmith added a comment.

Implementation looks a lot cleaner!

I'd still like to drop NumIncludes first if possible because I think it'll be 
easier to reason about without this extra layer of complexity. Also, that'd 
mitigate the potential regression in `.pcm` size.

(Note: I'd be more comfortable with @vsapsai and/or @rsmith reviewing the 
overall direction; I just jumped in for the implementation details.)




Comment at: clang/include/clang/Lex/HeaderSearch.h:133-139
-
-  /// Determine whether this is a non-default header file info, e.g.,
-  /// it corresponds to an actual header we've included or tried to include.
-  bool isNonDefault() const {
-return isImport || isPragmaOnce || NumIncludes || ControllingMacro ||
-  ControllingMacroID;
-  }

Looks like this is already dead code? If so, please separate out and commit 
ahead of time (e.g., now).



Comment at: clang/include/clang/Serialization/ModuleFile.h:399
+  /// include information.
+  llvm::StringMap> SubmoduleIncludedFiles;
+

Each StringMapEntry is going to have a pretty big allocation here, for a 512B 
vector. Given that it doesn't need to be after creation, it'd be more efficient 
to use this pattern:
```
lang=c++
llvm::StringMap> SubmoduleIncludedFiles;
SpecificBumpPtrAlloc SubmoduleIncludedFilesAlloc;

// later
MutableArrayRef Files(SubmoduleIncludedFiles.Allocate(Record.size()), 
Record.size());
llvm::copy(Record, Files.begin());
SubmoduleIncludedFiles[Key] = Files;
```

That said, I feel like this should just be:
```
lang=c++
DenseMap SubmoduleIncludedFiles;
```
The key can point at the name of the submodule, which must already exist 
somewhere without needing a StringMap to create a new copy of it. The value is 
a lazily deserialized blob.



Comment at: clang/lib/Lex/Preprocessor.cpp:1323
+auto  = SubmoduleIncludeStates[M].NumIncludes;
+for (unsigned UID = 0; UID <= ModNumIncludes.maxUID(); ++UID) {
+  CurSubmoduleIncludeState->NumIncludes[UID] += ModNumIncludes[UID];

jansvoboda11 wrote:
> dexonsmith wrote:
> > jansvoboda11 wrote:
> > > Iterating over all FileEntries is probably not very efficient, as 
> > > Volodymyr mentioned. Thinking about how to make this more efficient...
> > My suggestion above to drop FileEntryMap in favour of a simple DenseMap 
> > would help a bit, just iterating through the files actually included by the 
> > submodules.
> > 
> > Further, I wonder if "num-includes"/file/submodule (`unsigned`) is actually 
> > useful, vs. "was-included"/file/submodule (`bool`). The only observer I see 
> > is `HeaderSearch::PrintStats()` but maybe I missed something?
> > 
> > If I'm right and we can switch to `bool`, then NumIncludes becomes a 
> > `DenseSet IncludedFiles` (or `DenseSet` for UIDs). 
> > (BTW, I also wondered if you could rotate the map, using File as the outer 
> > key, and then use bitsets for the sbumodules, but I doubt this is better, 
> > since most files are only included by a few submodules, right?)
> > 
> > Then you can just do a set union here. Also simplifies bitcode 
> > serialization.
> > 
> > (If a `bool`/set is sufficient, then I'd suggest landing that 
> > first/separately, before adding the per-submodule granularity in this 
> > patch.)
> For each file, we need to have three distinct states: not included at all, 
> included exactly once (`firstTimeLexingFile`), included more than once. This 
> means we can't use a single `DenseSet`.
> 
> But we could use a `DenseMap`, where "not included at all" can be 
> expressed as being absent from the map, exactly once as having `true` in the 
> map and more than once as having `false` in the map. Alternatively, we could 
> use two `DenseSet` instances to encode the same, but I don't think having two 
> lookups per file to determine stuff is efficient.
> 
> I can look into this in a follow-up patch.
Seems like a DenseSet could still be used by having HeaderInfo pass back the 
WasInserted bit from the insertion to the preprocessor, and threading it 
through to Preprocessor::HandleEndOfFile (the only caller of 
FirstTimeLexingFile):
```
lang=c++
bool IsFirst = Set.insert(Key).second;
```
The threading doesn't seem too hard. Looking at main:
- Preprocessor::HandleHeaderIncludeOrImport calls 
HeaderInfo::ShouldEnterIncludeFile. This does the `++FI.NumIncludes` (going 
from 0 to 1). Instead, it could be `IsFirst = !FI.WasIncluded; FI.WasIncluded = 
true;`, then return `IsFirst` somehow. (Then your patch can pull `IsFirst` from 
the `insert().second`).
- Preprocessor::HandleHeaderIncludeOrImport calls 
Preprocessor::EnterSourceFile. This creates a new Lexer for that file. 
`IsFirst` can be stored on that Lexer.
- Preprocessor::HandleEndOfFile calls FirstTimeLexingFile. Instead, it can 
check the new accessor `CurLexer->isFirstTimeLexing()`.

> I can look into this in a follow-up patch.

Follow-up might be okay, but it'd be nice to 

[PATCH] D112915: [clang][modules] Track number of includes per submodule

2021-11-10 Thread Jan Svoboda via Phabricator via cfe-commits
jansvoboda11 updated this revision to Diff 386139.
jansvoboda11 added a comment.

Store only direct includes in the PCM (compared to transitive stored 
previously), use InputFile ID (instead of full filesystem path).

Also: add test of transitive includes, fix bug in 
`VisibleModuleSet::setVisible`.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D112915/new/

https://reviews.llvm.org/D112915

Files:
  clang/include/clang/Lex/ExternalPreprocessorSource.h
  clang/include/clang/Lex/HeaderSearch.h
  clang/include/clang/Lex/Preprocessor.h
  clang/include/clang/Serialization/ASTBitCodes.h
  clang/include/clang/Serialization/ASTReader.h
  clang/include/clang/Serialization/ASTWriter.h
  clang/include/clang/Serialization/ModuleFile.h
  clang/lib/Basic/Module.cpp
  clang/lib/Lex/HeaderSearch.cpp
  clang/lib/Lex/PPDirectives.cpp
  clang/lib/Lex/PPLexerChange.cpp
  clang/lib/Lex/Preprocessor.cpp
  clang/lib/Serialization/ASTReader.cpp
  clang/lib/Serialization/ASTWriter.cpp
  clang/test/Modules/import-submodule-visibility.c

Index: clang/test/Modules/import-submodule-visibility.c
===
--- /dev/null
+++ clang/test/Modules/import-submodule-visibility.c
@@ -0,0 +1,99 @@
+// This test checks that imports of headers that appeared in a different submodule than
+// what is imported by the current TU don't affect the compilation.
+
+// RUN: rm -rf %t
+// RUN: split-file %s %t
+
+//--- A.framework/Headers/A.h
+#include "Textual.h"
+//--- A.framework/Modules/module.modulemap
+framework module A { header "A.h" }
+
+//--- B.framework/Headers/B1.h
+#include "Textual.h"
+//--- B.framework/Headers/B2.h
+//--- B.framework/Modules/module.modulemap
+framework module B {
+  module B1 { header "B1.h" }
+  module B2 { header "B2.h" }
+}
+
+//--- C/C.h
+#include "Textual.h"
+//--- C/module.modulemap
+module C { header "C.h" }
+
+//--- D/D1.h
+#include "Textual.h"
+//--- D/D2.h
+//--- D/module.modulemap
+module D {
+  module D1 { header "D1.h" }
+  module D2 { header "D2.h" }
+}
+
+//--- E/E1.h
+#include "E2.h"
+//--- E/E2.h
+#include "Textual.h"
+//--- E/module.modulemap
+module E {
+  module E1 { header "E1.h" }
+  module E2 { header "E2.h" }
+}
+
+//--- Textual.h
+#define MACRO_TEXTUAL 1
+
+//--- test.c
+
+#ifdef A
+//
+#endif
+
+#ifdef B
+#import 
+#endif
+
+#ifdef C
+//
+#endif
+
+#ifdef D
+#import "D/D2.h"
+#endif
+
+#ifdef E
+#import "E/E1.h"
+#endif
+
+#import "Textual.h"
+
+static int x = MACRO_TEXTUAL;
+
+// Specifying the PCM file on the command line (without actually importing "A") should not
+// prevent "Textual.h" to be included in the TU.
+//
+// RUN: %clang_cc1 -fmodules -I %t -emit-module %t/A.framework/Modules/module.modulemap -fmodule-name=A -o %t/A.pcm
+// RUN: %clang_cc1 -fmodules -I %t -fsyntax-only %t/test.c -DA -fmodule-file=%t/A.pcm
+
+// Specifying the PCM file on the command line and importing "B2" in the source does not
+// prevent "Textual.h" to be included in the TU.
+//
+// RUN: %clang_cc1 -fmodules -I %t -emit-module %t/B.framework/Modules/module.modulemap -fmodule-name=B -o %t/B.pcm
+// RUN: %clang_cc1 -fmodules -I %t -fsyntax-only %t/test.c -DB -iframework %t -fmodule-file=%t/B.pcm
+
+// Module-only version of the test with framework A.
+//
+// RUN: %clang_cc1 -fmodules -I %t -emit-module %t/C/module.modulemap -fmodule-name=C -o %t/C.pcm
+// RUN: %clang_cc1 -fmodules -I %t -fsyntax-only %t/test.c -DC -fmodule-file=%t/C.pcm
+
+// Module-only version of the test with framework B.
+//
+// RUN: %clang_cc1 -fmodules -I %t -emit-module %t/D/module.modulemap -fmodule-name=D -o %t/D.pcm
+// RUN: %clang_cc1 -fmodules -I %t -fsyntax-only %t/test.c -DD -fmodule-file=%t/D.pcm
+
+// Transitively imported, but not exported.
+//
+// RUN: %clang_cc1 -fmodules -I %t -emit-module %t/E/module.modulemap -fmodule-name=E -o %t/E.pcm
+// RUN: %clang_cc1 -fmodules -I %t -fsyntax-only %t/test.c -DE -fmodule-file=%t/E.pcm
Index: clang/lib/Serialization/ASTWriter.cpp
===
--- clang/lib/Serialization/ASTWriter.cpp
+++ clang/lib/Serialization/ASTWriter.cpp
@@ -1696,7 +1696,7 @@
 std::pair
 EmitKeyDataLength(raw_ostream& Out, key_type_ref key, data_type_ref Data) {
   unsigned KeyLen = key.Filename.size() + 1 + 8 + 8;
-  unsigned DataLen = 1 + 2 + 4 + 4;
+  unsigned DataLen = 1 + 4 + 4;
   for (auto ModInfo : Data.KnownHeaders)
 if (Writer.getLocalOrImportedSubmoduleID(ModInfo.getModule()))
   DataLen += 4;
@@ -1728,7 +1728,6 @@
   | (Data.HFI.DirInfo << 1)
   | Data.HFI.IndexHeaderMapHeader;
   LE.write(Flags);
-  LE.write(Data.HFI.NumIncludes);
 
   if (!Data.HFI.ControllingMacro)
 LE.write(Data.HFI.ControllingMacroID);
@@ -2171,6 +2170,42 @@
   return false;
 }
 
+void ASTWriter::addIncludedFiles(
+const llvm::DenseMap ,
+RecordDataImpl ) {
+  // 

[PATCH] D112915: [clang][modules] Track number of includes per submodule

2021-11-09 Thread Jan Svoboda via Phabricator via cfe-commits
jansvoboda11 updated this revision to Diff 385751.
jansvoboda11 marked an inline comment as done.
jansvoboda11 added a comment.

Make clang-format happy.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D112915/new/

https://reviews.llvm.org/D112915

Files:
  clang/include/clang/Lex/ExternalPreprocessorSource.h
  clang/include/clang/Lex/HeaderSearch.h
  clang/include/clang/Lex/Preprocessor.h
  clang/include/clang/Serialization/ASTBitCodes.h
  clang/include/clang/Serialization/ASTReader.h
  clang/include/clang/Serialization/ASTWriter.h
  clang/include/clang/Serialization/ModuleFile.h
  clang/lib/Lex/HeaderSearch.cpp
  clang/lib/Lex/PPDirectives.cpp
  clang/lib/Lex/PPLexerChange.cpp
  clang/lib/Lex/Preprocessor.cpp
  clang/lib/Serialization/ASTReader.cpp
  clang/lib/Serialization/ASTWriter.cpp
  clang/test/Modules/import-submodule-visibility.c

Index: clang/test/Modules/import-submodule-visibility.c
===
--- /dev/null
+++ clang/test/Modules/import-submodule-visibility.c
@@ -0,0 +1,80 @@
+// This test checks that imports of headers that appeared in a different submodule than
+// what is imported by the current TU don't affect the compilation.
+
+// RUN: rm -rf %t
+// RUN: split-file %s %t
+
+//--- A.framework/Headers/A.h
+#include "Textual.h"
+//--- A.framework/Modules/module.modulemap
+framework module A { header "A.h" }
+
+//--- B.framework/Headers/B1.h
+#include "Textual.h"
+//--- B.framework/Headers/B2.h
+//--- B.framework/Modules/module.modulemap
+framework module B {
+  module B1 { header "B1.h" }
+  module B2 { header "B2.h" }
+}
+
+//--- C/C.h
+#include "Textual.h"
+//--- C/module.modulemap
+module C { header "C.h" }
+
+//--- D/D1.h
+#include "Textual.h"
+//--- D/D2.h
+//--- D/module.modulemap
+module D {
+  module D1 { header "D1.h" }
+  module D2 { header "D2.h" }
+}
+
+//--- Textual.h
+#define MACRO_TEXTUAL 1
+
+//--- test.c
+
+#ifdef A
+//
+#endif
+
+#ifdef B
+#import 
+#endif
+
+#ifdef C
+//
+#endif
+
+#ifdef D
+#import "D/D2.h"
+#endif
+
+#import "Textual.h"
+
+static int x = MACRO_TEXTUAL;
+
+// Specifying the PCM file on the command line (without actually importing "A") should not
+// prevent "Textual.h" to be included in the TU.
+//
+// RUN: %clang_cc1 -fmodules -I %t -emit-module %t/A.framework/Modules/module.modulemap -fmodule-name=A -o %t/A.pcm
+// RUN: %clang_cc1 -fmodules -I %t -fsyntax-only %t/test.c -DA -fmodule-file=%t/A.pcm -fmodule-map-file=%t/A.framework/Modules/module.modulemap
+
+// Specifying the PCM file on the command line and importing "B2" in the source does not
+// prevent "Textual.h" to be included in the TU.
+//
+// RUN: %clang_cc1 -fmodules -I %t -emit-module %t/B.framework/Modules/module.modulemap -fmodule-name=B -o %t/B.pcm
+// RUN: %clang_cc1 -fmodules -I %t -fsyntax-only %t/test.c -DB -iframework %t -fmodule-file=%t/B.pcm -fmodule-map-file=%t/B.framework/Modules/module.modulemap
+
+// Module-only version of the test with framework A.
+//
+// RUN: %clang_cc1 -fmodules -I %t -emit-module %t/C/module.modulemap -fmodule-name=C -o %t/C.pcm
+// RUN: %clang_cc1 -fmodules -I %t -fsyntax-only %t/test.c -DC -fmodule-file=%t/C.pcm -fmodule-map-file=%t/C/module.modulemap
+
+// Module-only version of the test with framework B.
+//
+// RUN: %clang_cc1 -fmodules -I %t -emit-module %t/D/module.modulemap -fmodule-name=D -o %t/D.pcm
+// RUN: %clang_cc1 -fmodules -I %t -fsyntax-only %t/test.c -DD -fmodule-file=%t/D.pcm -fmodule-map-file=%t/D/module.modulemap
Index: clang/lib/Serialization/ASTWriter.cpp
===
--- clang/lib/Serialization/ASTWriter.cpp
+++ clang/lib/Serialization/ASTWriter.cpp
@@ -1696,7 +1696,7 @@
 std::pair
 EmitKeyDataLength(raw_ostream& Out, key_type_ref key, data_type_ref Data) {
   unsigned KeyLen = key.Filename.size() + 1 + 8 + 8;
-  unsigned DataLen = 1 + 2 + 4 + 4;
+  unsigned DataLen = 1 + 4 + 4;
   for (auto ModInfo : Data.KnownHeaders)
 if (Writer.getLocalOrImportedSubmoduleID(ModInfo.getModule()))
   DataLen += 4;
@@ -1728,7 +1728,6 @@
   | (Data.HFI.DirInfo << 1)
   | Data.HFI.IndexHeaderMapHeader;
   LE.write(Flags);
-  LE.write(Data.HFI.NumIncludes);
 
   if (!Data.HFI.ControllingMacro)
 LE.write(Data.HFI.ControllingMacroID);
@@ -2171,6 +2170,35 @@
   return false;
 }
 
+void ASTWriter::addIncludedFiles(
+const llvm::DenseMap ,
+RecordDataImpl ) {
+  auto Vec = llvm::to_vector<0>(Files);
+
+  // Ensure the order is stable.
+  llvm::sort(Vec, [](const auto , const auto ) {
+return LHS.first->getName().compare(RHS.first->getName()) < 0;
+  });
+
+  // The first partition contains files included exactly once,
+  // the second partition contains files included more than once.
+  auto It = std::stable_partition(Vec.begin(), Vec.end(),
+  

[PATCH] D112915: [clang][modules] Track number of includes per submodule

2021-11-09 Thread Jan Svoboda via Phabricator via cfe-commits
jansvoboda11 marked 2 inline comments as done.
jansvoboda11 added a comment.

Thanks for your feedback @dexonsmith. I reworked the patch to use more sensible 
data structures as suggested, and made the AST deserialization lazy (on 
(sub)module import).

I think the only thing to figure out is the exact structure of the serialized 
information - whether we're fine with serializing all transitively included 
files in each AST file, or whether we'd like to fetch that information from 
other AST files instead.

I removed the WIP tag and would be happy to gather more feedback.




Comment at: clang/include/clang/Lex/Preprocessor.h:720-723
+  /// Efficient map with FileEntry keys.
+  template  class FileEntryMap {
+/// The underlying storage. Entries are indexed by FileEntry UID.
+std::vector UIDToValue;

dexonsmith wrote:
> I'm not sure about this choice. UIDs are unlikely to be adjacent and in 
> SubmoduleIncludeState, since a given submodule is unlikely to have "most" 
> files. Also, memory usage characteristics could be "weird" if the FileManager 
> is being reused between different compilations.
> 
> `DenseMap` will have the same number of allocations 
> (i.e., just 1). If UIDs really are dense and near each other in one of the 
> submodules then that map will be a little bigger than necessary (~2x), but it 
> should be better for the rest of them.
You're right that UIDs of files included in a single submodule are unlikely to 
have "most" files. Thanks for pointing that out.

In the latest revision, I switched to `llvm::DenseMap`.



Comment at: clang/lib/Lex/Preprocessor.cpp:1323
+auto  = SubmoduleIncludeStates[M].NumIncludes;
+for (unsigned UID = 0; UID <= ModNumIncludes.maxUID(); ++UID) {
+  CurSubmoduleIncludeState->NumIncludes[UID] += ModNumIncludes[UID];

dexonsmith wrote:
> jansvoboda11 wrote:
> > Iterating over all FileEntries is probably not very efficient, as Volodymyr 
> > mentioned. Thinking about how to make this more efficient...
> My suggestion above to drop FileEntryMap in favour of a simple DenseMap would 
> help a bit, just iterating through the files actually included by the 
> submodules.
> 
> Further, I wonder if "num-includes"/file/submodule (`unsigned`) is actually 
> useful, vs. "was-included"/file/submodule (`bool`). The only observer I see 
> is `HeaderSearch::PrintStats()` but maybe I missed something?
> 
> If I'm right and we can switch to `bool`, then NumIncludes becomes a 
> `DenseSet IncludedFiles` (or `DenseSet` for UIDs). 
> (BTW, I also wondered if you could rotate the map, using File as the outer 
> key, and then use bitsets for the sbumodules, but I doubt this is better, 
> since most files are only included by a few submodules, right?)
> 
> Then you can just do a set union here. Also simplifies bitcode serialization.
> 
> (If a `bool`/set is sufficient, then I'd suggest landing that 
> first/separately, before adding the per-submodule granularity in this patch.)
For each file, we need to have three distinct states: not included at all, 
included exactly once (`firstTimeLexingFile`), included more than once. This 
means we can't use a single `DenseSet`.

But we could use a `DenseMap`, where "not included at all" can be 
expressed as being absent from the map, exactly once as having `true` in the 
map and more than once as having `false` in the map. Alternatively, we could 
use two `DenseSet` instances to encode the same, but I don't think having two 
lookups per file to determine stuff is efficient.

I can look into this in a follow-up patch.



Comment at: clang/lib/Serialization/ASTWriter.cpp:2392-2393
+
+  // TODO: Use plain vector, submodule IDs are small and dense.
+  std::map> ToBeSerialized;
+

dexonsmith wrote:
> A vector of maps would be an improvement, but that'll still be a lot of 
> allocations.
> - Since insertion/lookup/deletion aren't intermingled, the simplest way to 
> avoid adding unnecessary overhead is a sorted vector 
> (https://llvm.org/docs/ProgrammersManual.html#dss-sortedvectormap).
> - With no lookups (at all), there's no benefit to a tiered data structure (vs 
> flat).
> 
> Leading me toward a simple flat vector + sort.
> ```
> lang=c++
> struct IncludeToSerialize { // Probably more straightforward than a 
> std::tuple...
>   StringRef Filename;
>   unsigned SMID;
>   unsigned NumIncludes;
> 
>   bool operator<(const IncludeToSerialize ) const {
> if (SMID != RHS.SMID)
>   return SMID < RHS.SMID;
> int Diff = Filename.compare(RHS.Filename);
> assert(Diff && "Expected unique SMID / Filename pairs");
> return Diff < 0;
>   }
> };
> SmallVector IncludesToSerialize;
> // ...
> IncludesToSerialize.push_back({Filename, LocalSMID, NumIncludes});
> // ...
> llvm::sort(IncludesToSerialize);
> for (const IncludeToSerialize  : IncludesToSerialize) {
>   // emit record
> }
> ```
> (Or if there are