[PATCH] D109632: [clang] de-duplicate methods from AST files

2021-10-26 Thread Richard Howell via Phabricator via cfe-commits
rmaz added a comment.

> I don't recall any issues on my last benchmark, but i'll run the patch across 
> all of our modular files and see if anything comes up.

I ran the patch against ~1500 sources here and there were no unexpected 
warnings or failures.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D109632/new/

https://reviews.llvm.org/D109632

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D109632: [clang] de-duplicate methods from AST files

2021-10-26 Thread Richard Howell via Phabricator via cfe-commits
rmaz added a comment.

In D109632#3088479 , @vsapsai wrote:

> Code-wise I'm not aware of any remaining issues. Still need to update the 
> commit message and to re-run the  clang test suite. But you can totally use 
> the patch for testing. I plan to update D110123 
>  for the review today/tomorrow.

Sounds good.

> In my limited internal testing I've seen a single extra warning due to 
> `[(id)specificObject commonMethodName]`. I've discussed it with other 
> Objective-C developers and the consensus is that with calling methods on `id` 
> you cannot predict which exactly method signature will be selected and the 
> recommended solution to cast `specificObject` to correct type with the known 
> method signature. It might be worth running a more extensive test and make 
> sure there are no unintended consequences. That will take me around a week or 
> slightly more.

I don't recall any issues on my last benchmark, but i'll run the patch across 
all of our modular files and see if anything comes up.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D109632/new/

https://reviews.llvm.org/D109632

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D109632: [clang] de-duplicate methods from AST files

2021-10-26 Thread Volodymyr Sapsai via Phabricator via cfe-commits
vsapsai added a comment.

In D109632#3086044 , @rmaz wrote:

> @vsapsai i'll abandon this diff then, thanks for your extensive feedback on 
> the approach. Is D110123  shippable 
> already, or are there some more corner cases to cover?

Code-wise I'm not aware of any remaining issues. Still need to update the 
commit message and to re-run the  clang test suite. But you can totally use the 
patch for testing. I plan to update D110123  
for the review today/tomorrow.

In my limited internal testing I've seen a single extra warning due to 
`[(id)specificObject commonMethodName]`. I've discussed it with other 
Objective-C developers and the consensus is that with calling methods on `id` 
you cannot predict which exactly method signature will be selected and the 
recommended solution to cast `specificObject` to correct type with the known 
method signature. It might be worth running a more extensive test and make sure 
there are no unintended consequences. That will take me around a week or 
slightly more.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D109632/new/

https://reviews.llvm.org/D109632

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D109632: [clang] de-duplicate methods from AST files

2021-10-25 Thread Richard Howell via Phabricator via cfe-commits
rmaz abandoned this revision.
rmaz added a comment.

@vsapsai i'll abandon this diff then, thanks for your extensive feedback on the 
approach. Is D110123  shippable already, or 
are there some more corner cases to cover?


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D109632/new/

https://reviews.llvm.org/D109632

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D109632: [clang] de-duplicate methods from AST files

2021-10-25 Thread Richard Howell via Phabricator via cfe-commits
rmaz added a comment.

In D109632#3085187 , @dexonsmith 
wrote:

> But another benefit of not double-storing transitively imported methods is 
> that it makes the PCMs more independent, tacking slightly closer to 
> "ImmediateDep1.pcm" being reproducible even when "SharedDep.pcm" adds a 
> method to the global pool. This is a nice property if it's not too expensive. 
> Looking at the numbers above, it doesn't look expensive; the relative 
> performance for @rmaz's motivating use case seems pretty small.
>
> @rmaz, will your goals be achieved by taking @vsapsai's approach? If so, I'm 
> leaning slightly that way.

We can definitely work with it. From my perspective the faster solution would 
be preferred, but if there are potential future benefits from not storing 
dependent modules methods in a pcm them we can ship this solution and look at 
other possible avenues for performance improvement to gain parity with the 
non-modular case.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D109632/new/

https://reviews.llvm.org/D109632

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D109632: [clang] de-duplicate methods from AST files

2021-10-25 Thread Duncan P. N. Exon Smith via Phabricator via cfe-commits
dexonsmith added a comment.

In D109632#3081580 , @vsapsai wrote:

> We are targeting the use case where in impl.m we have
>
>   @import ImmediateDep1;
>   @import ImmediateDep2;
>   ...
>   @import ImmediateDepN;
>
> and each of ImmediateDep has `@import SharedDep;`.
>
> For simplicity we'll consider that we are working with a single selector as 
> each of them is processed separately and independently. When clang encounters 
> a selector, it wants to collect methods corresponding to this selector from 
> all modules to a GlobalMethodPool. As in GlobalMethodPool  we store methods 
> in a list, adding a new method requires traversing this list. It is possible 
> to change the list to something else but this change isn't about that, in 
> `O(L*X)` complexity we are reducing `X` (which can be ~10) and not `L` 
> (which is closer to ~1000).
>
> In the baseline METHOD_POOL for each "ImmediateDep" contains methods from 
> "SharedDep". When we add methods to the GlobalMethodPool, we try to add 
> methods from all "ImmediateDep". As the result we iterate through 
> GlobalMethodPool method list multiple time for each method from "SharedDep" 
> as they are available in each "ImmediateDep".
>
> Richard's idea is to put a DenseSet of encountered methods in front of 
> GlobalMethodPool method list. This way duplicates from "SharedDep" can be 
> detected and discarded quickly so we traverse a list only for each unique 
> method and not for duplicates.
>
> My idea is not to store any duplicates in METHOD_POOL. This is achieved by 
> each module storing only their own methods and not storing any [transitively] 
> imported methods. In turn, populating GlobalMethodPool requires traversing 
> the full dependency graph once and loading methods from METHOD_POOL in 
> corresponding modules compared to traversing only immediate dependencies as 
> in the baseline.

Thanks for the explanation; this was helpful.

What's the relative storage cost between the two proposals? I assume small or 
it would have been mentioned.

But another benefit of not double-storing transitively imported methods is that 
it makes the PCMs more independent, tacking slightly closer to 
"ImmediateDep1.pcm" being reproducible even when "SharedDep.pcm" adds a method 
to the global pool. This is a nice property if it's not too expensive. Looking 
at the numbers above, it doesn't look expensive; the relative performance for 
@rmaz's motivating use case seems pretty small.

@rmaz, will your goals be achieved by taking @vsapsai's approach? If so, I'm 
leaning slightly that way.

> I believe MultiOnDiskHashTable isn't really applicable to this performance 
> issue. Even if we use it to store methods, it doesn't affect the amount of 
> methods we process. And right now processing all the duplicates is the 
> expensive part.

Thanks for explaining.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D109632/new/

https://reviews.llvm.org/D109632

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D109632: [clang] de-duplicate methods from AST files

2021-10-24 Thread Volodymyr Sapsai via Phabricator via cfe-commits
vsapsai added a comment.

Raw data with some additional visualization is available at 
https://observablehq.com/@vsapsai/method_pool-performance-comparison


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D109632/new/

https://reviews.llvm.org/D109632

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D109632: [clang] de-duplicate methods from AST files

2021-10-22 Thread Volodymyr Sapsai via Phabricator via cfe-commits
vsapsai added a comment.

We are targeting the use case where in impl.m we have

  @import ImmediateDep1;
  @import ImmediateDep2;
  ...
  @import ImmediateDepN;

and each of ImmediateDep has `@import SharedDep;`.

For simplicity we'll consider that we are working with a single selector as 
each of them is processed separately and independently. When clang encounters a 
selector, it wants to collect methods corresponding to this selector from all 
modules to a GlobalMethodPool. As in GlobalMethodPool  we store methods in a 
list, adding a new method requires traversing this list. It is possible to 
change the list to something else but this change isn't about that, in `O(L*X)` 
complexity we are reducing `X` (which can be ~10) and not `L` (which is 
closer to ~1000).

In the baseline METHOD_POOL for each "ImmediateDep" contains methods from 
"SharedDep". When we add methods to the GlobalMethodPool, we try to add methods 
from all "ImmediateDep". As the result we iterate through GlobalMethodPool 
method list multiple time for each method from "SharedDep" as they are 
available in each "ImmediateDep".

Richard's idea is to put a DenseSet of encountered methods in front of 
GlobalMethodPool method list. This way duplicates from "SharedDep" can be 
detected and discarded quickly so we traverse a list only for each unique 
method and not for duplicates.

My idea is not to store any duplicates in METHOD_POOL. This is achieved by each 
module storing only their own methods and not storing any [transitively] 
imported methods. In turn, populating GlobalMethodPool requires traversing the 
full dependency graph once and loading methods from METHOD_POOL in 
corresponding modules compared to traversing only immediate dependencies as in 
the baseline.

I believe MultiOnDiskHashTable isn't really applicable to this performance 
issue. Even if we use it to store methods, it doesn't affect the amount of 
methods we process. And right now processing all the duplicates is the 
expensive part.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D109632/new/

https://reviews.llvm.org/D109632

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D109632: [clang] de-duplicate methods from AST files

2021-10-21 Thread Duncan P. N. Exon Smith via Phabricator via cfe-commits
dexonsmith added a comment.

In D109632#3079745 , @vsapsai wrote:

> In D109632#3079455 , @rmaz wrote:
>
>> So given these numbers are we good to go ahead with set dedupe approach?
>
> I'd rather get an opinion on this from other reviewers. It's not purely a 
> numbers game, there can be other reasons to prefer one solution over another. 
> I am biased, so I don't think I can make this call. If reviewers want a short 
> summary instead of checking the entire discussion, we can prepare that.

Can you summarize how each of the two proposed new architectures differ from 
the baseline?

Note also that clang/lib/Serialization/MultiOnDiskHashTable.h is an option if 
there's a tradeoff between redundantly repeating information in module files 
(optimize for storage) and the number of module files that need to be visited 
(optimize for having information available). I think I mentioned it way up 
thread (indirectly, when talking about identifier tables), but I'm not sure if 
it was considered.

> If anybody is interested in the raw measurement data, I can provide it. The 
> data is suitable for calculating the statistical significance between 
> different methods using t-test.




Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D109632/new/

https://reviews.llvm.org/D109632

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D109632: [clang] de-duplicate methods from AST files

2021-10-21 Thread Volodymyr Sapsai via Phabricator via cfe-commits
vsapsai added a comment.

In D109632#3079455 , @rmaz wrote:

> So given these numbers are we good to go ahead with set dedupe approach?

I'd rather get an opinion on this from other reviewers. It's not purely a 
numbers game, there can be other reasons to prefer one solution over another. I 
am biased, so I don't think I can make this call. If reviewers want a short 
summary instead of checking the entire discussion, we can prepare that.

If anybody is interested in the raw measurement data, I can provide it. The 
data is suitable for calculating the statistical significance between different 
methods using t-test.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D109632/new/

https://reviews.llvm.org/D109632

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D109632: [clang] de-duplicate methods from AST files

2021-10-21 Thread Richard Howell via Phabricator via cfe-commits
rmaz added a comment.

I am seeing the clean build times behaving the same as the populated ones, just 
slower:

| *File*| *Baseline (s)* | *Set Dedupe* | *No External* |
| IGMFVC.mm | 230| 194  | 195   |
|

So given these numbers are we good to go ahead with set dedupe approach?


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D109632/new/

https://reviews.llvm.org/D109632

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D109632: [clang] de-duplicate methods from AST files

2021-10-21 Thread Volodymyr Sapsai via Phabricator via cfe-commits
vsapsai added a comment.

Curiously, the results for the second project are unexpected and bizarre.

| **File**| **Baseline (s)** | **Set Dedupe (s)** | **No external (s)** 
| **Set Dedupe (percentage of baseline)** | **No external (percentage of 
baseline)** |
| Project2. File1 | 0.378582125  | 0.318364375| 
0.318319875 | 84.094% | 84.082% 
 |
| Project2. File2 | 0.245813125  | 0.24931675| 
0.25345175 | 101.425%| 103.107% 
|
| Project2. File3 | 0.24643325  | 0.260554875| 
0.261508| 105.730%| 106.117%
 |
| Project2. File4 | 0.24745525  | 0.256896   | 0.253167625  
   | 103.815%| 102.308% 
|
| Project2. File5 | 0.2483135  | 0.256163875| 
0.2635545 | 103.161%| 106.138%  
   |
|

Standard deviation goes down again and doesn't exceed 0.02. This project is a 
big desktop app, so expect it to have long-isa dependency chains. Also, 
anecdotally, when I compared baseline and "no external" full project builds, 
"no external" was 20% faster.

My understanding is that due to the short compilation times the measurements 
are less reliable (haven't done anything heroic to make sure nothing else was 
running on my machine). And maybe these files are inherently more volatile 
because when I was selecting them for measurements, their compile time was >20s 
and now we have ~0.25s (with clean module cache it is still below 20s).


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D109632/new/

https://reviews.llvm.org/D109632

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D109632: [clang] de-duplicate methods from AST files

2021-10-20 Thread Volodymyr Sapsai via Phabricator via cfe-commits
vsapsai added a comment.

In D109632#3076648 , @rmaz wrote:

> In D109632#3076586 , @vsapsai wrote:
>
>> I'm curious to get the results for an empty module cache because clean 
>> builds are also important for us.
>
> I should measure this too. What would you suggest for the approach, to clean 
> the module cache before each build, retry and average? How much weight should 
> be given to the clean vs populated module cache numbers?

I am compiling just a single file and before each compilation I

1. Delete ModuleCache.noindex
2. Create ModuleCache.noindex directory
3. Create ModuleCache.noindex/Session.modulevalidation file

Then retry multiple times and average. Given my experience, calculating 
standard deviation can be useful too.

I'm wondering what your numbers will be but for me I think they are fairly 
meaningless due to high deviation. That's why I don't think it's worth 
assigning any weight to the clean module cache build (at least for individual 
files).


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D109632/new/

https://reviews.llvm.org/D109632

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D109632: [clang] de-duplicate methods from AST files

2021-10-20 Thread Volodymyr Sapsai via Phabricator via cfe-commits
vsapsai added a comment.

And for the empty module cache the numbers are

| **File**| **Baseline (s)** | **Set Dedupe (s)** | **No external (s)** 
| **Set Dedupe (percentage of baseline)** | **No external (percentage of 
baseline)** |
| Project1. File1 | 14.080673| 13.6296118   | 13.6954484
| 96.797% | 97.264% 
 |
| Project1. File2 | 15.2273466 | 14.6111798   | 
14.7935872| 95.954% | 97.151%   
   |
| Project1. File3 | 14.0047568 | 13.5988174   | 
13.6155526| 97.101% | 97.221%   
   |
| Project1. File4 | 13.0449874 | 12.5867316   | 
12.8005844| 96.487% | 98.126%   
   |
| Project1. File5 | 14.3756674 | 13.7928578   | 
13.7562042| 95.946% | 95.691%  
   |
|

Though in this case the variance is much higher than before. Roughly it is 10 
times higher and can reach 0.25. With such variance I think the difference 
between different fix approaches isn't meaningful and even comparing with the 
baseline is shaky. I haven't investigated further but suspect that the time of 
building modules dominates `addMethodToGlobalList` time and we cannot notice a 
meaningful difference.

Methodology note: take average of 5 compilations instead of 8 as each 
compilation is longer.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D109632/new/

https://reviews.llvm.org/D109632

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D109632: [clang] de-duplicate methods from AST files

2021-10-20 Thread Richard Howell via Phabricator via cfe-commits
rmaz added a comment.

In D109632#3076586 , @vsapsai wrote:

> Methodology: clear a modules cache, compile a file once to pre-populate the 
> cache, compile file 8 times and measure elapsed time, take the time average.

This is the same approach I used, although with 3 tries.

> So it looks like "no external" approach is slightly but consistently slower  
> than "set dedupe" approach.

This agrees with what I see with our code too.

> I'm curious to get the results for an empty module cache because clean builds 
> are also important for us.

I should measure this too. What would you suggest for the approach, to clean 
the module cache before each build, retry and average? How much weight should 
be given to the clean vs populated module cache numbers?


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D109632/new/

https://reviews.llvm.org/D109632

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D109632: [clang] de-duplicate methods from AST files

2021-10-20 Thread Volodymyr Sapsai via Phabricator via cfe-commits
vsapsai added a comment.

Sorry for the delay, I haven't finished testing more projects yet but here are 
my preliminary numbers

| **File**| **Baseline (s)** | **Set Dedupe (s)** | **No external (s)** 
| **Set Dedupe (percentage of baseline)** | **No external (percentage of 
baseline)** |
| Project1. File1 | 1.26146575  | 0.931755875| 
0.943674125 | 73.863% | 74.808% 
 |
| Project1. File2 | 1.444321875  | 1.0868395| 
1.095805875 | 75.249% | 75.870% 
 |
| Project1. File3 | 1.2475025  | 0.94032725| 
0.942943875 | 75.377% | 75.587% 
 |
| Project1. File4 | 1.197715 | 0.98331625| 0.9898685
 | 82.099% | 82.646%
  |
| Project1. File5 | 1.252481375  | 0.94339925| 
0.945766| 75.322% | 75.511% 
 |
|

Methodology: clear a modules cache, compile a file once to pre-populate the 
cache, compile file 8 times and measure elapsed time, take the time average. 
Didn't include files using a precompiled header because that makes cleaning a 
module cache trickier. Standard deviation of the measurements is consistent 
between different approaches and doesn't go above 0.015, so I haven't included 
it here. "Project1" is an app, so it should have a fairly long chain of 
dependencies, though I haven't checked it in details.

So it looks like "no external" approach is slightly but consistently slower  
than "set dedupe" approach. I'm curious to get the results for an empty module 
cache because clean builds are also important for us. And if you see anything 
suspicious in the methodology, please let me know.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D109632/new/

https://reviews.llvm.org/D109632

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D109632: [clang] de-duplicate methods from AST files

2021-10-14 Thread Richard Howell via Phabricator via cfe-commits
rmaz added a comment.

In D109632#3062608 , @vsapsai wrote:

> I can be wrong but I like writing less data as it can result in savings for 
> more projects. For example, if compile time is dominated by method 
> deserialization and not by `Sema::addMethodToGlobalList`, we still should see 
> an improvement. I can try to experiment on a bunch of other projects and 
> their heavy .m files and report the results, if you'd like.

I think that would be a useful comparison, yes. For our code I consistently 
measure this approach as 10-20% slower on average, with very few outliers where 
the set deduplication approach is slower.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D109632/new/

https://reviews.llvm.org/D109632

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D109632: [clang] de-duplicate methods from AST files

2021-10-13 Thread Volodymyr Sapsai via Phabricator via cfe-commits
vsapsai added a comment.

It's good to know `ASTReader::ReadMethodPool` is pretty close for both 
approaches. And as far as I understand, it includes the code

  ReadMethodPoolVisitor Visitor(*this, Sel, PriorGeneration);
  ModuleMgr.visit(Visitor);

so the module visiting doesn't seem to be too expensive.

I can be wrong but I like writing less data as it can result in savings for 
more projects. For example, if compile time is dominated by method 
deserialization and not by `Sema::addMethodToGlobalList`, we still should see 
an improvement. I can try to experiment on a bunch of other projects and their 
heavy .m files and report the results, if you'd like.

Also I've updated D110123  to preserve the 
order of methods (mostly, there are still some differences). Works better than 
the previous approach, compilation time difference is within the noise.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D109632/new/

https://reviews.llvm.org/D109632

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D109632: [clang] de-duplicate methods from AST files

2021-10-08 Thread Richard Howell via Phabricator via cfe-commits
rmaz added a comment.

In D109632#3046976 , @vsapsai wrote:

> These are interesting results. I'm curious to measure the time spent in 
> `ASTReader::ReadMethodPool`.

Updated numbers from today, looks like we added a lot more modular deps since 
last week.

| **Method**  | **# Module File Lookups** | **# Module File Hits** | **# 
Instance Methods Deserialized** | **# Class Methods Deserialized** | **ms in 
`ASTReader::ReadMethodPool`** |
| Baseline| 139140| 577| 254304 
 | 119907   | 15759 
|
| Set Dedupe  | 139140| 577| 38815  
 | 19125| 421   
|
| No external | 157677| 1595   | 5036   
 | 2428 | 430   
|
|

The time taken in `ASTReader::ReadMethodPool` ends up very close. This was 
previously dominated by `Sema::addMethodToGlobalList`, and both approaches will 
end up with the same number of calls to this method, so maybe this is not so 
surprising. The difference comes down to: is it faster to descend into all 
build dependent modules to deserialize only the methods you need or is it 
faster to deserialize from the first module with overlap and skip the duplicate 
inserts with a hashmap lookup. The numbers would suggest the latter, and this 
approach also has the benefit of requiring no other changes as the method 
insert order to the global list should be identical.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D109632/new/

https://reviews.llvm.org/D109632

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D109632: [clang] de-duplicate methods from AST files

2021-10-06 Thread Volodymyr Sapsai via Phabricator via cfe-commits
vsapsai added a comment.

These are interesting results. I'm curious to measure the time spent in 
`ASTReader::ReadMethodPool`.

I'm planning to preserve the order of methods roughly by 
`s/InstanceMethods.append/InstanceMethods.prepend/` in `ReadMethodPoolVisitor`. 
`isAcceptableMethodMismatch` is fixable but "id" can have multiple methods 
applicable and existing source code relies on specific methods to be first in 
the methods' lists. Changing the order would break existing working code and we 
cannot afford that. Curious to see how the preserved order might impact the 
performance.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D109632/new/

https://reviews.llvm.org/D109632

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D109632: [clang] de-duplicate methods from AST files

2021-10-04 Thread Richard Howell via Phabricator via cfe-commits
rmaz added a comment.

In D109632#3037501 , @vsapsai wrote:

> My assumption was that all dependent modules are in memory at this point. And 
> we visit transitive modules only once, so I don't expect it to be a big 
> performance hit (though I can be wrong). And deserializing methods from 
> different modules shouldn't be more work because we are deserializing fewer 
> methods than with "set dedupe".

I added some stats collection for the number of methods deserialized, here are 
the results from the slowest file in the above table:

| **Method**  | **# Module File Lookups** | **# Module File Hits** | **# 
Instance Methods Deserialized** | **# Class Methods Deserialized** |
| Baseline| 34023 | 129| 82005  
 | 29298|
| Set Dedupe  | 34023 | 129| 13743  
 | 6781 |
| No external | 45265 | 960| 4010   
 | 1704 |
|

So while the no external approach has ~3.5x fewer methods to deserialize, it 
has to visit ~7.5x as many module files to deserialize methods from.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D109632/new/

https://reviews.llvm.org/D109632

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D109632: [clang] de-duplicate methods from AST files

2021-10-01 Thread Volodymyr Sapsai via Phabricator via cfe-commits
vsapsai added a comment.

In D109632#3033489 , @rmaz wrote:

> In D109632#3032381 , @vsapsai wrote:
>
>> What's interesting, I was able to trigger more diagnostic. Specifically, I 
>> got warnings about `length` ambiguity because in NSStatusItem it is CGFloat 
>> ,
>>  while in NSString it is NSUInteger. I also had more diagnostic that's 
>> unclear how it is triggered. It can be a project issue or a bug somewhere 
>> else, need to check it more carefully.
>
> Yes, I also had a couple of files fail to compile due to mismatched (or 
> differently matched) selectors.

Diagnostic about `-length` is due to isAcceptableMethodMismatch 

 returning different results depending on the order of methods. Probably other 
diagnostic is caused by a similar issue. Will need to add tests and to fix 
`isAcceptableMethodMismatch`.

>> In theory, "set dedupe" approach should be slower than "no external" as we 
>> are iterating through shared dependencies. But in practice it isn't, which 
>> is puzzling. My current theory is that "no external" also feeds into 
>> correctness, so we might be doing more [correct] method overloading checks.
>
> Well, wouldn't the tradeoff there be that we now have to descend into all 
> dependent modules during selector lookup when we didn't have to previously? 
> And this would do more work as only a small subset of those modules would 
> have a selector match, which in the current case has already been handled and 
> serialized on a single method list.

My assumption was that all dependent modules are in memory at this point. And 
we visit transitive modules only once, so I don't expect it to be a big 
performance hit (though I can be wrong). And deserializing methods from 
different modules shouldn't be more work because we are deserializing fewer 
methods than with "set dedupe". Maybe the order of methods matters? I can see 
us short circuiting in some cases but not the others. Though that's not a very 
plausible explanation for 20-25% discrepancy in compilation time.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D109632/new/

https://reviews.llvm.org/D109632

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D109632: [clang] de-duplicate methods from AST files

2021-09-30 Thread Richard Howell via Phabricator via cfe-commits
rmaz added a comment.

In D109632#3032381 , @vsapsai wrote:

> What's interesting, I was able to trigger more diagnostic. Specifically, I 
> got warnings about `length` ambiguity because in NSStatusItem it is CGFloat 
> ,
>  while in NSString it is NSUInteger. I also had more diagnostic that's 
> unclear how it is triggered. It can be a project issue or a bug somewhere 
> else, need to check it more carefully.

Yes, I also had a couple of files fail to compile due to mismatched (or 
differently matched) selectors.

> In theory, "set dedupe" approach should be slower than "no external" as we 
> are iterating through shared dependencies. But in practice it isn't, which is 
> puzzling. My current theory is that "no external" also feeds into 
> correctness, so we might be doing more [correct] method overloading checks.

Well, wouldn't the tradeoff there be that we now have to descend into all 
dependent modules during selector lookup when we didn't have to previously? And 
this would do more work as only a small subset of those modules would have a 
selector match, which in the current case has already been handled and 
serialized on a single method list.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D109632/new/

https://reviews.llvm.org/D109632

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D109632: [clang] de-duplicate methods from AST files

2021-09-29 Thread Volodymyr Sapsai via Phabricator via cfe-commits
vsapsai added a comment.

I've updated D110123  to be the way I was 
planning it to be and I was testing locally with it. So far, with this change 
the speedup over a baseline is ~20%, though measurements aren't super rigorous. 
I.e., no multiple runs to warm up the caches, no control for other background 
activity. At least, with this precision can claim there is no slow down in 
compilation time.

What's interesting, I was able to trigger more diagnostic. Specifically, I got 
warnings about `length` ambiguity because in NSStatusItem it is CGFloat 
, 
while in NSString it is NSUInteger. I also had more diagnostic that's unclear 
how it is triggered. It can be a project issue or a bug somewhere else, need to 
check it more carefully.

In theory, "set dedupe" approach should be slower than "no external" as we are 
iterating through shared dependencies. But in practice it isn't, which is 
puzzling. My current theory is that "no external" also feeds into correctness, 
so we might be doing more [correct] method overloading checks.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D109632/new/

https://reviews.llvm.org/D109632

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D109632: [clang] de-duplicate methods from AST files

2021-09-28 Thread Richard Howell via Phabricator via cfe-commits
rmaz added a comment.

Avoiding serializing external methods in `ASTMethodPoolTrait` is working 
successfully, although the times are not as good as using set deduplication:

| **file**   | **baseline** | **set dedupe** | **no external** |
| Total  | 149.68   | 92.97  | 109.83  |
| IGMFVC.mm  | 25.75| 5.38   | 6.35|
| IGMFLADS.m | 15.97| 3.30   | 4.61|
| IGIMFVC.mm | 6.70 | 2.98   | 3.77|
| IGSFCVC.m  | 4.57 | 1.53   | 1.87|
| IGVFVC.mm  | 4.19 | 1.61   | 1.85|
| IGMFAHC.mm | 4.00 | 1.93   | 2.41|
| IGMFLCVC.m | 3.89 | 1.55   | 1.96|
| PLRFPPC.mm | 3.19 | 3.26   | 4.05|
| IGBFPVC.m  | 3.18 | 1.11   | 1.36|
|

Where the no external numbers are using D110648 



Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D109632/new/

https://reviews.llvm.org/D109632

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D109632: [clang] de-duplicate methods from AST files

2021-09-28 Thread Richard Howell via Phabricator via cfe-commits
rmaz added a comment.

In D109632#3022209 , @vsapsai wrote:

> That patch looks correct, I was experimenting with exactly the same local 
> change. Have you tried D110123  on your 
> original build? In my testing with synthetic test case it looks as good as 
> set deduplication.

I am getting some pretty unexpected results from this approach. For the files I 
could get to complete below are the compilation times for each approach for the 
10 slowest (in the baseline) files:

| **file**   | **baseline** | **set deduplication** | **no external methods in 
pcm** |
| Total  | 95.57| 75.30 | 306.46
 |
| IGSFCVC.m  | 4.57 | 1.53  | 13.71 
 |
| IGVFVC.mm  | 4.19 | 1.61  | 27.93 
 |
| IGMFAHC.mm | 4.00 | 1.93  | 9.94  
 |
| IGMFLCVC.m | 3.89 | 1.55  | 12.30 
 |
| PLRFPPC.mm | 3.19 | 3.26  | 3.99  
 |
| IGBFPVC.m  | 3.18 | 1.11  | 28.92 
 |
| IGBVC.m| 2.37 | 0.98  | 17.42 
 |
| PLRRM.mm   | 1.99 | 1.99  | 2.31  
 |
| IGBSPSC.m  | 1.88 | 0.85  | 14.38 
 |
|

All of the most interesting files were left out of these results as I could not 
get compilation to complete for the no external methods approach.

> I still think it is worth to pursue writing only owned methods in 
> `METHOD_POOL`. Deduplication with a DenseSet works but I still expect it to 
> be more expensive than avoiding duplicates by construction. As for the next 
> step I was thinking about changing `ASTMethodPoolTrait` to skip transitive 
> methods during serialization and it should help with eliminating all the 
> duplicates. With that change we can test and see how it works.

I will take a go at this approach to eliminate all external methods and compare 
to see if this exhibits similar behaviour.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D109632/new/

https://reviews.llvm.org/D109632

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D109632: [clang] de-duplicate methods from AST files

2021-09-24 Thread Volodymyr Sapsai via Phabricator via cfe-commits
vsapsai added a comment.

In D109632#3021644 , @rmaz wrote:

> @vsapsai just checking on where we are with this one. Is the current plan to 
> investigate an approach only serializing the methods declared in each module, 
> or are we good to go ahead with the set de-duplication approach? I tried 
> profiling with D110123  and with the 
> following patch:
>
>   diff --git a/clang/lib/Serialization/ASTReader.cpp 
> b/clang/lib/Serialization/ASTReader.cpp
>   index 279067a8ac8b..aaaca0aff9ab 100644
>   --- a/clang/lib/Serialization/ASTReader.cpp
>   +++ b/clang/lib/Serialization/ASTReader.cpp
>   @@ -8130,7 +8130,7 @@ namespace serialization {
>  FactoryBits = Data.FactoryBits;
>  InstanceHasMoreThanOneDecl = Data.InstanceHasMoreThanOneDecl;
>  FactoryHasMoreThanOneDecl = Data.FactoryHasMoreThanOneDecl;
>   -  return true;
>   +  return false;
>}
>
> but this was a fair bit slower than the deduplication approach, and for some 
> files would never complete, presumably stuck in an infinite loop of dependent 
> modules. Is there a way to exclude the serialization of methods from 
> dependent modules but make the method lookup more efficient somehow?

That patch looks correct, I was experimenting with exactly the same local 
change. Have you tried D110123  on your 
original build? In my testing with synthetic test case it looks as good as set 
deduplication. In its current state D110123  
isn't really representative for actual code bases because it relies on method 
lists of specific shape (all non-transitive methods coming after all transitive 
methods). You can try to add deduplication stats to see that there is less 
deduplication but it's not entirely eliminated. At least that's my 
understanding which can be flawed. Though the fact that clang didn't complete 
for some files is super strange and concerning, have no idea why is it 
happening and will need to experiment on a real-life project.

I still think it is worth to pursue writing only owned methods in 
`METHOD_POOL`. Deduplication with a DenseSet works but I still expect it to be 
more expensive than avoiding duplicates by construction. As for the next step I 
was thinking about changing `ASTMethodPoolTrait` to skip transitive methods 
during serialization and it should help with eliminating all the duplicates. 
With that change we can test and see how it works.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D109632/new/

https://reviews.llvm.org/D109632

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D109632: [clang] de-duplicate methods from AST files

2021-09-24 Thread Volodymyr Sapsai via Phabricator via cfe-commits
vsapsai added a comment.

In D109632#3013620 , @dexonsmith 
wrote:

> In D109632#3013523 , @vsapsai wrote:
>
>> 2. Serialize only methods owned by the current module (and change 
>> `ReadMethodPoolVisitor` appropriately).
>
> Would that require visiting all in-memory modules every time there's a global 
> method pool lookup?
>
> If so, that sounds expensive... and similar to the problem that had to be 
> solved to make identifier lookup fast. Maybe the same approach can/should be 
> used here?
>
> Or if not, can you clarify for me?

We use `ReadMethodPoolVisitor` only from `ASTReader::ReadMethodPool`. I don't 
know all the cases when it is called but it is called at least on encountering 
`@end` to handle methods declared in an interface. With the proposed change the 
behavior is the following:

- the first time we encounter a selector, we walk the entire transitive closure 
of dependencies
- when `ReadMethodPool` is called again for the same selector, we check only 
immediate dependencies but exit early and don't drill into transitive 
dependencies due to

  // If we've already searched this module file, skip it now.
  if (M.Generation <= PriorGeneration)
return true;

- when we add a new dependency and trigger `ReadMethodPool` again, we visit 
only the added dependency and its unvisited transitive dependencies. We exit 
early for already visited modules.

So from the module visiting perspective there are no regressions. I'm not super 
happy with repeated early returns for immediate dependencies but we were doing 
it already. This behavior seems to be orthogonal and we can address it 
separately if needed.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D109632/new/

https://reviews.llvm.org/D109632

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D109632: [clang] de-duplicate methods from AST files

2021-09-24 Thread Richard Howell via Phabricator via cfe-commits
rmaz added a comment.

@vsapsai just checking on where we are with this one. Is the current plan to 
investigate an approach only serializing the methods declared in each module, 
or are we good to go ahead with the set de-duplication approach? I tried 
profiling with D110123  and with the 
following patch:

  diff --git a/clang/lib/Serialization/ASTReader.cpp 
b/clang/lib/Serialization/ASTReader.cpp
  index 279067a8ac8b..aaaca0aff9ab 100644
  --- a/clang/lib/Serialization/ASTReader.cpp
  +++ b/clang/lib/Serialization/ASTReader.cpp
  @@ -8130,7 +8130,7 @@ namespace serialization {
 FactoryBits = Data.FactoryBits;
 InstanceHasMoreThanOneDecl = Data.InstanceHasMoreThanOneDecl;
 FactoryHasMoreThanOneDecl = Data.FactoryHasMoreThanOneDecl;
  -  return true;
  +  return false;
   }

but this was a fair bit slower than the deduplication approach, and for some 
files would never complete, presumably stuck in an infinite loop of dependent 
modules. Is there a way to exclude the serialization of methods from dependent 
modules but make the method lookup more efficient somehow?


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D109632/new/

https://reviews.llvm.org/D109632

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D109632: [clang] de-duplicate methods from AST files

2021-09-21 Thread Duncan P. N. Exon Smith via Phabricator via cfe-commits
dexonsmith added a comment.

In D109632#3013523 , @vsapsai wrote:

> 2. Serialize only methods owned by the current module (and change 
> `ReadMethodPoolVisitor` appropriately).

Would that require visiting all in-memory modules every time there's a global 
method pool lookup?

If so, that sounds expensive... and similar to the problem that had to be 
solved to make identifier lookup fast. Maybe the same approach can/should be 
used here?

Or if not, can you clarify for me?


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D109632/new/

https://reviews.llvm.org/D109632

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D109632: [clang] de-duplicate methods from AST files

2021-09-21 Thread Volodymyr Sapsai via Phabricator via cfe-commits
vsapsai added a comment.

In D109632#3012647 , @rmaz wrote:

>> What folks are thinking about writing less in METHOD_POOL?
>
> I prefer the idea of it, but I think the `ReadMethodPoolVisitor` also has to 
> be changed for this to work. When it finds a selector in a module it will 
> return true, which causes the search to stop descending into dependent 
> modules:
>
>   if (!Visitor(*CurrentModule))
> continue;
>   
>   // The visitor has requested that cut off visitation of any
>   // module that the current module depends on. To indicate this
>   // behavior, we mark all of the reachable modules as having been visited.
>
> Wouldn't this logic have to be changed to ensure we pick up all the 
> transitive methods from dependent modules?

You are right, suggested implementation is insufficient and I've added a test 
to https://reviews.llvm.org/D110123 to catch it.

Right now `MODULE_POOL` seems to be caching all transitive modules but it does 
so poorly in case of shared dependencies and sometimes the cache is empty 
forcing you to collect methods from imported modules. So it looks like we are 
somewhere in the middle on caching --- non-caching spectrum. I'd prefer to go 
entirely to one of the ends and for that I see 2 options:

1. Deduplicate methods from shared dependencies when reading method pools from 
imported modules.
2. Serialize only methods owned by the current module (and change 
`ReadMethodPoolVisitor` appropriately).

I think the option 2 would be better as it seems to be easier to understand. 
Implementation complexity seems to be comparable, runtime seems to be better 
for option 2. Also given that collecting methods from a module is basically 
`InstanceMethods.append(Data.Instance.begin(), Data.Instance.end())`, I don't 
think caching methods from transitive dependencies saves us processing time.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D109632/new/

https://reviews.llvm.org/D109632

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D109632: [clang] de-duplicate methods from AST files

2021-09-21 Thread Richard Howell via Phabricator via cfe-commits
rmaz added a comment.

> What folks are thinking about writing less in METHOD_POOL?

I prefer the idea of it, but I think the `ReadMethodPoolVisitor` also has to be 
changed for this to work. When it finds a selector in a module it will return 
true, which causes the search to stop descending into dependent modules:

  if (!Visitor(*CurrentModule))
continue;
  
  // The visitor has requested that cut off visitation of any
  // module that the current module depends on. To indicate this
  // behavior, we mark all of the reachable modules as having been visited.

Wouldn't this logic have to be changed to ensure we pick up all the transitive 
methods from dependent modules?


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D109632/new/

https://reviews.llvm.org/D109632

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D109632: [clang] de-duplicate methods from AST files

2021-09-20 Thread Volodymyr Sapsai via Phabricator via cfe-commits
vsapsai added a comment.

Looks like in `METHOD_POOL` we don't need the transitive closure for 
performance reasons. And we are kinda trying to store only data the module owns

>   // Only write this selector if it's not in an existing AST or something
>   // changed.

But even if a single method is in the module we are building, we'd serialize 
entire list of methods for this selector. I've tried a proof of concept 
https://reviews.llvm.org/D110123 to see what happens if we are more aggressive 
with skipping methods coming from other modules. It's not entirely correct as 
we are missing some transitive methods and I have doubts it is the best 
implementation. But for me it cuts down the compilation time of the synthetic 
test case and we aren't dealing with any duplicates, so that seems like a 
viable approach. In bad news this change doesn't fail any tests, so looks like 
this area isn't sufficiently well tested.

What folks are thinking about writing less in `METHOD_POOL`?


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D109632/new/

https://reviews.llvm.org/D109632

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D109632: [clang] de-duplicate methods from AST files

2021-09-20 Thread Duncan P. N. Exon Smith via Phabricator via cfe-commits
dexonsmith added inline comments.



Comment at: clang/include/clang/Sema/Sema.h:1434-1436
+bool addMethod(ObjCMethodDecl *Method) {
+  return AddedMethods.insert(Method).second;
+}

rmaz wrote:
> rmaz wrote:
> > dexonsmith wrote:
> > > Hmm, I was imagining that the set would be more encapsulated than this, 
> > > not just stored in the same place.
> > > 
> > > I'm wondering if the following could be done in a prep commit:
> > > 
> > > - Change Sema::addMethodToGlobalList to a private member function of 
> > > GlobalMethodPool.
> > > - Make GlobalMethodPool::insert private
> > > - Add `GlobalMethodPool::addMethod(ObjCMethodDecl*,bool,bool)`, which 
> > > does the second half of Sema::AddMethodToGlobalPool (the parts that don't 
> > > need Sema's other fields), and change the latter to call the former.
> > > 
> > > WDYT?
> > Definitely neater, will take a look at this later today.
> This might need a slightly different approach, as for the insert use case we 
> have:
> 
> ```lang=cxx
> Sema  = *getSema();
>   Sema::GlobalMethodPool::iterator Pos =
>   S.MethodPool.insert(std::make_pair(Sel, 
> Sema::GlobalMethodPool::Lists()))
>   .first;
> 
>   Pos->second.first.setBits(Visitor.getInstanceBits());
>   
> Pos->second.first.setHasMoreThanOneDecl(Visitor.instanceHasMoreThanOneDecl());
>   Pos->second.second.setBits(Visitor.getFactoryBits());
>   
> Pos->second.second.setHasMoreThanOneDecl(Visitor.factoryHasMoreThanOneDecl());
> 
>   // Add methods to the global pool *after* setting hasMoreThanOneDecl, since
>   // when building a module we keep every method individually and may need to
>   // update hasMoreThanOneDecl as we add the methods.
>   addMethodsToPool(S, Visitor.getInstanceMethods(), Pos->second.first);
>   addMethodsToPool(S, Visitor.getFactoryMethods(), Pos->second.second);
> ```
> 
> At the moment we fetch a method list, modify its state and then start 
> inserting the methods. If we move to something like 
> `MethodPool.addMethod(ObjCMethodDecl *)` we will have to look up the method 
> list for each insert, and we would need extra methods to set the state first 
> on the method list. How about something like 
> `MethodPool.addMethods(ArrayRef methods, unsigned 
> instanceBits, bool hasMoreThanOneDecl)`? Then we only need two list lookups 
> per selector as before and we can handle the list state update before insert.
>  
That seems reasonable to me.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D109632/new/

https://reviews.llvm.org/D109632

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D109632: [clang] de-duplicate methods from AST files

2021-09-20 Thread Duncan P. N. Exon Smith via Phabricator via cfe-commits
dexonsmith added a comment.

In D109632#3007615 , @vsapsai wrote:

> I don't remember for sure but I don't think there is a consistent policy 
> about a module storing transitive data or only data it owns. I suspect we 
> might be using both approaches and need to check each case separately.

I imagine we ideally want only owned data, except where there's a significant 
performance benefit for caching transitive data. I know identifier lookup 
tables include transitive data if there are "enough" transitive modules to 
justify it (see @rsmith's commits from 6-8 years ago related to multi-file 
on-disk hash tables).


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D109632/new/

https://reviews.llvm.org/D109632

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D109632: [clang] de-duplicate methods from AST files

2021-09-17 Thread Volodymyr Sapsai via Phabricator via cfe-commits
vsapsai added a comment.

In D109632#3006520 , @rmaz wrote:

> The case we have is more like:
>
>   .m   -> A -> long list of partially shared deps -> Foundation
>-> B -> long list of partially shared deps -> Foundation
>-> C -> long list of partially shared deps -> Foundation
> * a few hundred
>
> So we have a file that imports a lot of modules, in the hundreds. Each of 
> those modules has multiple ObjC interfaces with `-(id)init NS_UNAVAILABLE` 
> and imports Foundation, UIKit and also a large number of libraries that are 
> shared across the top level imports. This will result in A.pcm, B.pcm and 
> C.pcm including hundreds or thousands of init decls that are the same, from 
> system frameworks or whatever modules are shared between the top level 
> imports.
>
> IIUC the code currently serializes the entire ObjCMethodList for a module for 
> every declared method, including the methods that are not part of that 
> module. When deserializing we don't descend into module dependencies as the 
> entire method list would already be deserialized, but that doesn't help for 
> modules that aren't directly dependent. Is this right? If so it seems another 
> approach could be to only serialize the methods declared in that module 
> itself, and during deserialization we would have to load the methods from all 
> dependent modules.

I have created a different test synthesizer 
synthesize-shared-framework-chain-test.py to reproduce the described framework 
layout. In `addMethodsToPool` added a set `seen` to count how many methods we 
can skip and we don't skip anything in the modules but there are a lot of 
duplicates in .m file compilation. Also I've noticed that the size of .pcm 
increases the longer chain of modules is reachable from the module (19K for 
Shared0.pcm vs 38K for Shared49.pcm). Haven't checked if it is because of 
ObjCMethodList or for another reason.

I don't remember for sure but I don't think there is a consistent policy about 
a module storing transitive data or only data it owns. I suspect we might be 
using both approaches and need to check each case separately.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D109632/new/

https://reviews.llvm.org/D109632

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D109632: [clang] de-duplicate methods from AST files

2021-09-17 Thread Richard Howell via Phabricator via cfe-commits
rmaz added inline comments.



Comment at: clang/include/clang/Sema/Sema.h:1434-1436
+bool addMethod(ObjCMethodDecl *Method) {
+  return AddedMethods.insert(Method).second;
+}

rmaz wrote:
> dexonsmith wrote:
> > Hmm, I was imagining that the set would be more encapsulated than this, not 
> > just stored in the same place.
> > 
> > I'm wondering if the following could be done in a prep commit:
> > 
> > - Change Sema::addMethodToGlobalList to a private member function of 
> > GlobalMethodPool.
> > - Make GlobalMethodPool::insert private
> > - Add `GlobalMethodPool::addMethod(ObjCMethodDecl*,bool,bool)`, which does 
> > the second half of Sema::AddMethodToGlobalPool (the parts that don't need 
> > Sema's other fields), and change the latter to call the former.
> > 
> > WDYT?
> Definitely neater, will take a look at this later today.
This might need a slightly different approach, as for the insert use case we 
have:

```lang=cxx
Sema  = *getSema();
  Sema::GlobalMethodPool::iterator Pos =
  S.MethodPool.insert(std::make_pair(Sel, Sema::GlobalMethodPool::Lists()))
  .first;

  Pos->second.first.setBits(Visitor.getInstanceBits());
  Pos->second.first.setHasMoreThanOneDecl(Visitor.instanceHasMoreThanOneDecl());
  Pos->second.second.setBits(Visitor.getFactoryBits());
  Pos->second.second.setHasMoreThanOneDecl(Visitor.factoryHasMoreThanOneDecl());

  // Add methods to the global pool *after* setting hasMoreThanOneDecl, since
  // when building a module we keep every method individually and may need to
  // update hasMoreThanOneDecl as we add the methods.
  addMethodsToPool(S, Visitor.getInstanceMethods(), Pos->second.first);
  addMethodsToPool(S, Visitor.getFactoryMethods(), Pos->second.second);
```

At the moment we fetch a method list, modify its state and then start inserting 
the methods. If we move to something like `MethodPool.addMethod(ObjCMethodDecl 
*)` we will have to look up the method list for each insert, and we would need 
extra methods to set the state first on the method list. How about something 
like `MethodPool.addMethods(ArrayRef methods, unsigned 
instanceBits, bool hasMoreThanOneDecl)`? Then we only need two list lookups per 
selector as before and we can handle the list state update before insert.
 


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D109632/new/

https://reviews.llvm.org/D109632

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D109632: [clang] de-duplicate methods from AST files

2021-09-17 Thread Richard Howell via Phabricator via cfe-commits
rmaz added inline comments.



Comment at: clang/include/clang/Sema/Sema.h:1434-1436
+bool addMethod(ObjCMethodDecl *Method) {
+  return AddedMethods.insert(Method).second;
+}

dexonsmith wrote:
> Hmm, I was imagining that the set would be more encapsulated than this, not 
> just stored in the same place.
> 
> I'm wondering if the following could be done in a prep commit:
> 
> - Change Sema::addMethodToGlobalList to a private member function of 
> GlobalMethodPool.
> - Make GlobalMethodPool::insert private
> - Add `GlobalMethodPool::addMethod(ObjCMethodDecl*,bool,bool)`, which does 
> the second half of Sema::AddMethodToGlobalPool (the parts that don't need 
> Sema's other fields), and change the latter to call the former.
> 
> WDYT?
Definitely neater, will take a look at this later today.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D109632/new/

https://reviews.llvm.org/D109632

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D109632: [clang] de-duplicate methods from AST files

2021-09-17 Thread Richard Howell via Phabricator via cfe-commits
rmaz added a comment.

In D109632#3005512 , @vsapsai wrote:

> Thanks for the explanation! I'm still curious to reproduce the problem 
> locally and have created a test case generator 
> https://gist.github.com/vsapsai/f9d3603dde95eebd23248da4d7b4f5ec It creates a 
> chain of .m -> Synthesized9 -> Synthesized8 -> Synthesized7 ->... Does it 
> represent the structure of the code you are dealing with?

The case we have is more like:

  .m   -> A -> long list of partially shared deps -> Foundation
   -> B -> long list of partially shared deps -> Foundation
   -> C -> long list of partially shared deps -> Foundation
    * a few hundred

So we have a file that imports a lot of modules, in the hundreds. Each of those 
modules has multiple ObjC interfaces with `-(id)init NS_UNAVAILABLE` and 
imports Foundation, UIKit and also a large number of libraries that are shared 
across the top level imports. This will result in A.pcm, B.pcm and C.pcm 
including hundreds or thousands of init decls that are the same, from system 
frameworks or whatever modules are shared between the top level imports.

IIUC the code currently serializes the entire ObjCMethodList for a module for 
every declared method, including the methods that are not part of that module. 
When deserializing we don't descend into module dependencies as the entire 
method list would already be deserialized, but that doesn't help for modules 
that aren't directly dependent. Is this right? If so it seems another approach 
could be to only serialize the methods declared in that module itself, and 
during deserialization we would have to load the methods from all dependent 
modules.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D109632/new/

https://reviews.llvm.org/D109632

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D109632: [clang] de-duplicate methods from AST files

2021-09-16 Thread Volodymyr Sapsai via Phabricator via cfe-commits
vsapsai added a comment.

In D109632#3000709 , @rmaz wrote:

> The speedup is coming from reducing the number of times 
> `Sema::addMethodToGlobalList` is called when looking up methods loaded from 
> modules. From what I can see each serialized module will contain an 
> ObjCMethodList with all the methods from all visible modules at the point of 
> serialization.

Thanks for the explanation! I'm still curious to reproduce the problem locally 
and have created a test case generator 
https://gist.github.com/vsapsai/f9d3603dde95eebd23248da4d7b4f5ec It creates a 
chain of .m -> Synthesized9 -> Synthesized8 -> Synthesized7 ->... Does it 
represent the structure of the code you are dealing with?


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D109632/new/

https://reviews.llvm.org/D109632

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D109632: [clang] de-duplicate methods from AST files

2021-09-16 Thread Duncan P. N. Exon Smith via Phabricator via cfe-commits
dexonsmith added inline comments.



Comment at: clang/include/clang/Sema/Sema.h:1434-1436
+bool addMethod(ObjCMethodDecl *Method) {
+  return AddedMethods.insert(Method).second;
+}

Hmm, I was imagining that the set would be more encapsulated than this, not 
just stored in the same place.

I'm wondering if the following could be done in a prep commit:

- Change Sema::addMethodToGlobalList to a private member function of 
GlobalMethodPool.
- Make GlobalMethodPool::insert private
- Add `GlobalMethodPool::addMethod(ObjCMethodDecl*,bool,bool)`, which does the 
second half of Sema::AddMethodToGlobalPool (the parts that don't need Sema's 
other fields), and change the latter to call the former.

WDYT?


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D109632/new/

https://reviews.llvm.org/D109632

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D109632: [clang] de-duplicate methods from AST files

2021-09-16 Thread Richard Howell via Phabricator via cfe-commits
rmaz updated this revision to Diff 373072.
rmaz added a comment.

Update with single DenseSet per GlobalMethodPool


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D109632/new/

https://reviews.llvm.org/D109632

Files:
  clang/include/clang/Sema/Sema.h
  clang/lib/Sema/SemaDeclObjC.cpp


Index: clang/lib/Sema/SemaDeclObjC.cpp
===
--- clang/lib/Sema/SemaDeclObjC.cpp
+++ clang/lib/Sema/SemaDeclObjC.cpp
@@ -3302,6 +3302,10 @@
 
 void Sema::addMethodToGlobalList(ObjCMethodList *List,
  ObjCMethodDecl *Method) {
+  // Do not insert duplicate methods into the method pool.
+  if (!MethodPool.addMethod(Method))
+return;
+
   // Record at the head of the list whether there were 0, 1, or >= 2 methods
   // inside categories.
   if (ObjCCategoryDecl *CD =
Index: clang/include/clang/Sema/Sema.h
===
--- clang/include/clang/Sema/Sema.h
+++ clang/include/clang/Sema/Sema.h
@@ -1431,9 +1431,13 @@
 }
 int count(Selector Sel) const { return Methods.count(Sel); }
 bool empty() const { return Methods.empty(); }
+bool addMethod(ObjCMethodDecl *Method) {
+  return AddedMethods.insert(Method).second;
+}
 
   private:
 llvm::DenseMap Methods;
+llvm::DenseSet AddedMethods;
   };
 
   /// Method Pool - allows efficient lookup when typechecking messages to "id".


Index: clang/lib/Sema/SemaDeclObjC.cpp
===
--- clang/lib/Sema/SemaDeclObjC.cpp
+++ clang/lib/Sema/SemaDeclObjC.cpp
@@ -3302,6 +3302,10 @@
 
 void Sema::addMethodToGlobalList(ObjCMethodList *List,
  ObjCMethodDecl *Method) {
+  // Do not insert duplicate methods into the method pool.
+  if (!MethodPool.addMethod(Method))
+return;
+
   // Record at the head of the list whether there were 0, 1, or >= 2 methods
   // inside categories.
   if (ObjCCategoryDecl *CD =
Index: clang/include/clang/Sema/Sema.h
===
--- clang/include/clang/Sema/Sema.h
+++ clang/include/clang/Sema/Sema.h
@@ -1431,9 +1431,13 @@
 }
 int count(Selector Sel) const { return Methods.count(Sel); }
 bool empty() const { return Methods.empty(); }
+bool addMethod(ObjCMethodDecl *Method) {
+  return AddedMethods.insert(Method).second;
+}
 
   private:
 llvm::DenseMap Methods;
+llvm::DenseSet AddedMethods;
   };
 
   /// Method Pool - allows efficient lookup when typechecking messages to "id".
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D109632: [clang] de-duplicate methods from AST files

2021-09-16 Thread Richard Howell via Phabricator via cfe-commits
rmaz added inline comments.



Comment at: clang/include/clang/Sema/Sema.h:1423-1426
+ObjCMethodList InstanceMethods;
+ObjCMethodList FactoryMethods;
+llvm::DenseSet AddedInstanceMethods;
+llvm::DenseSet AddedFactoryMethods;

dexonsmith wrote:
> Do these two sets really need to be per-selector (per `GlobalMethods` object 
> in GlobalMethodPool)? I feel like they could be global to `Sema`, as long as 
> the same `ObjCMethodDecl` is never added to GlobalMethodPool for more than 
> one selector. (It seems plausible we have that property since presumably 
> ObjCMethodDecl::getSelector is always the key used for inserting into 
> GlobalMethodPool below... can you confirm?)
> 
> Assuming you can pull the sets out to represent the GlobalMethodPool as a 
> whole, then I suggest creating a data structure for GlobalMethodPool that 
> encapsulates the DenseMap and the two sets (maybe: rename "GlobalMethods" to 
> "GlobalMethodLists", rename "GlobalMethodPool" to "GlobalMethods", and give 
> the new data structure the name "GlobalMethodPool"). Best to do it as 
> multiple commits: NFC commit(s) to refactor (renames, create new type update 
> call sites to use new APIs, etc.), and a final commit that changes the 
> functionality (adding the set behaviour) but that doesn't need to touch call 
> sites.
> 
> On the other hand, if the sets really need to be per-selector (please explain 
> why if so...), please use SmallPtrSet here instead of DenseSet to avoid 
> allocation in the common case of 1 decl per selector. I'd suggest 
> encapsulating the list/set together somehow (maybe starting with an NFC 
> refactoring to add a "list/set" data structure at the top-level (maybe rename 
> ObjCMethodList => ObjCMethodListNode, and the new list has a private member 
> called "Head" and necessary APIs for insertion/etc), then in a second commit 
> adding the SmallPtrSet into the list data structure and use it to gate the 
> "insert" operation).
> 
> I feel like they could be global to Sema, as long as the same ObjCMethodDecl 
> is never added to GlobalMethodPool for more than one selector. (It seems 
> plausible we have that property since presumably ObjCMethodDecl::getSelector 
> is always the key used for inserting into GlobalMethodPool below... can you 
> confirm?)

Yes, that is correct. We could also use a single set instead of two as the 
instance and factory methods would have different decls anyway.

> Assuming you can pull the sets out to represent the GlobalMethodPool as a 
> whole, then I suggest creating a data structure for GlobalMethodPool that 
> encapsulates the DenseMap and the two sets

I'll go with this approach with the single set, starting with a data structure 
refactor diff.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D109632/new/

https://reviews.llvm.org/D109632

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D109632: [clang] de-duplicate methods from AST files

2021-09-15 Thread Duncan P. N. Exon Smith via Phabricator via cfe-commits
dexonsmith added inline comments.



Comment at: clang/include/clang/Sema/Sema.h:1423-1426
+ObjCMethodList InstanceMethods;
+ObjCMethodList FactoryMethods;
+llvm::DenseSet AddedInstanceMethods;
+llvm::DenseSet AddedFactoryMethods;

Do these two sets really need to be per-selector (per `GlobalMethods` object in 
GlobalMethodPool)? I feel like they could be global to `Sema`, as long as the 
same `ObjCMethodDecl` is never added to GlobalMethodPool for more than one 
selector. (It seems plausible we have that property since presumably 
ObjCMethodDecl::getSelector is always the key used for inserting into 
GlobalMethodPool below... can you confirm?)

Assuming you can pull the sets out to represent the GlobalMethodPool as a 
whole, then I suggest creating a data structure for GlobalMethodPool that 
encapsulates the DenseMap and the two sets (maybe: rename "GlobalMethods" to 
"GlobalMethodLists", rename "GlobalMethodPool" to "GlobalMethods", and give the 
new data structure the name "GlobalMethodPool"). Best to do it as multiple 
commits: NFC commit(s) to refactor (renames, create new type update call sites 
to use new APIs, etc.), and a final commit that changes the functionality 
(adding the set behaviour) but that doesn't need to touch call sites.

On the other hand, if the sets really need to be per-selector (please explain 
why if so...), please use SmallPtrSet here instead of DenseSet to avoid 
allocation in the common case of 1 decl per selector. I'd suggest encapsulating 
the list/set together somehow (maybe starting with an NFC refactoring to add a 
"list/set" data structure at the top-level (maybe rename ObjCMethodList => 
ObjCMethodListNode, and the new list has a private member called "Head" and 
necessary APIs for insertion/etc), then in a second commit adding the 
SmallPtrSet into the list data structure and use it to gate the "insert" 
operation).



Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D109632/new/

https://reviews.llvm.org/D109632

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D109632: [clang] de-duplicate methods from AST files

2021-09-14 Thread Richard Howell via Phabricator via cfe-commits
rmaz updated this revision to Diff 372598.
rmaz added a comment.

update to fix lint warnings


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D109632/new/

https://reviews.llvm.org/D109632

Files:
  clang/include/clang/Sema/Sema.h
  clang/lib/Sema/SemaCodeComplete.cpp
  clang/lib/Sema/SemaDeclObjC.cpp
  clang/lib/Sema/SemaExprObjC.cpp
  clang/lib/Serialization/ASTReader.cpp
  clang/lib/Serialization/ASTWriter.cpp

Index: clang/lib/Serialization/ASTWriter.cpp
===
--- clang/lib/Serialization/ASTWriter.cpp
+++ clang/lib/Serialization/ASTWriter.cpp
@@ -3122,8 +3122,8 @@
 ObjCMethodList()
   };
   if (F != SemaRef.MethodPool.end()) {
-Data.Instance = F->second.first;
-Data.Factory = F->second.second;
+Data.Instance = F->second.InstanceMethods;
+Data.Factory = F->second.FactoryMethods;
   }
   // Only write this selector if it's not in an existing AST or something
   // changed.
Index: clang/lib/Serialization/ASTReader.cpp
===
--- clang/lib/Serialization/ASTReader.cpp
+++ clang/lib/Serialization/ASTReader.cpp
@@ -4006,8 +4006,9 @@
 return;
 
   // Retrieve the appropriate method list.
-  ObjCMethodList  = Method->isInstanceMethod()? Known->second.first
-: Known->second.second;
+  ObjCMethodList  = Method->isInstanceMethod()
+  ? Known->second.InstanceMethods
+  : Known->second.FactoryMethods;
   bool Found = false;
   for (ObjCMethodList *List =  List; List = List->getNext()) {
 if (!Found) {
@@ -8180,9 +8181,10 @@
 
 /// Add the given set of methods to the method list.
 static void addMethodsToPool(Sema , ArrayRef Methods,
- ObjCMethodList ) {
-  for (unsigned I = 0, N = Methods.size(); I != N; ++I) {
-S.addMethodToGlobalList(, Methods[I]);
+ ObjCMethodList ,
+ llvm::DenseSet ) {
+  for (auto *M : Methods) {
+S.addMethodToGlobalList(, , M);
   }
 }
 
@@ -8211,16 +8213,20 @@
   Sema::GlobalMethodPool::iterator Pos
 = S.MethodPool.insert(std::make_pair(Sel, Sema::GlobalMethods())).first;
 
-  Pos->second.first.setBits(Visitor.getInstanceBits());
-  Pos->second.first.setHasMoreThanOneDecl(Visitor.instanceHasMoreThanOneDecl());
-  Pos->second.second.setBits(Visitor.getFactoryBits());
-  Pos->second.second.setHasMoreThanOneDecl(Visitor.factoryHasMoreThanOneDecl());
+  Pos->second.InstanceMethods.setBits(Visitor.getInstanceBits());
+  Pos->second.InstanceMethods.setHasMoreThanOneDecl(
+  Visitor.instanceHasMoreThanOneDecl());
+  Pos->second.FactoryMethods.setBits(Visitor.getFactoryBits());
+  Pos->second.FactoryMethods.setHasMoreThanOneDecl(
+  Visitor.factoryHasMoreThanOneDecl());
 
   // Add methods to the global pool *after* setting hasMoreThanOneDecl, since
   // when building a module we keep every method individually and may need to
   // update hasMoreThanOneDecl as we add the methods.
-  addMethodsToPool(S, Visitor.getInstanceMethods(), Pos->second.first);
-  addMethodsToPool(S, Visitor.getFactoryMethods(), Pos->second.second);
+  addMethodsToPool(S, Visitor.getInstanceMethods(), Pos->second.InstanceMethods,
+   Pos->second.AddedInstanceMethods);
+  addMethodsToPool(S, Visitor.getFactoryMethods(), Pos->second.FactoryMethods,
+   Pos->second.AddedFactoryMethods);
 }
 
 void ASTReader::updateOutOfDateSelector(Selector Sel) {
Index: clang/lib/Sema/SemaExprObjC.cpp
===
--- clang/lib/Sema/SemaExprObjC.cpp
+++ clang/lib/Sema/SemaExprObjC.cpp
@@ -1215,13 +1215,13 @@
   for (Sema::GlobalMethodPool::iterator b = S.MethodPool.begin(),
e = S.MethodPool.end(); b != e; b++) {
 // first, instance methods
-ObjCMethodList  = b->second.first;
+ObjCMethodList  = b->second.InstanceMethods;
 if (HelperToDiagnoseMismatchedMethodsInGlobalPool(S, AtLoc, LParenLoc, RParenLoc,
   Method, InstMethList))
   Warned = true;
 
 // second, class methods
-ObjCMethodList  = b->second.second;
+ObjCMethodList  = b->second.FactoryMethods;
 if (HelperToDiagnoseMismatchedMethodsInGlobalPool(S, AtLoc, LParenLoc, RParenLoc,
   Method, ClsMethList) || Warned)
   return;
@@ -1262,9 +1262,9 @@
 return nullptr;
 
   ObjCMethodDecl *DirectInstance = LookupDirectMethodInMethodList(
-  S, Sel, Iter->second.first, onlyDirect, anyDirect);
+  S, Sel, Iter->second.InstanceMethods, onlyDirect, anyDirect);
   ObjCMethodDecl *DirectClass = LookupDirectMethodInMethodList(
-  S, Sel, Iter->second.second, onlyDirect, anyDirect);
+  S, Sel, 

[PATCH] D109632: [clang] de-duplicate methods from AST files

2021-09-14 Thread Richard Howell via Phabricator via cfe-commits
rmaz updated this revision to Diff 372597.
rmaz added a comment.

Updated to use a GlobalMethods struct which contains the global
method lists as well as the sets used to de-duplicate added methods.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D109632/new/

https://reviews.llvm.org/D109632

Files:
  clang/include/clang/Sema/Sema.h
  clang/lib/Sema/SemaCodeComplete.cpp
  clang/lib/Sema/SemaDeclObjC.cpp
  clang/lib/Sema/SemaExprObjC.cpp
  clang/lib/Serialization/ASTReader.cpp
  clang/lib/Serialization/ASTWriter.cpp

Index: clang/lib/Serialization/ASTWriter.cpp
===
--- clang/lib/Serialization/ASTWriter.cpp
+++ clang/lib/Serialization/ASTWriter.cpp
@@ -3122,8 +3122,8 @@
 ObjCMethodList()
   };
   if (F != SemaRef.MethodPool.end()) {
-Data.Instance = F->second.first;
-Data.Factory = F->second.second;
+Data.Instance = F->second.InstanceMethods;
+Data.Factory = F->second.FactoryMethods;
   }
   // Only write this selector if it's not in an existing AST or something
   // changed.
Index: clang/lib/Serialization/ASTReader.cpp
===
--- clang/lib/Serialization/ASTReader.cpp
+++ clang/lib/Serialization/ASTReader.cpp
@@ -4006,8 +4006,9 @@
 return;
 
   // Retrieve the appropriate method list.
-  ObjCMethodList  = Method->isInstanceMethod()? Known->second.first
-: Known->second.second;
+  ObjCMethodList  = Method->isInstanceMethod()
+  ? Known->second.InstanceMethods
+  : Known->second.FactoryMethods;
   bool Found = false;
   for (ObjCMethodList *List =  List; List = List->getNext()) {
 if (!Found) {
@@ -8180,9 +8181,10 @@
 
 /// Add the given set of methods to the method list.
 static void addMethodsToPool(Sema , ArrayRef Methods,
- ObjCMethodList ) {
-  for (unsigned I = 0, N = Methods.size(); I != N; ++I) {
-S.addMethodToGlobalList(, Methods[I]);
+ ObjCMethodList ,
+ llvm::DenseSet ) {
+  for (auto *M : Methods) {
+S.addMethodToGlobalList(, , M);
   }
 }
 
@@ -8211,16 +8213,20 @@
   Sema::GlobalMethodPool::iterator Pos
 = S.MethodPool.insert(std::make_pair(Sel, Sema::GlobalMethods())).first;
 
-  Pos->second.first.setBits(Visitor.getInstanceBits());
-  Pos->second.first.setHasMoreThanOneDecl(Visitor.instanceHasMoreThanOneDecl());
-  Pos->second.second.setBits(Visitor.getFactoryBits());
-  Pos->second.second.setHasMoreThanOneDecl(Visitor.factoryHasMoreThanOneDecl());
+  Pos->second.InstanceMethods.setBits(Visitor.getInstanceBits());
+  Pos->second.InstanceMethods.setHasMoreThanOneDecl(
+  Visitor.instanceHasMoreThanOneDecl());
+  Pos->second.FactoryMethods.setBits(Visitor.getFactoryBits());
+  Pos->second.FactoryMethods.setHasMoreThanOneDecl(
+  Visitor.factoryHasMoreThanOneDecl());
 
   // Add methods to the global pool *after* setting hasMoreThanOneDecl, since
   // when building a module we keep every method individually and may need to
   // update hasMoreThanOneDecl as we add the methods.
-  addMethodsToPool(S, Visitor.getInstanceMethods(), Pos->second.first);
-  addMethodsToPool(S, Visitor.getFactoryMethods(), Pos->second.second);
+  addMethodsToPool(S, Visitor.getInstanceMethods(), Pos->second.InstanceMethods,
+   Pos->second.AddedInstanceMethods);
+  addMethodsToPool(S, Visitor.getFactoryMethods(), Pos->second.FactoryMethods,
+   Pos->second.AddedFactoryMethods);
 }
 
 void ASTReader::updateOutOfDateSelector(Selector Sel) {
Index: clang/lib/Sema/SemaExprObjC.cpp
===
--- clang/lib/Sema/SemaExprObjC.cpp
+++ clang/lib/Sema/SemaExprObjC.cpp
@@ -1215,13 +1215,13 @@
   for (Sema::GlobalMethodPool::iterator b = S.MethodPool.begin(),
e = S.MethodPool.end(); b != e; b++) {
 // first, instance methods
-ObjCMethodList  = b->second.first;
+ObjCMethodList  = b->second.InstanceMethods;
 if (HelperToDiagnoseMismatchedMethodsInGlobalPool(S, AtLoc, LParenLoc, RParenLoc,
   Method, InstMethList))
   Warned = true;
 
 // second, class methods
-ObjCMethodList  = b->second.second;
+ObjCMethodList  = b->second.FactoryMethods;
 if (HelperToDiagnoseMismatchedMethodsInGlobalPool(S, AtLoc, LParenLoc, RParenLoc,
   Method, ClsMethList) || Warned)
   return;
@@ -1262,9 +1262,9 @@
 return nullptr;
 
   ObjCMethodDecl *DirectInstance = LookupDirectMethodInMethodList(
-  S, Sel, Iter->second.first, onlyDirect, anyDirect);
+  S, Sel, Iter->second.InstanceMethods, onlyDirect, anyDirect);
   ObjCMethodDecl *DirectClass = 

[PATCH] D109632: [clang] de-duplicate methods from AST files

2021-09-14 Thread Richard Howell via Phabricator via cfe-commits
rmaz added a comment.

In D109632#3000469 , @vsapsai wrote:

> I'm a little bit confused here. Where is the speed-up coming from? Is it 
> because ObjCMethodList for `init` not being serialized? I'm asking because 
> right now I don't see how `seen` helps with that.
>
> My current understanding is that `Sema::addMethodToGlobalList` is too 
> expensive and `seen` reduces the number of calls to it. So maybe it makes 
> sense to invest into making it faster? For example, 
> `Sema::MatchTwoMethodDeclarations` can return early if both method decl 
> pointers are equal. But I haven't done any measurements, that's just an 
> example.

The speedup is coming from reducing the number of times 
`Sema::addMethodToGlobalList` is called when looking up methods loaded from 
modules. From what I can see each serialized module will contain an 
ObjCMethodList with all the methods from all visible modules at the point of 
serialization. This means that if you have a lot of modules that redeclare 
`-(id)init`, which is a common pattern in our code, then each serialized module 
will contain a lot of shared method decls, which we do not de-duplicate. To 
illustrate this, here are the top 5 instance method counts sorted by amount of 
duplicates (using pointer comparison) from a pass of the 
`ReadMethodPoolVisitor` for a single file:

| **selector** | **# methods** | **# unique** | **# duplicated** |
| init | 9825  | 1652 | 8173 |
| init | 2155  | 930  | 1225 |
| copy | 1398  | 633  | 765  |
| init | 1027  | 270  | 757  |
| init | 948   | 410  | 538  |
|

Without having a set to deduplicate I'm not sure how we could make 
`Sema::addMethodToGlobalList` fast enough, wouldn't it require a list traversal 
for each insert, making it O(n^2)?


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D109632/new/

https://reviews.llvm.org/D109632

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D109632: [clang] de-duplicate methods from AST files

2021-09-14 Thread Volodymyr Sapsai via Phabricator via cfe-commits
vsapsai added a comment.

I'm a little bit confused here. Where is the speed-up coming from? Is it 
because ObjCMethodList for `init` not being serialized? I'm asking because 
right now I don't see how `seen` helps with that.

My current understanding is that `Sema::addMethodToGlobalList` is too expensive 
and `seen` reduces the number of calls to it. So maybe it makes sense to invest 
into making it faster? For example, `Sema::MatchTwoMethodDeclarations` can 
return early if both method decl pointers are equal. But I haven't done any 
measurements, that's just an example.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D109632/new/

https://reviews.llvm.org/D109632

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D109632: [clang] de-duplicate methods from AST files

2021-09-14 Thread Richard Howell via Phabricator via cfe-commits
rmaz added inline comments.



Comment at: clang/lib/Serialization/ASTReader.cpp:8188-8190
+  for (auto *L =  L; L = L->getNext()) {
+seen.insert(L->getMethod());
+  }

dexonsmith wrote:
> I find quadratic algorithms a bit scary, even when current benchmarks don't 
> expose problems. For example, it seems like this could blow up on the 
> following workload:
> - one module adds many things to the global method list
> - there are many (fine-grained) modules that transitively load that module
> 
> Or have I misunderstood the complexity here?
Yes, I take your point, if the method pool generation updates inbetween each of 
the later modules then it is possible to hit this case.



Comment at: clang/lib/Serialization/ASTReader.cpp:8194
+if (seen.insert(M).second) {
+  S.addMethodToGlobalList(, M);
+}

dexonsmith wrote:
> rmaz wrote:
> > manmanren wrote:
> > > Does it make sense to check for duplication inside addMethodToGlobalList, 
> > > as the function goes through the list as well? Maybe it is slower, as we 
> > > will need to go through the list for each method, instead of a lookup.
> > Yes, you are right, it is slower as we need to do a list traverse per 
> > insert rather than per selector lookup. I also profiled keeping some global 
> > state along with the `MethodPool` so that the set didn't have to be rebuilt 
> > each time, but the performance difference was negligible.
> Can you take another look at the approach you experimented with, putting the 
> global state in the MethodPool? Besides the performance difference (you 
> measured it as negligible but it'd avoid the concern I have about uncommon 
> workloads hitting quadratic blowups), it'd also provide consistency in 
> behaviour between callers of `addMethodToGlobalList`... and enable you to add 
> a unit test for the API change to MethodPool.
I'll update with this approach, it should allow for moving the set insert logic 
into `addMethodToGlobalList` in this case.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D109632/new/

https://reviews.llvm.org/D109632

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D109632: [clang] de-duplicate methods from AST files

2021-09-14 Thread Duncan P. N. Exon Smith via Phabricator via cfe-commits
dexonsmith added a reviewer: vsapsai.
dexonsmith added a subscriber: vsapsai.
dexonsmith added a comment.

Thanks for looking at this!

I have a couple of comments inline. @vsapsai, can you also take a look?




Comment at: clang/lib/Serialization/ASTReader.cpp:8188-8190
+  for (auto *L =  L; L = L->getNext()) {
+seen.insert(L->getMethod());
+  }

I find quadratic algorithms a bit scary, even when current benchmarks don't 
expose problems. For example, it seems like this could blow up on the following 
workload:
- one module adds many things to the global method list
- there are many (fine-grained) modules that transitively load that module

Or have I misunderstood the complexity here?



Comment at: clang/lib/Serialization/ASTReader.cpp:8194
+if (seen.insert(M).second) {
+  S.addMethodToGlobalList(, M);
+}

rmaz wrote:
> manmanren wrote:
> > Does it make sense to check for duplication inside addMethodToGlobalList, 
> > as the function goes through the list as well? Maybe it is slower, as we 
> > will need to go through the list for each method, instead of a lookup.
> Yes, you are right, it is slower as we need to do a list traverse per insert 
> rather than per selector lookup. I also profiled keeping some global state 
> along with the `MethodPool` so that the set didn't have to be rebuilt each 
> time, but the performance difference was negligible.
Can you take another look at the approach you experimented with, putting the 
global state in the MethodPool? Besides the performance difference (you 
measured it as negligible but it'd avoid the concern I have about uncommon 
workloads hitting quadratic blowups), it'd also provide consistency in 
behaviour between callers of `addMethodToGlobalList`... and enable you to add a 
unit test for the API change to MethodPool.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D109632/new/

https://reviews.llvm.org/D109632

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D109632: [clang] de-duplicate methods from AST files

2021-09-14 Thread Manman Ren via Phabricator via cfe-commits
manmanren added a comment.

@dexonsmith @bruno: are you okay with this change? It looks good to me :]

Thanks,
Manman


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D109632/new/

https://reviews.llvm.org/D109632

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D109632: [clang] de-duplicate methods from AST files

2021-09-10 Thread Richard Howell via Phabricator via cfe-commits
rmaz added inline comments.



Comment at: clang/lib/Serialization/ASTReader.cpp:8194
+if (seen.insert(M).second) {
+  S.addMethodToGlobalList(, M);
+}

manmanren wrote:
> Does it make sense to check for duplication inside addMethodToGlobalList, as 
> the function goes through the list as well? Maybe it is slower, as we will 
> need to go through the list for each method, instead of a lookup.
Yes, you are right, it is slower as we need to do a list traverse per insert 
rather than per selector lookup. I also profiled keeping some global state 
along with the `MethodPool` so that the set didn't have to be rebuilt each 
time, but the performance difference was negligible.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D109632/new/

https://reviews.llvm.org/D109632

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D109632: [clang] de-duplicate methods from AST files

2021-09-10 Thread Manman Ren via Phabricator via cfe-commits
manmanren added a comment.

This looks good to me in general. Since it should not change functionality, it 
may not be possible to write a test case.

Manman




Comment at: clang/lib/Serialization/ASTReader.cpp:8194
+if (seen.insert(M).second) {
+  S.addMethodToGlobalList(, M);
+}

Does it make sense to check for duplication inside addMethodToGlobalList, as 
the function goes through the list as well? Maybe it is slower, as we will need 
to go through the list for each method, instead of a lookup.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D109632/new/

https://reviews.llvm.org/D109632

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D109632: [clang] de-duplicate methods from AST files

2021-09-10 Thread Richard Howell via Phabricator via cfe-commits
rmaz created this revision.
rmaz requested review of this revision.
Herald added a project: clang.
Herald added a subscriber: cfe-commits.

This diff will de-duplicate methods read from AST files before
inserting them in to a global method pool list. When reading
ObjCMethodDecl from AST files we can end up with a significant
amount of duplication when modules contain redeclarations of
system framework methods. For instance a common pattern is to
redeclare `-(instancetype)init` with `NS_UNAVAILABLE`, which
results in the entire ObjCMethodList for `init` being serialized
in each module with this redeclaration.

Measuring this against our codebase for files that use `-fmodules`
shows an overall 19% compile time improvement, and in some cases
as much as 79% for files with a lot of modular dependencies.


Repository:
  rG LLVM Github Monorepo

https://reviews.llvm.org/D109632

Files:
  clang/lib/Serialization/ASTReader.cpp


Index: clang/lib/Serialization/ASTReader.cpp
===
--- clang/lib/Serialization/ASTReader.cpp
+++ clang/lib/Serialization/ASTReader.cpp
@@ -8181,8 +8181,18 @@
 /// Add the given set of methods to the method list.
 static void addMethodsToPool(Sema , ArrayRef Methods,
  ObjCMethodList ) {
-  for (unsigned I = 0, N = Methods.size(); I != N; ++I) {
-S.addMethodToGlobalList(, Methods[I]);
+  // Methods from visited modules can contain a lot of duplicates
+  // when redeclaring methods from system frameworks, for example
+  // when marking -(instancetype)init as NS_UNAVAILABLE.
+  llvm::DenseSet seen;
+  for (auto *L =  L; L = L->getNext()) {
+seen.insert(L->getMethod());
+  }
+
+  for (auto *M : Methods) {
+if (seen.insert(M).second) {
+  S.addMethodToGlobalList(, M);
+}
   }
 }
 


Index: clang/lib/Serialization/ASTReader.cpp
===
--- clang/lib/Serialization/ASTReader.cpp
+++ clang/lib/Serialization/ASTReader.cpp
@@ -8181,8 +8181,18 @@
 /// Add the given set of methods to the method list.
 static void addMethodsToPool(Sema , ArrayRef Methods,
  ObjCMethodList ) {
-  for (unsigned I = 0, N = Methods.size(); I != N; ++I) {
-S.addMethodToGlobalList(, Methods[I]);
+  // Methods from visited modules can contain a lot of duplicates
+  // when redeclaring methods from system frameworks, for example
+  // when marking -(instancetype)init as NS_UNAVAILABLE.
+  llvm::DenseSet seen;
+  for (auto *L =  L; L = L->getNext()) {
+seen.insert(L->getMethod());
+  }
+
+  for (auto *M : Methods) {
+if (seen.insert(M).second) {
+  S.addMethodToGlobalList(, M);
+}
   }
 }
 
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits