NagyDonat wrote:

> How I read this, you have mainly 2 concerns:
>
> 1. The use of this strong type makes the existing APIs tedious to use,
>    because one needs to unwrap the value and frequently add an early
>    return to explicitly handle the case when symbol creation failed?

Yes, this is roughly what I mean. In addition to the tediousness of writing the _unwrap + early return_ boilerplate (something like the sketch below), I also fear that these early-return branches would be difficult to keep in mind _and_ difficult to cover with tests, so they would act as a persistent source of bugs.
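To make that concern concrete, this is roughly the kind of call-site code I'm worried about. It is only a sketch with made-up names (`maybeMakeSymSymExpr`, `evalSomething`, the explicit `MaxComplexity` parameter) and is *not* the API this patch actually introduces; only `computeComplexity()` and the usual `SymbolManager`/`SVal` machinery are meant to be the real things:

```cpp
// Sketch only: hypothetical fallible symbol creation, not this patch's real API.
#include "clang/StaticAnalyzer/Core/PathSensitive/SVals.h"
#include "clang/StaticAnalyzer/Core/PathSensitive/SymbolManager.h"
#include <optional>

using namespace clang;
using namespace ento;

// Hypothetical helper: refuses to build a SymSymExpr above a complexity budget.
static std::optional<SymbolRef>
maybeMakeSymSymExpr(SymbolManager &SymMgr, SymbolRef LHS,
                    BinaryOperator::Opcode Op, SymbolRef RHS, QualType Ty,
                    unsigned MaxComplexity) {
  const SymSymExpr *Sym = SymMgr.getSymSymExpr(LHS, Op, RHS, Ty);
  if (Sym->computeComplexity() > MaxComplexity)
    return std::nullopt;
  return Sym;
}

// Every call site now needs the unwrap + early-return dance, and the early
// return is exactly the branch that is easy to forget and hard to cover.
static SVal evalSomething(SymbolManager &SymMgr, SymbolRef L, SymbolRef R,
                          QualType Ty, unsigned MaxComplexity) {
  std::optional<SymbolRef> Sym =
      maybeMakeSymSymExpr(SymMgr, L, BO_Add, R, Ty, MaxComplexity);
  if (!Sym)
    return UnknownVal(); // the "symbol creation failed" case
  return nonloc::SymbolVal(*Sym);
}
```

Multiply that `if (!Sym) return ...;` by every place that builds a symbol, and each of those branches also needs a dedicated test before it can be considered covered.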
> 2. The measurements weren't conclusive. There was no evidence provided
>    that in the usual cases (all entry points) the RT would not regress. It was
>    also not fair to look at only the longest 500 entry points to evaluate the
>    effectiveness of limiting the max symbol complexity (in other words, honoring
>    the max symbol complexity limit).

I don't claim that there was _no_ evidence at all, but I feel that it wasn't significant enough.

> > [...] [regression toward the mean](https://en.wikipedia.org/wiki/Regression_toward_the_mean) [...]
>
> I formulated the test case in the tests by inspecting a long-running test
> case. It is consistently low-performing. You can also check it on godbolt:
> it would not finish, because the symbol simplifier would need to do so much
> work due to walking the overly complicated symbols. I've also inspected
> about 10 other radically improved cases (a couple of months ago), and all
> showed a large number of binary-op manipulations like hashing, or resembled
> the test case I supply in this patch. This doesn't seem to be a coincidence
> to me. From my experience, our pool is large enough to be consistent and
> roughly reproducible for long-ish entry points.

This additional information reduces my fears that the measured runtime difference is just environmental noise. However, extreme outliers (and I'd say that the top 500 out of 3'000'000, or even just 390'000, is an extreme outlier) are still very treacherous ground for statistical conclusions (they can amplify small noise to a surprising degree), so I would like to see a more comprehensive statistical analysis. If you can share the raw data (the 3 million `{old runtime, new runtime}` pairs), I can do this myself.

> According to our data, an entry point should usually finish in about 1
> second, if not less. Anything above that suggests something to look at.

Fair, although I'd emphasize that these are not "must be eliminated" bugs, just "something to look at" -- which may or may not lead to an improvement.

_I don't think that symbol complexity is a "low-hanging fruit" for slow entry points -- instead of this I'd suggest investigating the heuristics related to inlining and inlined function size. However, this is far off-topic -- I'll try to start a Discourse thread about it eventually, once I clarify my suspicions._

> To me, encountering symbols with complexity over the dedicated max symbol
> complexity threshold is a bug.

I don't agree -- very complex symbols naturally appear from the straightforward recursive definition of symbols that we use, and they correspond to well-formed expressions. You can say that symbols over a certain complexity threshold are so rare that the analyzer can arbitrarily discard them, and this may (or may not) be a useful heuristic -- but the existence of complex symbols is still the "natural", bug-free state, and suppressing them is the inaccurate behavior which complicates the situation.

> Can you think of other ways to ensure we never create overly complicated
> symbols?

I'm not convinced that we _need_ to ensure that we never create overly complicated symbols.

I feel that this is a "cure is worse than the disease" situation -- this patch introduces a significant amount of code complexity for modest performance gains. However, if the statistical analysis confirms that this is a useful direction, I'd suggest eliminating the complex symbols (more precisely, the states that contain them) in a lazy fashion: put a boolean tag on the state when a complex symbol is stored in it, and prevent further exploration from exploded nodes that contain a tagged state (roughly like the sketch below). This way, this performance-optimization hack could be limited to the engine, and the checkers wouldn't need to know about it at all.
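To make the suggestion concrete, here is a rough sketch of the boolean tag. All names are made up, and I'm using the generic ProgramState-trait machinery only for illustration; the real change would live inside ExprEngine and would not be visible to checkers at all:

```cpp
// Rough sketch of the lazy alternative -- made-up names, not an actual patch.
#include "clang/StaticAnalyzer/Core/PathSensitive/ProgramState.h"
#include "clang/StaticAnalyzer/Core/PathSensitive/ProgramStateTrait.h"
#include "clang/StaticAnalyzer/Core/PathSensitive/SVals.h"
#include "clang/StaticAnalyzer/Core/PathSensitive/SymbolManager.h"

using namespace clang;
using namespace ento;

// One boolean flag on the state: "this state stores an overly complex symbol".
REGISTER_TRAIT_WITH_PROGRAMSTATE(HasOverlyComplexSymbol, bool)

// Wherever the engine binds a value it keeps the symbol as-is, but tags the
// state once the (hypothetical) MaxSymbolComplexity budget is exceeded.
static ProgramStateRef tagIfTooComplex(ProgramStateRef State, SVal V,
                                       unsigned MaxSymbolComplexity) {
  if (SymbolRef Sym = V.getAsSymbol())
    if (Sym->computeComplexity() > MaxSymbolComplexity)
      return State->set<HasOverlyComplexSymbol>(true);
  return State;
}

// When building successor nodes, the engine would then stop exploring paths
// whose state carries the tag -- e.g. by generating a sink node instead of a
// normal transition:
//
//   if (State->get<HasOverlyComplexSymbol>())
//     Bldr.generateSink(S, Pred, State);
//
// so checkers never see a new failure mode and never need to unwrap anything.
```

The point is that symbol creation itself stays infallible: the complexity check happens once, when the value is stored, and the engine simply refuses to grow the path further, so no checker has to learn about the complexity limit.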
https://github.com/llvm/llvm-project/pull/144327
_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits