On May 26, 2009, at 11:43 AM, David Mandelin wrote:
Ted Kremenek wrote:
C++ support in Clang is rapidly progressing,
Cool. Is there a page with notes on the design? I'm curious what
approach you are using. Elsa's GLR design seems like a good
approach, but it doesn't cover all the latest complicated template
features, and the latest problems I saw seemed difficult to solve in
that design. (It was long enough ago that I don't remember the exact
problem.)
Hi David,
Clang uses a recursive descent parser design, and this applies to the
C++ portion as well. We have found that the design works really well,
and leads to a fairly clean implementation that is easy to
understand and extend.
As you pointed out, the Clang documentation on the parser is lacking
(and should be improved). The design itself is simple. The parser
handles the grammar of the language, and calls back into an abstract
interface (called 'Actions') that is responsible for building up the
ASTs, performing the type checking, etc. (the implementation of this
is called 'Sema'). The abstract interface allows the parsing logic to
be (relatively) simple, and allows one to swap in a different
implementation of the abstract interface if one didn't want to do full
type-checking, etc.
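A minimal sketch of that parser/Actions split, using a toy expression
grammar; all of the names here (Actions, BuildAST, act_on_*) are
invented for illustration and are not Clang's actual C++ API:

```python
class Actions:
    """Abstract callback interface the parser drives (the 'Actions' role)."""
    def act_on_number(self, value):
        raise NotImplementedError
    def act_on_binary_op(self, op, lhs, rhs):
        raise NotImplementedError

class BuildAST(Actions):
    """One implementation (the 'Sema'-like role): build AST nodes.
    A different implementation could skip tree-building entirely."""
    def act_on_number(self, value):
        return ("num", value)
    def act_on_binary_op(self, op, lhs, rhs):
        return ("binop", op, lhs, rhs)

class Parser:
    """Recursive descent over a toy grammar: expr := num (('+'|'-') num)*
    The parser knows only the grammar; all semantic work is delegated."""
    def __init__(self, tokens, actions):
        self.tokens = tokens
        self.pos = 0
        self.actions = actions

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def advance(self):
        tok = self.tokens[self.pos]
        self.pos += 1
        return tok

    def parse_expr(self):
        node = self.actions.act_on_number(self.advance())
        while self.peek() in ("+", "-"):
            op = self.advance()
            rhs = self.actions.act_on_number(self.advance())
            node = self.actions.act_on_binary_op(op, node, rhs)
        return node

tree = Parser([1, "+", 2, "-", 3], BuildAST()).parse_expr()
print(tree)  # ('binop', '-', ('binop', '+', ('num', 1), ('num', 2)), ('num', 3))
```

Swapping in an Actions implementation whose callbacks do nothing gives
a syntax-only pass for free, which is the point of the abstraction.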
but because the Clang static analyzer performs static analysis at
the source level, simply having C++ parsing support does not imply
immediate support in the analyzer. Bringing that feature up will
likely require active participation from the open source community.
What about running over a language-independent IR instead? That's
the approach we've used in Treehydra and it seems like it would be
even better because I think you have much cleaner IRs in LLVM.
We definitely thought about this, and we made a deliberate choice to
analyze source code instead of LLVM IR.
Analyzing at the IR level certainly has its benefits, as complicated
language features are lowered to primitive operations, allowing one to
focus on analyzing those primitives. This is both a significant
blessing and an onerous curse, and it all comes down to the tradeoffs
involved.
Our choice of analyzing source code came down to several key factors
(in no particular order):
a) Understanding high-level interfaces.
Many complicated language features reduce to a large number of LLVM IR
instructions, but ultimately we are interested in the macroscopic
actions (e.g., the invocation of a method, which could span many LLVM
IR instructions). Many bugs have to do with reasoning about
interfaces rather than the specific low-level semantics (which are
also important, but can be approximated), and doing this at the source
level can be much easier.
Further, source often captures the programmer's intent in a far more
recognizable way than a lowered representation. Many bugs are
somewhere between the poles of syntax and semantics. For example,
sometimes a potential bug isn't really a bug if it occurs within
a macro, or the code was written in a specific way indicating
that the user didn't care about the "bug". Consider a
dead store:
err = foo();
versus
if (err = foo())
Suppose 'err' is never read after the assignment. According to the
semantics of C, the variable 'err' isn't actually read in either case,
but the first case is more likely to be a programming mistake than the
second (the second can also be an error if they meant to write
'err == foo()', but that is conceptually a different kind of bug).
Certainly distinguishing between these cases can be done at the LLVM
IR level, but it is a little more tricky to do. There are also cases
such as 'i++' and 'i = i + 1' that are potentially indistinguishable
at the LLVM IR level, but could be relevant when determining the
chance that a real bug occurred.
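To make that concrete, here is a toy, hypothetical checker over a
made-up statement representation (not Clang's AST or checker API) that
reports the first form but stays quiet about the second:

```python
def dead_store_warnings(stmts):
    """stmts: list of ('assign', var)    -- err = foo();
                      ('if_assign', var) -- if (err = foo()) ...
                      ('use', var)       -- a later read of var
    Flag stores whose value is never read afterwards, but only for the
    plain-statement form: in 'if (err = foo())' the branch condition
    already consumes the value, so treat that store as intentional."""
    warnings = []
    for i, (kind, var) in enumerate(stmts):
        if kind not in ("assign", "if_assign"):
            continue
        read_later = any(k == "use" and v == var for k, v in stmts[i + 1:])
        if read_later:
            continue
        if kind == "assign":
            warnings.append(f"value stored to '{var}' is never read")
        # 'if_assign': same dead store semantically, but suppressed.
    return warnings

print(dead_store_warnings([("assign", "err")]))     # warns
print(dead_store_warnings([("if_assign", "err")]))  # no warning
```

The suppression decision here is purely syntactic, which is exactly
the kind of intent-reading that is awkward to recover from lowered IR.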
In other words, precisely analyzing semantics isn't always enough, and
understanding the intent of the programmer, which often boils down to
looking at syntax, is often very useful when determining whether or
not a real error is present.
b) Language types
Often high-level language types are essentially completely erased at
the LLVM IR level, being lowered to structs, etc. The high-level type
system is especially useful when one is analyzing a language with a
rich OO type system such as Objective-C or C++. This is useful both
for reasoning about high-level interfaces (my previous point) and for
thinking about virtual function calls, etc.
c) Great diagnostics
Clang's preprocessor and parser are integrated, meaning the ASTs have
full information regarding macros, pragmas, the #include stack, and so
on. Clang also has full source range information, with locations for
individual '{' tokens, etc. This allows the analyzer to report
excellent diagnostics with full column and line information, source
ranges, etc. Such rich location information also allows us to
potentially tie into code refactoring operations that could be used to
either fix bugs or to transform the code in some other useful way.
While it is possible to tie much of the LLVM IR back to the original
source, this isn't always trivial, as the lowering can be
architecture dependent. Moreover, because some language-level
features (such as an Objective-C method invocation) lower to many
LLVM IR instructions, performing the back-mapping in many cases can be
non-trivial and error prone.
d) Sometimes lying gets you closer to the truth
Precisely handling various operations such as sign-extension, bit
masking, etc., when reasoning about symbolic values can be
challenging. Instead of being perfect, I think it is easier to
approximate the truth when analyzing source code than when analyzing
LLVM IR (since operations can be broken up over many instructions).
At a high-level representation, it is often easier to understand what
is important and what is not when it comes to precisely analyzing a
fragment of code. Sometimes not handling certain details just doesn't
really matter, and in certain cases where clang's analyzer currently
doesn't handle something well we can often recover path-sensitivity by
making up new symbolic values, etc., when the result of an operation
is "too complicated" to reason about. I think this kind of cheating
is often easier to do at a high level than when using a lowered
representation, but opinions may differ.
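A very loose sketch of that kind of recovery (not Clang's actual
machinery; the names and representation are invented): evaluate what
the analyzer can model precisely, and conjure a fresh symbolic value
whenever an operation is too complicated, so path exploration can
continue instead of giving up.

```python
import itertools

_fresh = itertools.count()

def fresh_symbol():
    """Mint a new, unconstrained symbolic value."""
    return f"$sym{next(_fresh)}"

def eval_binop(op, lhs, rhs):
    """Return a concrete int when both operands are concrete and the
    operator is one we model; otherwise approximate the result with a
    fresh symbol rather than halting the analysis."""
    simple = {"+": lambda a, b: a + b, "-": lambda a, b: a - b}
    if isinstance(lhs, int) and isinstance(rhs, int) and op in simple:
        return simple[op](lhs, rhs)
    # Too complicated (symbolic operand, or an unmodeled op like '>>'):
    # lie a little, and keep path-sensitivity alive downstream.
    return fresh_symbol()

print(eval_binop("+", 2, 3))        # 5
print(eval_binop(">>", 2, 3))       # a fresh symbol
print(eval_binop("+", "$sym0", 3))  # another fresh symbol
```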
Of course analyzing source code can be hard. One has to reason about
arbitrary casts, short-circuit operations, etc., that are all
simplified when lowered to the LLVM IR level. However, I would argue
that once the core logic to handle such things is implemented, the
hard work in implementing the analyzer is elsewhere (e.g., reasoning
about symbolic values and abstracted program memory, etc.).
The clang analyzer currently does mostly local analysis,
essentially operating under the conservative approximation that the
implementations of called functions/methods are unavailable
for analysis. The plan is to add more global analysis over time,
hopefully over the next year (time permitting).
We generally do unsound analysis instead (assuming the callees do
nothing, or do a little bit we can guess at, like writing to
reference-typed arguments) to cut down on false positives. Maybe the
best possible tool has a dial to tune the level of conservatism. I
have no idea what the best default for general-purpose checking is,
though.
Ah. By conservative I meant a combination of unsound and sound
approximations designed to reduce the number of false positives and
have a high signal-to-noise ratio from the analyzer. In other words,
I prefer to accept false negatives in exchange for fewer false
positives, in order to extract the most useful results.
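The callee handling discussed in this exchange (assume an unknown
callee does little, but may write through its reference-typed
arguments) might be sketched, very loosely and with an invented
abstract-store representation, as:

```python
def apply_unknown_call(store, args_passed_by_ref):
    """store: dict mapping variable names to known abstract values.
    Variables passed by reference may have been written by the callee,
    so forget what we knew about them; every other binding survives
    the call. This is the unsound-but-useful end of the 'dial'."""
    return {
        var: ("unknown" if var in args_passed_by_ref else val)
        for var, val in store.items()
    }

store = {"x": 1, "err": 0, "buf": "zeroed"}
after = apply_unknown_call(store, args_passed_by_ref={"err"})
print(after)  # {'x': 1, 'err': 'unknown', 'buf': 'zeroed'}
```

A fully sound version would also have to invalidate globals and
anything reachable through escaped pointers, which is where the false
positives start piling up.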
_______________________________________________
dev-static-analysis mailing list
[email protected]
https://lists.mozilla.org/listinfo/dev-static-analysis