On May 26, 2009, at 11:43 AM, David Mandelin wrote:
Ted Kremenek wrote:
C++ support in Clang is rapidly progressing,
Cool. Is there a page with notes on the design? I'm curious what
approach you are using. Elsa's GLR design seems like a good
approach, but it doesn't cover all the latest complicated template
features, and the latest problems I saw seemed difficult to solve in
that design. (It was long enough ago that I don't remember the exact
problem.)
Hi David,
Clang uses a recursive descent parser design, and this applies to the
C++ portion as well. We have found that the design works really well,
and leads to a fairly clean implementation that is easy to
understand and extend.
As you pointed out, the Clang documentation on the parser is lacking
(and should be improved). The design itself is simple. The parser
handles the grammar of the language, and calls back into an abstract
interface (called 'Actions') that is responsible for building up the
ASTs, performing the type checking, etc. (the implementation of this
is called 'Sema'). The abstract interface allows the parsing logic to
be (relatively) simple, and allows one to swap in a different
implementation of the abstract interface if one didn't want to do full
type-checking, etc.
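A minimal sketch of that parser/Actions split, using a toy expression
grammar; all of the names here (Actions, BuildAST, act_on_*) are
invented for illustration and are not Clang's actual C++ API:

```python
class Actions:
    """Abstract callback interface the parser drives (the 'Actions' role)."""
    def act_on_number(self, value):
        raise NotImplementedError
    def act_on_binary_op(self, op, lhs, rhs):
        raise NotImplementedError

class BuildAST(Actions):
    """One implementation (the 'Sema'-like role): build AST nodes.
    A different implementation could skip tree-building entirely."""
    def act_on_number(self, value):
        return ("num", value)
    def act_on_binary_op(self, op, lhs, rhs):
        return ("binop", op, lhs, rhs)

class Parser:
    """Recursive descent over a toy grammar: expr := num (('+'|'-') num)*
    The parser knows only the grammar; all semantic work is delegated."""
    def __init__(self, tokens, actions):
        self.tokens = tokens
        self.pos = 0
        self.actions = actions

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def advance(self):
        tok = self.tokens[self.pos]
        self.pos += 1
        return tok

    def parse_expr(self):
        node = self.actions.act_on_number(self.advance())
        while self.peek() in ("+", "-"):
            op = self.advance()
            rhs = self.actions.act_on_number(self.advance())
            node = self.actions.act_on_binary_op(op, node, rhs)
        return node

tree = Parser([1, "+", 2, "-", 3], BuildAST()).parse_expr()
print(tree)  # ('binop', '-', ('binop', '+', ('num', 1), ('num', 2)), ('num', 3))
```

Swapping in an Actions implementation whose callbacks do nothing gives
a syntax-only pass for free, which is the point of the abstraction.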
but because the Clang static analyzer performs static analysis at
the source level, simply having C++ parsing support does not imply
immediate support in the analyzer. Bringing that feature up will
likely require active participation from the open source community.
What about running over a language-independent IR instead? That's
the approach we've used in Treehydra and it seems like it would be
even better because I think you have much cleaner IRs in LLVM.
We definitely thought about this, and we made a deliberate choice to
analyze source code instead of LLVM IR.
Analyzing at the IR level certainly has its benefits, as complicated
language features are lowered to primitive operations, allowing one to
focus on analyzing those primitives. This is both a significant
blessing and an onerous curse, and it all comes down to the tradeoffs
involved.
Our choice of analyzing source code came down to several key factors
(in no particular order):
a) Understanding high-level interfaces.
Many complicated language features reduce to a large number of LLVM IR
instructions, but ultimately we are interested in the macroscopic
actions (e.g., the invocation of a method, which could span many LLVM
IR instructions). Many bugs have to do with reasoning about
interfaces rather than the specific low-level semantics (which are
also important, but can be approximated), and doing this at the source
level can be much easier.
Further, source often captures the programmer's intent in a far more
recognizable way than a lowered representation. Many bugs are
somewhere between the poles of syntax and semantics. For example,
sometimes a potential bug isn't really a bug if it occurs within
a macro, or the code was written in a specific way indicating
that the user didn't care about the "bug". Consider a
dead store:
err = foo();
versus
if (err = foo())
Suppose 'err' is never read after the assignment. According to the
semantics of C, the variable 'err' isn't actually read in either case,
but the first case is more likely to be a programming mistake than the
second (the second can also be an error if they meant to write
'err == foo()', but that is conceptually a different kind of bug).
Certainly distinguishing between these cases can be done at the LLVM
IR level, but it is a little more tricky to do. There are also cases
such as 'i++' and 'i = i + 1' that are potentially indistinguishable
at the LLVM IR level, but could be relevant when determining the
chance that a real bug occurred.
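To make that concrete, here is a toy, hypothetical checker over a
made-up statement representation (not Clang's AST or checker API) that
reports the first form but stays quiet about the second:

```python
def dead_store_warnings(stmts):
    """stmts: list of ('assign', var)    -- err = foo();
                      ('if_assign', var) -- if (err = foo()) ...
                      ('use', var)       -- a later read of var
    Flag stores whose value is never read afterwards, but only for the
    plain-statement form: in 'if (err = foo())' the branch condition
    already consumes the value, so treat that store as intentional."""
    warnings = []
    for i, (kind, var) in enumerate(stmts):
        if kind not in ("assign", "if_assign"):
            continue
        read_later = any(k == "use" and v == var for k, v in stmts[i + 1:])
        if read_later:
            continue
        if kind == "assign":
            warnings.append(f"value stored to '{var}' is never read")
        # 'if_assign': same dead store semantically, but suppressed.
    return warnings

print(dead_store_warnings([("assign", "err")]))     # warns
print(dead_store_warnings([("if_assign", "err")]))  # no warning
```

The suppression decision here is purely syntactic, which is exactly
the kind of intent-reading that is awkward to recover from lowered IR.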
In other words, precisely analyzing semantics isn't always enough, and
understanding the intent of the programmer, which often boils down to
looking at syntax, is often very useful when determining whether or
not a real error is present.
b) Language types
Often high-level language types are essentially completely erased at
the LLVM IR level, being lowered to structs, etc. The high-level type
system is especially useful when one is analyzing a language with a
rich OO type system such as Objective-C or C++. This is useful both
for reasoning about high-level interfaces (my previous point) and for
thinking about virtual function calls, etc.
c) Great diagnostics
Clang's preprocessor and parser are integrated, meaning the ASTs have
full information regarding macros, pragmas, the #include stack, and so
on. Clang also has full source range information, with locations for
individual '{' tokens, etc. This allows the analyzer to report
excellent diagnostics with full column and line information, source
ranges, etc. Such rich location information also allows us to
potentially tie into code refactoring operations that could be used to
either fix bugs or to transform the code in some other useful way.
While it is possible to tie much of the LLVM IR back to the original
source, this isn't always trivial, as the lowering can be
architecture dependent. Moreover, because some language-level
features (such as an Objective-C method invocation) lower to many
LLVM IR instructions, performing the back-mapping in many cases can be
non-trivial and error prone.
d) Sometimes lying gets you closer to the truth
Precisely handling various operations such as sign-extension, bit
masking, etc., when reasoning about symbolic values can be
challenging. Instead of being perfect, I think it is easier to
approximate the truth when analyzing source code than when analyzing
LLVM IR (since operations can be broken up over many instructions).
At a high-level representation, it is often easier to understand what
is important and what is not when it comes to precisely analyzing a
fragment of code. Sometimes not handling certain details just doesn't
really matter, and in certain cases where clang's analyzer currently
doesn't handle something well we can often recover path-sensitivity by
making up new symbolic values, etc., when the result of an operation
is "too complicated" to reason about. I think this kind of cheating
is often easier to do at a high level than when using a lowered
representation, but opinions may differ.
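A very loose sketch of that kind of recovery (not Clang's actual
machinery; the names and representation are invented): evaluate what
the analyzer can model precisely, and conjure a fresh symbolic value
whenever an operation is too complicated, so path exploration can
continue instead of giving up.

```python
import itertools

_fresh = itertools.count()

def fresh_symbol():
    """Mint a new, unconstrained symbolic value."""
    return f"$sym{next(_fresh)}"

def eval_binop(op, lhs, rhs):
    """Return a concrete int when both operands are concrete and the
    operator is one we model; otherwise approximate the result with a
    fresh symbol rather than halting the analysis."""
    simple = {"+": lambda a, b: a + b, "-": lambda a, b: a - b}
    if isinstance(lhs, int) and isinstance(rhs, int) and op in simple:
        return simple[op](lhs, rhs)
    # Too complicated (symbolic operand, or an unmodeled op like '>>'):
    # lie a little, and keep path-sensitivity alive downstream.
    return fresh_symbol()

print(eval_binop("+", 2, 3))        # 5
print(eval_binop(">>", 2, 3))       # a fresh symbol
print(eval_binop("+", "$sym0", 3))  # another fresh symbol
```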
Of course analyzing source code can be hard. One has to reason about
arbitrary casts, short-circuit operations, etc., that are all
simplified when lowered to the LLVM IR level. However, I would argue
that once the core logic to handle such things is implemented, the
hard work in implementing the analyzer is elsewhere (e.g., reasoning
about symbolic values and abstracted program memory, etc.).
The clang analyzer currently does mostly local analysis,
essentially operating under the conservative approximation that the
implementations of called functions/methods are unavailable
for analysis. The plan is to add more global analysis over time,
hopefully over the next year (time permitting).
We generally do unsound analysis instead (assuming the callees do
nothing, or do a little bit we can guess at, like writing to
reference-typed arguments) to cut down on false positives. Maybe the
best possible tool has a dial to tune the level of conservatism. I
have no idea what the best default for general-purpose checking is,
though.
Ah. By conservative I meant a combination of unsound and sound
approximations designed to reduce the number of false positives and
have a high signal-to-noise ratio from the analyzer. In other words,
I prefer to accept false negatives in exchange for fewer false
positives, in order to extract the most useful results.
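The callee handling discussed in this exchange (assume an unknown
callee does little, but may write through its reference-typed
arguments) might be sketched, very loosely and with an invented
abstract-store representation, as:

```python
def apply_unknown_call(store, args_passed_by_ref):
    """store: dict mapping variable names to known abstract values.
    Variables passed by reference may have been written by the callee,
    so forget what we knew about them; every other binding survives
    the call. This is the unsound-but-useful end of the 'dial'."""
    return {
        var: ("unknown" if var in args_passed_by_ref else val)
        for var, val in store.items()
    }

store = {"x": 1, "err": 0, "buf": "zeroed"}
after = apply_unknown_call(store, args_passed_by_ref={"err"})
print(after)  # {'x': 1, 'err': 'unknown', 'buf': 'zeroed'}
```

A fully sound version would also have to invalidate globals and
anything reachable through escaped pointers, which is where the false
positives start piling up.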
_______________________________________________
dev-static-analysis mailing list
[email protected]
https://lists.mozilla.org/listinfo/dev-static-analysis