Hi Ben,

I have a fairly long response to this one. I apologize in advance if it comes off as a ramble!
On Feb 8, 2009, at 6:57 AM, Ben Laurie wrote:

>> It also looks like you are modifying the RangeSet object in place.
>> In the static analyzer, we use functional data structures that
>> preserve old data so that each GRState object (which represents the
>> analysis state at a particular point within a path) never have their
>> values mutated after the GRState object is created. This allows us
>> to faithfully keep full path information. This allows us to report
>> full path diagnostics when reporting bugs but also (from a
>> correctness standpoint) allows path caching/merging to work at all.
>
> Wow. Doesn't it turn out to be horribly expensive?

Potentially. I'll elaborate on some key design points.

First off, the interface to the public methods of ConstraintManager is fairly straightforward. All data coming in is immutable, and all data going out is immutable. It essentially follows a functional programming design.

Second, regarding your comment that keeping all state values around is "horribly expensive": path-sensitive static analysis algorithms, where we reason precisely about individual paths, have worst-case exponential time complexity. This means that while constant factors remain important, real scalability comes from clever algorithms that reduce the analysis complexity for most practical inputs. We employ a bunch of tricks to keep the time complexity low (and bounded) while also keeping the space complexity bounded.

While there are many ways to do path-sensitive analysis, the approach we are using is essentially symbolic execution. Each path is "simulated" using abstract program values. The collection of these values, which represents bindings between variables and abstract data, is called a "state" (represented in the analyzer by the GRState object). As we simulate a path, we evaluate the effects of each expression and update the values in the state. Because we are reasoning about symbolic values, at branches it is often the case that both branch conditions are feasible, with the path splitting into two paths at the branch.

If GRState were an imperative data structure, such as std::set or std::map, we would essentially need to make a copy of GRState at each branch and then continue simulating along either path. We could do this simulation in a DFS fashion through the function's control-flow graph, gradually updating the state along a path until we hit (a) the end of the function or (b) a bug that for all intents and purposes ends simulation along that path. Once we hit the end of the path, we could simulate an alternate path resulting from a split at a branch earlier in the simulation.

The algorithm I just described has exponential time complexity, as the number of paths potentially doubles at each branch. One natural optimization is to stop execution of a path when you hit the same "state" at the same location within a function. This can be done because we are reasoning about symbolic values, and thus if we hit a location with the same state that we encountered along a different path we can just "cache out", because all further simulation would be the same as the other path. While intuitively simple, this optimization, when done right, can transform the exponential search algorithm into something that is quite reasonable in practice.
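To make the "cache out" idea a little more concrete, here is a tiny sketch of the kind of exploration loop I am describing. This is not the analyzer's actual code; the Location/State/Node types are made up for illustration, and I use value equality on a toy state just to show where the caching check goes:

  // Toy sketch of DFS path exploration with "cache out" (not the real
  // analyzer code); the "state" here is just a set of (variable, value) pairs.
  #include <set>
  #include <utility>
  #include <vector>

  typedef int Location;                          // a point in the CFG
  typedef std::set<std::pair<int, int> > State;  // toy symbolic bindings

  struct Node { Location Loc; State St; };

  void explore(const Node &Start) {
    std::set<std::pair<Location, State> > Visited; // (location, state) pairs seen
    std::vector<Node> Worklist;
    Worklist.push_back(Start);

    while (!Worklist.empty()) {
      Node N = Worklist.back();
      Worklist.pop_back();                         // DFS order

      // "Cache out": if we already reached this location with this exact
      // state along some other path, further simulation would be identical.
      if (!Visited.insert(std::make_pair(N.Loc, N.St)).second)
        continue;

      // ... evaluate the expression at N.Loc against N.St, producing one
      // successor node, or two when both sides of a branch are feasible,
      // and push them onto Worklist ...
    }
  }

In the real engine the "visited" structure is the explored state graph itself rather than a throwaway set, which is where the diagnostic benefits described below come from.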
One could employ this caching approach when using imperative state values, but it gets tricky: it means one needs efficient ways to compare a current state against the states of previously analyzed paths. This either requires keeping full state information around to do the comparison precisely, or settling for an approximate solution such as hashing state values and comparing states by their hashes. In the analyzer, we keep full state information around, allowing us to accurately compare paths against previously explored paths. This eliminates subtle bugs caused by the approximate comparison of states and also means we have a great deal of information around when deciding whether two paths are "the same." (Equality is only one way to compare states; one can also reason about how one state "subsumes" another.)

Another reason for keeping full state values around is error reporting. Using an imperative approach (where we mutate a state in place) causes us to forget the state of the simulation along a path. If we did a DFS traversal of the state graph, we could potentially keep all the states along a particular path around (e.g., by making copies of the state at each subsequent expression) so that if we encounter a bug we can present the full path information. Making copies of states isn't very efficient, though, and if you only record the diffs you are essentially using a functional data structure (more on this later). If you want to find the shortest path to a bug (which can be very helpful when presenting a bug to a user), you ultimately need to reason about the data along multiple paths. This can become expensive if one doesn't employ the right data structures, ones that exploit the commonality between states.

The way Clang's static analysis engine does path exploration is by using functional data structures for its state. Each GRState object, once created, is immutable. When you do the processing to explore a new path, you never trounce the data you have accumulated for paths you have already explored. From a correctness perspective, this makes it much easier to write complex analyses: you are never at risk of invalidating iterators or scribbling on state you meant to keep. Everything is reasoned about in terms of inputs and outputs and rarely, if ever, side effects. Further, the functional data structures we employ share most of their data, so you are really only recording the diffs between states (which is very efficient).

Note that we only keep full path information on a per-function basis. When doing inter-procedural analysis, we will need to employ something like a summary-based approach (where we build up pre- and post-condition models for functions) that allows us to reason about inter-procedural paths.

To summarize, here are a few big reasons why keeping the full simulation state (for a function) is useful:

(a) It is *extremely* useful for diagnosing bugs, since we can render full information concerning a path to the user. With the full state graph, we can tell users all paths where the same bug (e.g., a leak) occurs, or use that information to determine how likely a candidate bug is to be real by examining the nature of the paths it occurs on. More generally, information thrown away is very difficult to recover.

(b) Functional data structures use a lot of data sharing, so the incremental cost of saving this extra information is pretty minimal and very efficient (i.e., most state stays the same between evaluated expressions). There is a small sketch of this just after this list.

(c) Meta-level checkers can be written on top of the explored state space graph.

(d) We can do some very elaborate caching/pruning of paths by explicitly recording the explored state space graph.

(e) The invariants are simple: don't scribble on old data. This makes reasoning about the correctness of the analyzer more "inductive" and compositional. I cannot stress how important this is, as the nature of path-sensitive analysis quickly gets complicated enough that without such compositional reasoning there is little hope of getting things right.
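As a toy illustration of points (b) and (e) above, and emphatically not the analyzer's actual data structures, here is about the simplest "functional" state I can write down, where creating a new state never touches an old one and almost everything is shared:

  // An immutable "state" as a persistent list of bindings. Adding a binding
  // allocates exactly one node (the "diff") and shares the entire old state
  // as its tail; nothing is ever mutated in place.
  #include <iostream>
  #include <memory>
  #include <string>

  struct Binding {
    std::string Var;
    int Value;
    std::shared_ptr<const Binding> Prev;   // the old state, shared intact
  };
  typedef std::shared_ptr<const Binding> StateRef;

  StateRef bind(StateRef Old, const std::string &Var, int Value) {
    return std::make_shared<const Binding>(Binding{Var, Value, Old});
  }

  int lookup(StateRef S, const std::string &Var) {
    for (; S; S = S->Prev)
      if (S->Var == Var)
        return S->Value;
    return 0;                              // toy default for "unbound"
  }

  int main() {
    StateRef S1 = bind(StateRef(), "x", 1);
    StateRef S2 = bind(S1, "y", 2);        // S1 is untouched and still valid
    std::cout << lookup(S1, "y") << " " << lookup(S2, "y") << "\n"; // prints "0 2"
  }

ImmutableMap and ImmutableSet (described below) do the same thing with balanced trees instead of a list, so lookups and updates stay logarithmic, but the "record only the diff and share the rest" property is the same.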
Prior to working on the static analysis component of Clang, I worked on a few different static analysis systems, many based on the approach of symbolic execution. I generally found that implementing these tools in imperative languages like C often forced an implementation based on imperative thinking, where a bunch of information was always thrown away because it was deemed "too hard" or "too expensive" to keep around. Later, when I implemented some of the same concepts using functional languages (e.g., OCaml), I found that because most of the symbolic information between states is usually the same, (a) it was cheap and easy to keep full state information around, and (b) the implementation was much simpler and more resilient because all data was immutable.

When I started work on the analyzer in Clang, I wrote the ImmutableMap and ImmutableSet data structures (which were inspired by the Map and Set modules in OCaml) so that we could harness these concepts in an analyzer written in C++. While I have not done extensive performance evaluation, so far the approach appears to work well, and the core of the analyzer was brought up in a fairly short time.

The ImmutableMap and ImmutableSet data structures use a fast pool allocator to allocate all shared state (i.e., the maps and sets) created by a single "factory" object. Once one is done analyzing a function, all of that memory is reclaimed immediately without running any destructors, calling 'delete' or 'free' on each object, etc.

The ImmutableSet and ImmutableMap classes also provide referential canonicalization. For example, creating an ImmutableMap by first inserting "3" and then "4" leaves me with the same map (as in the two can be compared using a single pointer comparison to see if they are the same) as inserting "4" and then "3", with any number of removals or insertions in between.

Finally, the use of functional data structures and functional reasoning in the analyzer also opens up some long-term possibilities. The analyzer can potentially be highly parallelized, with different paths within a function explored simultaneously on different threads. Because data is immutable, one can worry substantially less about one thread trouncing the data and invariants of another thread.

I apologize if this was a bit of a ramble, and I would be happy to talk about more specific points.

Ted
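P.S. In case it is useful, here is roughly what using these classes looks like. This is a hand-written sketch against llvm/ADT/ImmutableMap.h with the method names written from memory, so treat the exact spellings as approximate:

  #include "llvm/ADT/ImmutableMap.h"
  #include <cassert>

  void example() {
    // All tree nodes for maps made by this factory live in its pool and are
    // reclaimed together when the factory goes away.
    llvm::ImmutableMap<int, int>::Factory F;

    llvm::ImmutableMap<int, int> Empty = F.getEmptyMap();
    llvm::ImmutableMap<int, int> A = F.add(F.add(Empty, 3, 0), 4, 0); // 3 then 4
    llvm::ImmutableMap<int, int> B = F.add(F.add(Empty, 4, 0), 3, 0); // 4 then 3

    // Empty, A, and B all remain valid; nothing was mutated in place.
    // Because the factory canonicalizes, A and B end up as the same
    // underlying tree, which is what lets whole states be compared with a
    // pointer comparison.
    assert(A == B);
  }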
