On Feb 9, 2009, at 10:50 PM, Zhongxing Xu wrote: > Hi Ted, > > What approach did you use to implement the immutable data > structures? Node copying? Fat node? I am trying to read that code. >
Hi Zhongxing, I'm assuming you are referring to ImmutableSet and ImmutableMap. ImmutableList has a really simple implementation, with new lists created by 'consing' them with old lists. I'll be honest that I'm not certain what you mean by the term "node copying" and "fat node"; I can extrapolate, but they don't immediately mean something definitive to me. Both ImmutableSet and ImmutableMap are built on top a purely functional balanced binary tree. Chris Okasaki's excellent book on "Purely Functional Data Structures" has a good description of the algorithms and methods for implementing a functional red-black tree. Instead I used a functional AVL tree, with the algorithm modeled directly on the one used for Objective Caml's "Map" module. You can basically look at the Map.ml file in a recent OCaml distribution and see a much more concise (and readable) implementation. Each node in the functional AVL tree is immutable. It consists of pointers to left and right children as well as the 'data' associated with the node. Once a node is created it is never changed; it's children stay fixed as well as its data. Children do not have back- pointers to their parents. Indeed this doesn't make sense, as a node can be shared between multiple trees as different parents share common children. Because there are no back pointers, iterators over a functional AVL tree must keep their own "recursion stack" for going down and up the edges of the tree. For the ImmutableMap the data is a key-value pair and for ImmutableSet it is a single object. All data is basically considered to be POD, making ImmutableSet/ImmutableMap best used with things like pointers or small values. Much of the template magic in the AVL tree implementation has to do with handling the comparisons between nodes (where in the ImmutableMap case we want to compare keys and the ImmutableSet case we want to compare the data itself), so the implementation of the AVL tree gets muddled based on the fact it is used by multiple clients. The most important operations for the functional AVL tree is adding and removing entries. Both cases are written using recursive algorithms, which recursively dive into the tree to find the spot where the data should be placed or removed. In the case where the data doesn't exist in the tree, a "remove" operation just returns the original tree. For the "add" operation, a set of new trees are created, adding pointers to existing subtrees as the recursion works its way back up to the root. At the top a new root is created that shares some of the existing data of the original tree as well as a pointer to the new subtree containing the new data. Starting with a balanced tree, without rebalancing an "add" or "removal" results in log(N) new nodes being created, where N is the number of nodes in the original tree and log(N) is its height. Essentially it creates a new "path" from the new root to the subtree containing the data element. With rebalancing, additional nodes may be created. Rebalancing is done recursively, with the bottom being rebalanced first. The values of nodes are copied (which I guess would be considered "node copying") to new nodes to do the rebalancing. Thus new subtrees are created in the process to create a new aggregate tree that satisfies the properties of an AVL tree. Note that the recursive nature of the rebalancing operation can cause some nodes to become completely unused (i.e., they can be freshly created and then have no parents point to them because the rebalancing at a higher level in the rebalancing recursion causes them to get discarded). Because OCaml uses GC, their map implementation can create nodes that are lost; these are reclaimed by GC. We instead allocate nodes using a BumpPtrAllocator (a pool allocator), which only reclaims memory it allocates all at once. To avoid just leaking nodes like made, there is a "mutable" flag in each AVL node which indicates whether or not that node has "escaped" outside the confines of the current add/remove operation. When the 'Add'/'Remove' operation returns, all nodes are marked immutable, but during the add/remove operation orphaned nodes that are still marked mutable are reused during the rebalancing. As an added level of complexity, each node contains a hash digest which represents the hash of all its data and the data in its subtrees. This is used to unique trees within a FoldingSet. The result is that ImmutableMap::Factory and ImmutableSet::Factory will return the same exact map/set objects (with referential equality) for the same logical maps and sets. This is used extensively in the static analyzer to avoid profiling entire sets and maps over and over. Note that there are plenty of places that memory gets "leaked". Nodes can get leaked when nothing points to them anymore, which can happen during rebalancing or when using the digests to fetch the "canonical" tree for a given collection of data. For example, if a set of new subtrees get created while doing add and remove operations and (after several of these operations) we get a tree with the same digest as an existing tree those other trees will get orphaned. This isn't so terrible since all of this memory will get reclaimed when the BumpPtrAllocator's destructor gets called, but it is something to consider if we find memory pressure to be an issue. I'm sure that there are many details that I am leaving out; feel free to follow up with specific questions! Ted _______________________________________________ cfe-commits mailing list [email protected] http://lists.cs.uiuc.edu/mailman/listinfo/cfe-commits
