[Chicken-users] ANN: persistent-hash-map 0.0.1

Moritz Heidkamp Sun, 10 Feb 2013 15:32:00 -0800

Fellow Chickeneers,

I have finally gotten around to finishing my long standing plan of
porting the useful persistent hash map data structure from Clojure to
Chicken. Thankfully, ClojureScript grew its own implementation of it in
the meantime which written (mostly) in ClojureScript itself so porting
was way easier than with the original Java implementation. But first,
let me explain why this data structure is so nice. In a nutshell it
allows you to program in a functional style like you are used to with
lists in Scheme. As you know, SRFI 69 style hash tables don't lend
themselves very well to this style because they are mutable. Thus we
Schemers often resort to alists when we need associative data structures
which is fine in many cases but has (at least) two drawbacks:


1) Lookup, deletion, and insertion are fairly expensive operations of
   O(n) complexity. This can be circumvented by converting the alist
   into a hash table once it is "finished" but this requires a fair
   amount of copying. Also, from this point on you lose the (de facto)
   immutability of the alist which allowed you to program in a
   functional way. So now you have to deal with all the challenges of
   mutable state again. In conclusion you can't have your cake and eat
   it, too.

2) Alists are not of a distinct type so you can't for example dispatch
   or specialize on them. The closest approximation of a type predicate
   is of O(n) complexity, too, as it would require to walk the whole
   list to check whether all elements are indeed pairs. This is also an
   issue with serialization and deserialization as their external
   representation is indistinguishable from a list of pairs.

In practice these drawbacks are usually not much of an issue but it
strikes me as a kludgy work-aroundish state of affairs. I want to use a
proper data structure when dealing with associative data without
compromising programming convenience! And this is where persistent hash
maps come into play. In essence they work like alists without the
drawbacks mentioned above:

1) Lookup complexity is practically constant and contrived benchmarks
   show that performance is comparable to SRFI 69 hash tables. Insertion
   and deletion is around O(log(n)). As an optimization the
   implementation allows to turn a persistent map into a so called
   transient map which can be mutated. This allows for efficient batch
   modification. Later it can be turned back into a persistent map
   without any copying.

2) Maps are implemented as a distinct type. What's currently missing is
   an external representation but this can easily be added, e.g. by
   means of a SRFI 10 reader extension.

In a way, persistent hash maps give you the best of alists and hash
tables combined which is why I highly recommend you to check them
out. The egg's documentation can be found at the usual place
(http://wiki.call-cc.org/eggref/4/persistent-hash-map), the source code
lives at https://bitbucket.org/DerGuteMoritz/persistent-hash-map.  There
are almost certainly still some bugs in the implementation, especially
in edge cases. I'm happy to receive bug reports if you happen to run
into those!

With regards to the implementation I would like share a few
tidbits. Since I wanted to achieve decently performing code I took this
project as an opportunity to give Felix' programming for performance
guide a try (http://wiki.call-cc.org/programming-for-performance). It
was really helpful in reaching this goal, so thanks Felix for taking the
time to write it up! My findings are as follows:

The original ClojureScript (and Clojure) implementation uses types to
represent different kinds of nodes. For those types several generic
functions ("protocol methods" in Clojure lingo, see
http://clojure.org/protocols) are defined. I wanted to keep this
structure in my port so as to not diverge all too far from the original
implementation. That way I hope to be able to integrate upstream patches
in the future more swiftly. To that end I tried coops and fast-generic
(as recommended in the guide) and found that coops yields slightly
smaller compiled code (about 210K versus 244K with fast-generic) while
fast-generic's dispatch is about an order of magnitude faster. The catch
is that fast-generic types are only visible inside the compiled module
but in this case this doesn't matter because the generic functions are
completely internal to the implementation so I decided to go with
fast-generic. The types themselves are represented as Scheme records. I
tried both the typed-records and the record-variants extensions. It
turned out that typed-records yielded slightly better performance than
record-variants with all declarations enabled (unsafe, unchecked,
inline). The typed-records may have additionally benefited from
pervasive type declarations on all procedures defined in the module (a
.types file is provided by the egg, too). With these things I managed to
achieve quite good performance characteristics without resorting to
direct calls to any ##sys# or ##core# functions. The only possibly not
quite portable bit is a foreign call to __builtin_popcount which is
supported by GCC since version 3.4 and by LLVM since version 1.5. Some
architectures support popcount as a native instruction (e.g. Intel CPUs
since Core i7) which can be enabled by passing -mpopcnt or, for example,
-march=corei7 to the compiler via CSC_OPTIONS. If you run into problems
with that, please let me know. Also, if you have ideas how to ooze out
some more performance I'm all ears. The next step on my plan is to try
specializations. We'll see how that goes.

Cheers all around
Moritz

_______________________________________________
Chicken-users mailing list
[email protected]
https://lists.nongnu.org/mailman/listinfo/chicken-users

[Chicken-users] ANN: persistent-hash-map 0.0.1

Reply via email to