Do you know if they compare with Squeak? Because Levent did a lot of work in that area.
Stef

On Sep 13, 2010, at 2:55 PM, Adrian Lienhard wrote:

> This mail is quite long as I copied it from a conversation with Toon
> Verwaest. Toon and Camillo re-implemented hashed collections for
> Pinocchio (http://scg.unibe.ch/research/pinocchio). Their benchmarks
> have shown that the new implementation is significantly faster than the
> one in Pharo 1.1. The source code is available from
> www.squeaksource.com/p.html (they haven't set a license, but I'm pretty
> sure they'll release their code as MIT).
>
> I post this here to find somebody knowledgeable in the area to take a
> look at the code and figure out whether this implementation would be
> interesting for Pharo...
>
> Cheers,
> Adrian
>
>
> citing Toon:
>
> We ran some benchmarks on our data structures (run in, and compared
> with, standard Pharo 1.1). We just add/remove integers within a certain
> range, so the keys are always perfectly distributed, which makes the
> workload ideal for Pharo dictionaries.
>
> The numbers show how long a certain benchmark took on average over 20
> runs, and the standard deviation over those runs. A single benchmark
> run generally consists of many individual operations on the collection
> in question. For example:
>
> benchIncludes
>     1 to: set size * 2 do: [ :i |
>         set includes: i ]
>
> where the set has size 10000. Evaluating this once results in a single
> run. The number at the top of each benchmark group is the sum of all
> averages, with the averaged stdev.
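For readers who want to reproduce the setup outside Pharo, the aggregation scheme Toon describes (20 runs per benchmark, reporting mean and standard deviation) can be sketched in Python. The harness and names below are illustrative assumptions, not the original Pinocchio benchmark code:

```python
import statistics
import time

def run_benchmark(bench, runs=20):
    """Time `bench` over several runs and report (mean, stdev) in ms,
    mirroring how the numbers in the results below were aggregated.
    This harness is a sketch, not the original Smalltalk code."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        bench()
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(timings), statistics.stdev(timings)

# Example: an "includes" benchmark over a set of 10000 integers,
# probing 1..2*size so roughly half of the lookups miss.
s = set(range(1, 10001))

def bench_includes():
    for i in range(1, len(s) * 2 + 1):
        i in s

mean_ms, stdev_ms = run_benchmark(bench_includes)
```

A per-operation benchmark like RemoveKey or AtPut would be timed the same way, each with its own `bench` closure.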
>
> The results:
>
> PDictionary: 15.8 +/- 1.7
>     Do              0.472  +/- 0.079
>     RemoveKey       0.010  +/- 0.021
>     AtPut           0.947  +/- 0.02
>     AtIfAbsentPut   1.64   +/- 0.96
>     AtPutExisting   0.198  +/- 0.011
>     KeysAndValuesDo 0.502  +/- 0.09
>     IncludesKey     0.365  +/- 0.024
>     AtPutNew        4.2    +/- 1.3
>     Includes        7.40   +/- 0.33
>
> Dictionary: 138.2 +/- 4.7
>     Do              0.830  +/- 0.098
>     RemoveKey       10.292 +/- 0.077
>     AtPut           0.957  +/- 0.044
>     AtIfAbsentPut   1.52   +/- 0.95
>     AtPutExisting   0.203  +/- 0.011
>     KeysAndValuesDo 0.850  +/- 0.096
>     IncludesKey     80.25  +/- 0.25
>     AtPutNew        3.5    +/- 1.3
>     Includes        39.8   +/- 4.4
>
> PSet: 1.767 +/- 0.057
>     Remove          0.008  +/- 0.018
>     Add             0.845  +/- 0.039
>     AddExisting     0.310  +/- 0.021
>     AddNew          0.240  +/- 0.021
>     Includes        0.365  +/- 0.024
>
> Set: 70.9 +/- 1.2
>     Remove          7.737  +/- 0.051
>     Add             0.755  +/- 0.022
>     AddExisting     0.305  +/- 0.015
>     AddNew          2.473  +/- 0.03
>     Includes        59.6   +/- 1.2
>
> Obviously the very slight differences have to be taken with a grain of
> salt, but whenever something is more than 10x faster, it is quite
> unlikely to be due to random runtime variation that might change
> completely in another run.
>
> An important property of our data structures is that they don't
> degrade, since we don't have colliding hashes; but we didn't really
> test that yet (although it becomes obvious in some of the benchmark
> results, such as removing elements and testing for the presence of
> elements).
>
>> Is there anything special that one would need to consider when
>> replacing the old implementation in Pharo with yours?
>
> What doesn't happen at the moment is shrinking dictionaries after they
> have grown significantly. I wouldn't know at what point to do so
> either; nor do I think Dictionary does that.
>
> There is a parameter indicating how many elements can be added before
> it switches from a SmallDictionary version to a dictionary with
> buckets. This is set to 20 by default, and I have no idea if it makes
> sense to change it; I didn't really profile what the optimal number is.
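The SmallDictionary-to-buckets switch Toon describes (linear scan while small, bucketed hashing past 20 elements) can be sketched as follows. This is a minimal Python illustration of the idea only; the class and method names are made up and the real Pinocchio classes (PDictionary, SmallDictionary) differ in detail:

```python
SMALL_LIMIT = 20  # default threshold from the post; optimal value not profiled

class HybridDict:
    """Linear-scan over (key, value) pairs while small; switch to a
    power-of-two bucket array once SMALL_LIMIT is exceeded."""

    def __init__(self):
        self.pairs = []      # small mode: flat list of (key, value)
        self.buckets = None  # large mode: list of per-bucket pair lists

    def at_put(self, key, value):
        if self.buckets is None:
            for i, (k, _) in enumerate(self.pairs):
                if k == key:
                    self.pairs[i] = (key, value)  # overwrite existing key
                    return
            self.pairs.append((key, value))
            if len(self.pairs) > SMALL_LIMIT:
                self._grow()
        else:
            bucket = self.buckets[hash(key) & (len(self.buckets) - 1)]
            for i, (k, _) in enumerate(bucket):
                if k == key:
                    bucket[i] = (key, value)
                    return
            bucket.append((key, value))

    def at(self, key, absent=None):
        if self.buckets is None:
            for k, v in self.pairs:
                if k == key:
                    return v
            return absent
        bucket = self.buckets[hash(key) & (len(self.buckets) - 1)]
        for k, v in bucket:
            if k == key:
                return v
        return absent

    def _grow(self):
        # Switch to 32 buckets and rehash the existing pairs.
        self.buckets = [[] for _ in range(32)]
        pairs, self.pairs = self.pairs, []
        for k, v in pairs:
            self.buckets[hash(k) & 31].append((k, v))
```

Note that neither mode allocates an association object per entry, which is the "less garbage" point made below.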
> Maybe it makes sense to make that a constant in the code rather than an
> instance variable, to save a bit of space and time ...
>
> We have our own PHashedCollection subclassing from Collection, so the
> biggest part of the hierarchy is compatible with what Pharo has. While
> porting, the Pinocchio-specific stuff will have to be removed: mostly
> annotations related to Pinocchio primitives and exporting directives.
> This is all straightforward though.
>
> The current "ratio" of hashes to elements is set to 500%. So basically,
> when you have 32 buckets, the dictionary only grows once it contains
> 160 elements (or key-value pairs), i.e. 5 elements per bucket on
> average. This does not seem to make it slower, which is understandable
> since SmallDictionary is pretty fast for small numbers of elements; but
> this ratio might have to be tweaked depending on the kind of objects,
> and it is currently a parameter. The advantage of using 500% is that
> you produce A LOT less garbage.
>
> Oh, and our dictionaries also produce a lot less garbage because
> Pharo's dictionaries use an association object for each key-value pair.
> We don't, so we never generate garbage when you remove an object.
> Our sets, on the other hand, use a bit more memory because of the
> bucket scheme. Sets don't use associations, so we don't have an edge
> there :) only performance-wise in that case.
>
> The hashes are masked with the number of buckets - 1, since the number
> of buckets is always a power of 2. This implies that the lower bits of
> the hash are the most significant ones. This does not work together
> with the current identityHash / IdentityDictionary, since the 12-bit
> hash is shifted to the left by 18 (or so). This would make all the
> lower 18 bits 0, and all objects would thus end up in the first bucket.
>
> That is a very important thing to note. However, I also think it's a
> bit strange to just bitshift the identityHash by 18. You don't win
> anything significant with it ...
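The masking scheme and the identityHash problem described above are easy to demonstrate. A small Python sketch (the bucket count of 32 and the shift amount of 18 are taken from the post; the shift is quoted there as "18 or so"):

```python
N_BUCKETS = 32
MASK = N_BUCKETS - 1  # 0b11111: power-of-two bucket count keeps only low bits

def bucket_index(h):
    # Bucket selection as described: mask the hash with (buckets - 1).
    return h & MASK

# A hash whose low bits vary spreads evenly across all 32 buckets:
assert sorted({bucket_index(h) for h in range(64)}) == list(range(32))

# But a 12-bit identityHash shifted left by 18 has all low 18 bits zero,
# so masking with 31 sends every object to bucket 0:
shifted = [(h << 18) & MASK for h in range(1, 4096)]
assert set(shifted) == {0}
```

This is why the masking approach only works once the shifted identityHash is either unshifted or mixed back into the low bits.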
> you have 12-bit hashes, and there is no way to improve that (without
> changing the image format). Since we overfill our dictionaries,
> however, this works quite OK even with those bad hashes; and we only
> start wasting a bit of space after you have added 20480 elements to the
> dictionary (you have 4096 different hashes with 12 bits, * 5).
>
>
>
> _______________________________________________
> Pharo-project mailing list
> [email protected]
> http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-project
