Re: [Haskell-cafe] Data.IntMap union complexity
On 2/24/12 3:40 AM, Christoph Breitkopf wrote:
> On Fri, Feb 24, 2012 at 4:48 AM, wren ng thornton w...@freegeek.org wrote:
>> When the two maps are of vastly different sizes, O(min(m,n)) is a more
>> intuitive way to think about it. Since you only have to look at the
>> places where the spines clash, that will be on the order of the smaller
>> map.
>
> Folding insert might still be a win if one of the maps is very much
> smaller than the other, but since size is O(n) for Data.IntMap, there's
> no way to find out if that's the case.

It's possible (due to the constant factors I keep mentioning), though I
wouldn't expect it to be. The reasoning is as follows:

In the smaller map every key will have some shared prefix with other keys
(excepting degenerate cases like the empty map and singleton maps). As such,
calling insert with two keys that share a prefix means that you'll traverse
the larger map's spine for that prefix twice, whereas the merge-based
implementation traverses it only once.

Moreover, when performing repeated insertions we have to reconstruct the
structure of the smaller map, which can produce excess garbage compared to
the merge-based implementation. For example, consider the case where we have
keys K and L in the smaller map which diverge at some prefix P. If P does not
lie in the spine of the larger map, then we will have to merge P with some
other prefix Q in order to produce a Bin. Let's say that P and Q diverge at R
(thus there is a Bin at R, with children P and Q). After inserting K we now
have a map where there is a spine from R to K. Now, when we insert L, since
we know that P lies on the spine between R and K, we'll have to merge P with
K to produce another Bin. The spine from R to K is now garbage --- but it
would never have been allocated in the merge-based implementation, since K
and L would have been inserted simultaneously.

If it's really a concern, then I still say the best approach is to just
benchmark the two options; the API gives you the tools you need.

As for the O(n) size, you can always define your own data structure which
memoizes the size; e.g.,

    import Data.IntMap (IntMap)
    import qualified Data.IntMap as IM

    -- N.B., the size field is lazy and therefore only computed on
    -- demand, yet it is shared and so only costs O(n) the first
    -- time it's accessed since the map was updated.
    data SizedIntMap a = SIM Int !(IntMap a)

    size (SIM s _) = s

    insert k v (SIM _ m) = SIM (IM.size m') m'
        where m' = IM.insert k v m

    ...

It's a lot of boilerplate, and it'd be nice if that boilerplate were provided
once and for all by containers (as Data.IntMap.Sized, or the like), but it's
simple enough to do.

--
Live well,
~wren
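For reference, a minimal Criterion sketch of that comparison might look like
the following. The map sizes and the names big and small are made up for
illustration, and the fold-with-insert benchmark assumes Data.IntMap's
foldrWithKey; this is a sketch of the comparison, not a tuned benchmark.

    import Criterion.Main (bench, defaultMain, whnf)
    import qualified Data.IntMap as IM

    main :: IO ()
    main = defaultMain
        [ -- merge-based union of a small map into a large one
          bench "union"       $ whnf (IM.union small) big
          -- fold-with-insert: insert every binding of the small map into
          -- the large one; left bias matches IM.union small big
        , bench "fold-insert" $ whnf (IM.foldrWithKey IM.insert big) small
        ]
      where
        big   = IM.fromList [ (k, k) | k <- [1 .. 100000 :: Int] ]
        small = IM.fromList [ (k, k) | k <- [1 .. 20     :: Int] ]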
Re: [Haskell-cafe] Data.IntMap union complexity
Evan Laforge qdun...@gmail.com wrote:
> I've wondered if it's faster to insert many keys by successive insertion,
> or by building another map and then unioning, and likewise with deletion.
> I eventually decided on successive, thinking a log n build + merge is
> going to be slower than m*log n for successive insertion. So I wound up
> with:

If you don't already have the keys in a map, I don't think you gain much by
building a map and then merging rather than just inserting them directly. It
will produce extra garbage (unless you have some other use for the map you're
building), and you have to make the same spine traversals while building the
new map as you would make while inserting into the larger map (and then you
still have to traverse the larger map during merging).

But again, thanks to Criterion, benchmarking is cheap and easy. No need to
believe in the folklore or opinions of others :)

Though, if the set of keys to be added is very large, then the build+merge
approach would allow you to parallelize the building of the map: split the
key set in half and build a map for each half, recursing as necessary; then
either merge the new maps together before merging with the target map, or
merge them with the target map serially.

--
Live well,
~wren
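A rough sketch of that parallel build, using par and pseq from the parallel
package; buildAndMerge is an illustrative name, the two-way split is the
simplest possible version of the recursive scheme described above, and it
needs to be compiled with -threaded to actually run in parallel.

    import Control.Parallel (par, pseq)
    import qualified Data.IntMap as IM

    -- Split the key/value list in half, spark one half for parallel
    -- evaluation (IntMap is spine-strict, so forcing to WHNF builds the
    -- whole half), then merge both halves into the target map.
    buildAndMerge :: [(Int, a)] -> IM.IntMap a -> IM.IntMap a
    buildAndMerge kvs target =
        left `par` (right `pseq` ((left `IM.union` right) `IM.union` target))
      where
        (ls, rs) = splitAt (length kvs `div` 2) kvs
        left     = IM.fromList ls
        right    = IM.fromList rs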
Re: [Haskell-cafe] Data.IntMap union complexity
2012/2/24 Clark Gaebel cgae...@csclub.uwaterloo.ca:
> Since insertion [2] is O(min(n, W)) [ where W is the number of bits in an
> Int ], wouldn't it be more efficient to just fold 'insert' over one of the
> lists for a complexity of O(m*min(n, W))? This would degrade into O(m) in
> the worst case, as opposed to the current O(n+m).
>
> Or am I just crazy?

This way you will create much more garbage, I think. The union of two
completely disjoint maps can keep parts of both maps intact, whereas folding
insert has to rebuild the path to every inserted key.
Re: [Haskell-cafe] Data.IntMap union complexity
Folding insert might still be a win if one of the maps is very much smaller
than the other, but since size is O(n) for Data.IntMap, there's no way to
find out if that's the case.

- chris

On Fri, Feb 24, 2012 at 4:48 AM, wren ng thornton w...@freegeek.org wrote:
> When the two maps are of vastly different sizes, O(min(m,n)) is a more
> intuitive way to think about it. Since you only have to look at the places
> where the spines clash, that will be on the order of the smaller map.
Re: [Haskell-cafe] Data.IntMap union complexity
I've wondered if it's faster to insert many keys by successive insertion, or
by building another map and then unioning, and likewise with deletion. I
eventually decided on successive, thinking a log n build + merge is going to
be slower than m*log n for successive insertion. So I wound up with:

    insert_list :: (Ord k) => [(k, v)] -> Map.Map k v -> Map.Map k v
    insert_list kvs m = List.foldl' (\m (k, v) -> Map.insert k v m) m kvs

    delete_keys :: (Ord k) => [k] -> Map.Map k a -> Map.Map k a
    delete_keys keys fm = Map.difference fm (Map.fromList [(k, ()) | k <- keys])

Oops, I guess I changed my mind by the time I got to writing 'delete_keys' :)
But if the list of things to insert or delete is already sorted, then you
could take advantage of fromAscList, and maybe a union would save some time?

On Fri, Feb 24, 2012 at 12:38 AM, Serguey Zefirov sergu...@gmail.com wrote:
> 2012/2/24 Clark Gaebel cgae...@csclub.uwaterloo.ca:
>> Since insertion [2] is O(min(n, W)) [ where W is the number of bits in an
>> Int ], wouldn't it be more efficient to just fold 'insert' over one of
>> the lists for a complexity of O(m*min(n, W))? This would degrade into
>> O(m) in the worst case, as opposed to the current O(n+m).
>>
>> Or am I just crazy?
>
> This way you will create much more garbage, I think. The union of two
> completely disjoint maps can hold parts of them intact.
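For the already-sorted case mentioned above, a sketch might look like the
following. The names insert_list_sorted and delete_keys_sorted are made up,
and both assume the input lists really are in ascending key order, which is
the precondition of Data.Map's fromAscList.

    import qualified Data.Map as Map

    -- Assumes kvs is sorted in ascending key order.
    -- Left-biased union: the new bindings win, as with repeated insert.
    insert_list_sorted :: (Ord k) => [(k, v)] -> Map.Map k v -> Map.Map k v
    insert_list_sorted kvs m = Map.union (Map.fromAscList kvs) m

    -- Assumes keys is sorted in ascending order.
    delete_keys_sorted :: (Ord k) => [k] -> Map.Map k a -> Map.Map k a
    delete_keys_sorted keys fm =
        Map.difference fm (Map.fromAscList [(k, ()) | k <- keys])

Whether this actually beats the foldl' versions for small batches is exactly
the kind of question Criterion can settle.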
[Haskell-cafe] Data.IntMap union complexity
Looking at IntMap's left-biased 'union' function [1], I noticed that the
complexity is O(n+m), where n is the size of the left map and m is the size
of the right map.

Since insertion [2] is O(min(n, W)) [ where W is the number of bits in an
Int ], wouldn't it be more efficient to just fold 'insert' over one of the
lists for a complexity of O(m*min(n, W))? This would degrade into O(m) in the
worst case, as opposed to the current O(n+m).

Or am I just crazy?

Regards,
  - clark

[1] http://hackage.haskell.org/packages/archive/containers/0.4.2.1/doc/html/Data-IntMap.html#v:union
[2] http://hackage.haskell.org/packages/archive/containers/0.4.2.1/doc/html/Data-IntMap.html#v:insert
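Concretely, the fold-with-insert union being proposed might look like the
sketch below. The name unionViaInsert is illustrative; it keeps the same left
bias as Data.IntMap's union by letting the left map's bindings overwrite the
right map's.

    import qualified Data.IntMap as IM
    import qualified Data.List as List

    -- Fold insert over the left map's bindings, starting from the right
    -- map: bindings from 'l' overwrite bindings from 'r', as in IM.union l r.
    unionViaInsert :: IM.IntMap a -> IM.IntMap a -> IM.IntMap a
    unionViaInsert l r =
        List.foldl' (\acc (k, v) -> IM.insert k v acc) r (IM.toList l)

This performs m inserts of O(min(n, W)) each, which is the O(m*min(n, W))
bound in question.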
Re: [Haskell-cafe] Data.IntMap union complexity
Hello,

> Looking at IntMap's left-biased 'union' function [1], I noticed that the
> complexity is O(n+m) where n is the size of the left map, and m is the
> size of the right map.
>
> Since insertion [2] is O(min(n, W)) [ where W is the number of bits in an
> Int ], wouldn't it be more efficient to just fold 'insert' over one of the
> lists for a complexity of O(m*min(n, W))? This would degrade into O(m) in
> the worst case, as opposed to the current O(n+m).

Interesting. I would point out that the original paper, "Fast Mergeable
Integer Maps", says that merge is O(n+m). I don't know which one is correct.

--Kazu
Re: [Haskell-cafe] Data.IntMap union complexity
On 2/23/12 9:16 PM, Clark Gaebel wrote:
> Looking at IntMap's left-biased 'union' function [1], I noticed that the
> complexity is O(n+m) where n is the size of the left map, and m is the
> size of the right map.
>
> Since insertion [2] is O(min(n, W)) [ where W is the number of bits in an
> Int ], wouldn't it be more efficient to just fold 'insert' over one of the
> lists for a complexity of O(m*min(n, W))? This would degrade into O(m) in
> the worst case, as opposed to the current O(n+m).

The important things to bear in mind here are (1) that the constant factors
actually matter in practice, and (2) what's actually going on.

While O(min(n,W)) is correct, it's misleading to think of it as just a
constant (or as just a linear function). Though technically incorrect, it's
better to think of it as O(log n) in order to get an intuition for how it
works. And O(m+n) is much nicer than O(m*log n).

Doing a fold with insert means that we must pay the cost of traversing one of
the maps entirely, plus the cost of walking the spine for a lookup/insert m
times. With the merge function, by contrast, we only have to traverse the
portions of the spines which intersect, and we only have to do it in one
pass. In doing the fold we're essentially ignoring the fact that the maps
have a trie structure, since we have to traverse from the top for every
insert; the merge makes use of that structure to avoid redundant traversals
of the top part of the tree. Thus, the merge is doing less work, so, in
theory, it should be faster.

However, again, the thing to beware of is the constant factors. In
particular, big-O analysis doesn't really account for things like locality
and cache coherence, so one should always be on the lookout for places where
duplicating work is actually faster in practice. If you're curious, you can
always implement your own union using the fold-with-insert method and then
run some benchmarks.

--
Live well,
~wren
Re: [Haskell-cafe] Data.IntMap union complexity
The situation where I encountered this is doing a batch update of a map. Is
there an easy way to do that? I'm doing something like adding 20-or-so
elements to an existing map of a few thousand.

On Thu, Feb 23, 2012 at 10:13 PM, wren ng thornton w...@freegeek.org wrote:
> The important things to bear in mind here are (1) that the constant
> factors actually matter in practice, and (2) what's actually going on.
> [...]
> However, again, the thing to beware of is the constant factors. In
> particular, big-O analysis doesn't really account for things like locality
> and cache coherence, so one should always be on the lookout for places
> where duplicating work is actually faster in practice. If you're curious,
> you can always implement your own union using the fold-with-insert method
> and then run some benchmarks.
Re: [Haskell-cafe] Data.IntMap union complexity
On 2/23/12 10:22 PM, Clark Gaebel wrote:
> The situation where I encountered this is doing a batch update of a map.
> Is there an easy way to do that? I'm doing something like adding 20-or-so
> elements to an existing map of a few thousand.

The O(m+n) of the merging functions is actually on the order of the size of
the spine intersection between the two maps. Thus, it only approaches that
limit in cases where the maps are of approximately equal size and the spines
clash in the worst way. An example would be merging the set of all even
numbers with the set of all odd numbers: you have to traverse the whole tree
up to the last bit. Ditto for taking the union of a set with itself.

When the two maps are of vastly different sizes, O(min(m,n)) is a more
intuitive way to think about it. Since you only have to look at the places
where the spines clash, that will be on the order of the smaller map.

--
Live well,
~wren
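To make the even/odd example concrete, here is a small sketch; the key range
and the names evens, odds, worstCase, and bestCase are made up for
illustration.

    import qualified Data.IntMap as IM

    -- Two maps whose spines interleave maximally: the union must descend
    -- to the last bit everywhere, the worst case for merging.
    evens, odds :: IM.IntMap ()
    evens = IM.fromList [ (k, ()) | k <- [0, 2 .. 100000] ]
    odds  = IM.fromList [ (k, ()) | k <- [1, 3 .. 100001] ]

    worstCase :: IM.IntMap ()
    worstCase = IM.union evens odds

    -- By contrast, a tiny map merged into a large one only touches the
    -- parts of the large spine that the small map's keys actually reach.
    bestCase :: IM.IntMap ()
    bestCase = IM.union (IM.fromList [ (k, ()) | k <- [0 .. 20] ]) evens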