Re: [Haskell-cafe] Data.IntMap union complexity

2012-02-25 Thread wren ng thornton

On 2/24/12 3:40 AM, Christoph Breitkopf wrote:

On Fri, Feb 24, 2012 at 4:48 AM, wren ng thorntonw...@freegeek.org  wrote:

When the two maps are of vastly different sizes, O(min(m,n)) is a more
intuitive way to think about it. Since you only have to look at the places
where the spines clash, that will be on the order of the smaller map.


 Folding insert might still be a win if one of the maps is very much 

smaller

than the other, but since size is O(n) for Data.IntMap, there's no way to
find out if that's the case.


It's possible (due to the constant factors I keep mentioning), though I 
wouldn't expect it to. The reasoning why is as follows:


In the smaller map every key will have some shared prefix with other 
keys. (Excepting degenerate cases like the empty map and singleton 
maps.) As such, calling insert with two keys that share a prefix means 
that you'll be traversing the larger map's spine for that prefix twice. 
Whereas, with the merge-based implementation we only traverse once.


Moreover, when performing repeated insertions, we have to reconstruct 
the structure of the smaller map, which can lead to producing excessive 
garbage over the merge-based implementation. For example, consider the 
case where we have keys K and L in the smaller map which diverge at some 
prefix P. If P does not lay in the spine of the larger map, then we will 
have to merge P with some other prefix Q in order to produce a Bin. 
Let's say that P and Q diverge at R (thus there is a Bin at R, with 
children P and Q). After inserting K we now have a map where there is a 
spine from R to K. Now, when we insert L, since we know that P lays on 
the spine between R and K, that means we'll have to merge P with K to 
produce another Bin. The spine from R to K is now garbage--- but it 
wouldn't have been allocated in the merge-based implementation, since K 
and L would have been inserted simultaneously.



If it's really a concern, then I still say the best approach is to just 
benchmark the two options; the API gives you the tools you need. As for 
the O(n) size, you can always define your own data structure which 
memoizes the size; e.g.,


-- N.B., the size field is lazy and therefore only computed on
-- demand, yet it is shared and so only costs O(n) the first
-- time it's accessed since the map was updated.
data SizedIntMap a = SIM Int !(IntMap a)

size (SIM s _) = s

insert k v (SIM _ m) = SIM (IM.size m') m'
where m' = IM.insert k v m

...

It's a lot of boilerplate, and it's be nice if that boilerplate were 
provided once and for all by containers (as Data.IntMap.Sized, or the 
like), but it's simple enough to do.


--
Live well,
~wren

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Data.IntMap union complexity

2012-02-25 Thread wren ng thornton

Evan Laforge qdun...@gmail.com wrote:
 I've wondered if it's faster to insert many keys by successive
 insertion, or by building another map and then unioning, and likewise
 with deletion.  I eventually decided on successive, thinking a log n
 build + merge is going to be slower than a m*log n for successive
 insertion.  So I wound up with:

If you don't already have the keys in a map, I don't think you gain much 
by building a map and then merging rather than just inserting them 
directly. It will produce extra garbage (unless you have some interest 
in the map you're building), and you have to make the same spine 
traversals in building the map as you would have inserting into the 
larger map (and then you have to traverse the larger map during 
merging). But again, thanks to Criterion, benchmarking is cheap and 
easy. No need to believe in the folklore or opinions of others :)


Though, if the set of keys to be added is very large, then the 
build+merge approach would allow you to parallelize the building of the 
map (split the key set in half and build maps for each set, recursing 
as necessary; then either merge the new maps together before merging 
with the target map, or just merge them with the target map in serial).


--
Live well,
~wren

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Data.IntMap union complexity

2012-02-24 Thread Serguey Zefirov
2012/2/24 Clark Gaebel cgae...@csclub.uwaterloo.ca:
 Since insertion [2] is O(min(n, W)) [ where W is the number of bits in an
 Int ], wouldn't it be more efficient to just fold 'insert' over one of the
 lists for a complexity of O(m*min(n, W))? This would degrade into O(m) in
 the worst case, as opposed to the current O(n+m).

 Or am I just crazy?

This way you will create much more garbage, I think. The union of two
completely disjoint maps can hold parts of them intact.

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Data.IntMap union complexity

2012-02-24 Thread Christoph Breitkopf
Folding insert might still be a win if one of the maps is very much smaller
than the other, but since size is O(n) for Data.IntMap, there's no way to
find out if that's the case.

- chris

On Fri, Feb 24, 2012 at 4:48 AM, wren ng thornton w...@freegeek.org wrote:

 When the two maps are of vastly different sizes, O(min(m,n)) is a more
 intuitive way to think about it. Since you only have to look at the places
 where the spines clash, that will be on the order of the smaller map.


___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Data.IntMap union complexity

2012-02-24 Thread Evan Laforge
I've wondered if it's faster to insert many keys by successive
insertion, or by building another map and then unioning, and likewise
with deletion.  I eventually decided on successive, thinking a log n
build + merge is going to be slower than a m*log n for successive
insertion.  So I wound up with:

insert_list :: (Ord k) = [(k, v)] - Map.Map k v - Map.Map k v
insert_list kvs m = List.foldl' (\m (k, v) - Map.insert k v m) m kvs

delete_keys :: (Ord k) = [k] - Map.Map k a - Map.Map k a
delete_keys keys fm = Map.difference fm (Map.fromList [(k, ()) | k - keys])

Oops, I guess I changed my mind by the time I got to writing 'delete_keys' :)

But if the list of things to insert or delete is already sorted, then
you could take advantage of fromListAsc and maybe a union would save
some time?

On Fri, Feb 24, 2012 at 12:38 AM, Serguey Zefirov sergu...@gmail.com wrote:
 2012/2/24 Clark Gaebel cgae...@csclub.uwaterloo.ca:
 Since insertion [2] is O(min(n, W)) [ where W is the number of bits in an
 Int ], wouldn't it be more efficient to just fold 'insert' over one of the
 lists for a complexity of O(m*min(n, W))? This would degrade into O(m) in
 the worst case, as opposed to the current O(n+m).

 Or am I just crazy?

 This way you will create much more garbage, I think. The union of two
 completely disjoint maps can hold parts of them intact.

 ___
 Haskell-Cafe mailing list
 Haskell-Cafe@haskell.org
 http://www.haskell.org/mailman/listinfo/haskell-cafe

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


[Haskell-cafe] Data.IntMap union complexity

2012-02-23 Thread Clark Gaebel
Looking at IntMap's left-biased 'union' function [1], I noticed that the
complexity is O(n+m) where n is the size of the left map, and m is the size
of the right map.

Since insertion [2] is O(min(n, W)) [ where W is the number of bits in an
Int ], wouldn't it be more efficient to just fold 'insert' over one of the
lists for a complexity of O(m*min(n, W))? This would degrade into O(m) in
the worst case, as opposed to the current O(n+m).

Or am I just crazy?

Regards,
  - clark

[1]
http://hackage.haskell.org/packages/archive/containers/0.4.2.1/doc/html/Data-IntMap.html#v:union
[2]
http://hackage.haskell.org/packages/archive/containers/0.4.2.1/doc/html/Data-IntMap.html#v:insert
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Data.IntMap union complexity

2012-02-23 Thread 山本和彦
Hello,

 Looking at IntMap's left-biased 'union' function [1], I noticed that the
 complexity is O(n+m) where n is the size of the left map, and m is the size of
 the right map.
 
 Since insertion [2] is O(min(n, W)) [ where W is the number of bits in an Int
 ], wouldn't it be more efficient to just fold 'insert' over one of the lists
 for a complexity of O(m*min(n, W))? This would degrade into O(m) in the worst
 case, as opposed to the current O(n+m).

Interesting.

I would point out that the original paper Fast Mergeable Integer
Maps says that merge is O(n+m). 

I don't know which one is correct.

--Kazu

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Data.IntMap union complexity

2012-02-23 Thread wren ng thornton

On 2/23/12 9:16 PM, Clark Gaebel wrote:

Looking at IntMap's left-biased 'union' function [1], I noticed that the
complexity is O(n+m) where n is the size of the left map, and m is the size
of the right map.

Since insertion [2] is O(min(n, W)) [ where W is the number of bits in an
Int ], wouldn't it be more efficient to just fold 'insert' over one of the
lists for a complexity of O(m*min(n, W))? This would degrade into O(m) in
the worst case, as opposed to the current O(n+m).


The important things to bear in mind here are (1) the constant factors 
actually matter in practice, and (2) what's actually going on. While 
O(min(n,W)) is correct, it's incorrect to think about it as just a 
constant (or as just a linear function). While technically incorrect, 
it's better to think of it as O(log n) in order to get an intuition for 
how it works. And O(m+n) is much nicer than O(m*log n).


Doing a fold with insert means that we must pay for the cost of 
traversing one of the maps entirely, and the cost of walking the spine 
for a lookup/insert m times. Whereas, with the merge function we only 
have to traverse the portions of the spines which intersect, and we only 
have to do it in one pass. In doing the fold, we're essentially ignoring 
the fact that the maps have a trie structure, since we have to traverse 
from the top for every insert; whereas for the merge, we make use of the 
structure in order to avoid redundant traversals of the top part of the 
structure.


Thus, the merge is doing less work. So, in theory, it should be faster. 
However, again, the thing to beware of is the constant factors. In 
particular, big-O algorithmic analysis doesn't really account for things 
like locality and cache coherence, so one should always be on the 
lookout for places where duplicating work is actually faster in 
practice. If you're curious, you can always implement your own union 
using the fold-with-insert method and then run some benchmarks.


--
Live well,
~wren

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Data.IntMap union complexity

2012-02-23 Thread Clark Gaebel
The situation I encounted this is doing a batch update of a map. Is there
an easy way to do that? I'm doing something like adding 20-or-so elements
to an existing map of a few thousand.

On Thu, Feb 23, 2012 at 10:13 PM, wren ng thornton w...@freegeek.orgwrote:

 On 2/23/12 9:16 PM, Clark Gaebel wrote:

 Looking at IntMap's left-biased 'union' function [1], I noticed that the
 complexity is O(n+m) where n is the size of the left map, and m is the
 size
 of the right map.

 Since insertion [2] is O(min(n, W)) [ where W is the number of bits in an
 Int ], wouldn't it be more efficient to just fold 'insert' over one of the
 lists for a complexity of O(m*min(n, W))? This would degrade into O(m) in
 the worst case, as opposed to the current O(n+m).


 The important things to bear in mind here are (1) the constant factors
 actually matter in practice, and (2) what's actually going on. While
 O(min(n,W)) is correct, it's incorrect to think about it as just a constant
 (or as just a linear function). While technically incorrect, it's better to
 think of it as O(log n) in order to get an intuition for how it works. And
 O(m+n) is much nicer than O(m*log n).

 Doing a fold with insert means that we must pay for the cost of traversing
 one of the maps entirely, and the cost of walking the spine for a
 lookup/insert m times. Whereas, with the merge function we only have to
 traverse the portions of the spines which intersect, and we only have to do
 it in one pass. In doing the fold, we're essentially ignoring the fact that
 the maps have a trie structure, since we have to traverse from the top for
 every insert; whereas for the merge, we make use of the structure in order
 to avoid redundant traversals of the top part of the structure.

 Thus, the merge is doing less work. So, in theory, it should be faster.
 However, again, the thing to beware of is the constant factors. In
 particular, big-O algorithmic analysis doesn't really account for things
 like locality and cache coherence, so one should always be on the lookout
 for places where duplicating work is actually faster in practice. If you're
 curious, you can always implement your own union using the fold-with-insert
 method and then run some benchmarks.

 --
 Live well,
 ~wren

 __**_
 Haskell-Cafe mailing list
 Haskell-Cafe@haskell.org
 http://www.haskell.org/**mailman/listinfo/haskell-cafehttp://www.haskell.org/mailman/listinfo/haskell-cafe


___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Data.IntMap union complexity

2012-02-23 Thread wren ng thornton

On 2/23/12 10:22 PM, Clark Gaebel wrote:

The situation I encounted this is doing a batch update of a map. Is there
an easy way to do that? I'm doing something like adding 20-or-so elements
to an existing map of a few thousand.


The O(m+n) of the merging functions is actually on the order of the size 
of the spine intersection between the two maps. Thus, it only ever 
approaches that limit in cases where the maps are of approximately equal 
size and are designed for the spines to clash in the worst way. An 
example would be if you're merging the set of all even numbers and the 
set of all odd numbers; you have to traverse the whole tree up to the 
last bit. Ditto for taking the union of a set with itself.


When the two maps are of vastly different sizes, O(min(m,n)) is a more 
intuitive way to think about it. Since you only have to look at the 
places where the spines clash, that will be on the order of the smaller map.


--
Live well,
~wren

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe