Re: [Haskell-cafe] implementing python-style dictionary in Haskell

2008-11-22 Thread Ryan Ingram
On Sat, Nov 22, 2008 at 1:20 PM, Jason Dusek <[EMAIL PROTECTED]> wrote:
> Ryan Ingram <[EMAIL PROTECTED]> wrote:
>> ...persistent data structures tend to have much worse constant
>> factors and those factors translate to a general 2x-3x
>> slowdown.
>
>  Can you explain why that is, or provide a citation for it?
>  It's not something I've found easy to Google.

Consider insertion into a simple binary tree (no balancing condition).

The persistent algorithm is:

insert :: Key -> Tree -> Tree
insert k Tip = Node k Nil Nil
insert k (Node k' l r)
| k < k' = Node k' (insert k l) r
| otherwise = Node k' l (insert k r)

The ephemeral algorithm is:

insert :: Key -> IORef Tree -> IO ()
insert k p = do
t <- readIORef p
case t of
Tip -> do
l <- newIORef Tip
r <- newIORef Tip
writeIORef p (Node k l r)
Node k' l r -> insert k $ if k < k' then l else r

The big difference between these two algorithms is the amount of
allocation and copying going on.  Both dereference basically the same
number of pointers.  The ephemeral algorithm allocates exactly one new
node and modifies exactly one pointer in memory.  The persistent
algorithm, on the other hand, copies the entire path being traversed
down the tree, allocating that many nodes as well.  (All of the "Tip"
nodes can be shared; it can be treated like "NULL" in C)

Unfortunately, I don't have any references; the 2-3x is an intuitive
number from my past experience.  It's worse for algorithms where you
need to explicitly simulate pointers with maps because the structure
is inherently ephemeral, and better for simple structures like the
aforementioned binary tree.

   -- ryan
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] implementing python-style dictionary in Haskell

2008-11-22 Thread Janis Voigtlaender

Ryan Ingram wrote:

On Sat, Nov 22, 2008 at 5:33 AM, Janis Voigtlaender
<[EMAIL PROTECTED]> wrote:


You can generally make a persistent data structure with the same
asymptotic bounds as the ephemeral structure, ...


I would be very careful with the "generally" here. At least, I am not
aware that this has been proved to always be possible.



Here's an informal proof:

You can use an intmap to emulate pointers, to turn any ephemeral data
structure into a persistent one.  That adds at most a log-n factor to
lookups and updates.  For many structures, this is enough to prove
asymptotic bounds equivalence.

However, the standard 'pointer' model for ephemeral structures makes
the assumption that memory size is limited; otherwise you have to add
a log(n) factor there anyways, both to hold the large pointer values
that get generated and to actually send the bits to the memory bus.
Given this assumption you can take log(memory size) as a constant and
argue that pointer lookup and update via the map is O(1).


Ah, that makes sense.

--
Dr. Janis Voigtlaender
http://wwwtcs.inf.tu-dresden.de/~voigt/
mailto:[EMAIL PROTECTED]
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] implementing python-style dictionary in Haskell

2008-11-22 Thread Jason Dusek
Ryan Ingram <[EMAIL PROTECTED]> wrote:
> ...persistent data structures tend to have much worse constant
> factors and those factors translate to a general 2x-3x
> slowdown.

  Can you explain why that is, or provide a citation for it?
  It's not something I've found easy to Google.

--
_jsn
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] implementing python-style dictionary in Haskell

2008-11-22 Thread Ryan Ingram
On Sat, Nov 22, 2008 at 5:33 AM, Janis Voigtlaender
<[EMAIL PROTECTED]> wrote:
>> You can generally make a persistent data structure with the same
>> asymptotic bounds as the ephemeral structure, ...
>
> I would be very careful with the "generally" here. At least, I am not
> aware that this has been proved to always be possible.

Here's an informal proof:

You can use an intmap to emulate pointers, to turn any ephemeral data
structure into a persistent one.  That adds at most a log-n factor to
lookups and updates.  For many structures, this is enough to prove
asymptotic bounds equivalence.

However, the standard 'pointer' model for ephemeral structures makes
the assumption that memory size is limited; otherwise you have to add
a log(n) factor there anyways, both to hold the large pointer values
that get generated and to actually send the bits to the memory bus.
Given this assumption you can take log(memory size) as a constant and
argue that pointer lookup and update via the map is O(1).

> Also, in
> assertions about "the same asymptotic bounds", in this and a previous
> post in this thread, a distinction is important between worst-case and
> amortized costs. Just to complete the picture...

That's true, but I think the more important distinction is the
constant attached to the big O; persistent data structures tend to
have much worse constant factors and those factors translate to a
general 2x-3x slowdown.  It's often true that a worse asymptotic cost
algorithm is better because the constant factors are much better and
the expected N in your program is small enough.

  -- ryan
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] implementing python-style dictionary in Haskell

2008-11-22 Thread Janis Voigtlaender

Ryan Ingram wrote:

On Tue, Nov 18, 2008 at 12:46 PM, Luke Palmer <[EMAIL PROTECTED]> wrote:


But when these persistent data structures are used in a
single-threaded way, why should we not hope for the performance to be
comparable?



If you can guarantee single-threaded use, then you can just use ST and
implement the ephemeral structure, right?



It may not be easy, but just saying "they are persistent" is not
really an excuse.



You can generally make a persistent data structure with the same
asymptotic bounds as the ephemeral structure, ...


I would be very careful with the "generally" here. At least, I am not
aware that this has been proved to always be possible. Also, in
assertions about "the same asymptotic bounds", in this and a previous
post in this thread, a distinction is important between worst-case and
amortized costs. Just to complete the picture...

Ciao, Janis.

--
Dr. Janis Voigtlaender
http://wwwtcs.inf.tu-dresden.de/~voigt/
mailto:[EMAIL PROTECTED]

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] implementing python-style dictionary in Haskell

2008-11-19 Thread Ketil Malde
Tillmann Rendel <[EMAIL PROTECTED]> writes:

> Why should a Haskell hash table need more memory then a Python hash
> table? I've heard that Data.HashTable is bad, so maybe writing a good
> one could be an option.

One problem is that Haskell collections are lazy by default.  I'm 
aware of a few use cases where laziness lets you formulate a very
elegant recursive population of a collection, but I think that in
general, strictness is what you want, and further, that if you want
lazy, you store your data as one-tuples: data Lazy a = Lazy a  

(If there's a workaround/solution in the other direction, I'd really
like to hear it).

I'm therefore tempted to suggest that collections should be
strict by default, and in particular, that there should be strict
arrays for arbitrary types, not just the ones that happen to be
unboxable. Unboxing should be an optimization for (some) strict
arrays.

-k
-- 
If I haven't seen further, it is by standing in the footprints of giants
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] implementing python-style dictionary in Haskell

2008-11-19 Thread Don Stewart
dave:
> On Tue, Nov 18, 2008 at 3:52 PM, Don Stewart <[EMAIL PROTECTED]> wrote:
> > dave:
> >> 2008/11/18 kenny lu <[EMAIL PROTECTED]>:
> >> > The above number shows that my implementations of python style dictionary
> >> > are space/time in-efficient as compared to python.
> >> >
> >> > Can some one point out what's wrong with my implementations?
> >>
> >> This isn't really a fair comparison. Map, IntMap, and Trie are
> >> persistent data structures, and Python dictionaries are ephemeral.
> >> (That is, when you "add" a key to a Map, you actually create a new one
> >> that shares structure with the old one, and both can be used in
> >> subsequent code. In Python, you would have to copy the dictionary.)
> >>
> >
> > Strings, not ByteStrings. that's the difference.
> 
> Is that in response to what I wrote, or to the original question?
> 
> Speaking of ByteStrings and tries, has anyone implemented a Patricia
> Trie for ByteStrings? I started putting one together a while back, but
> I got distracted and never finished it.

I started putting one together a while back but I got distracted and
never finished it. I think its a couple of days polishing.

-- Don
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] implementing python-style dictionary in Haskell

2008-11-18 Thread Luke Palmer
On Tue, Nov 18, 2008 at 8:51 PM, Ryan Ingram <[EMAIL PROTECTED]> wrote:
> On Tue, Nov 18, 2008 at 12:46 PM, Luke Palmer <[EMAIL PROTECTED]> wrote:
>> But when these persistent data structures are used in a
>> single-threaded way, why should we not hope for the performance to be
>> comparable?
>
> If you can guarantee single-threaded use, then you can just use ST and
> implement the ephemeral structure, right?

But that requires a special reimplementation.

>> It may not be easy, but just saying "they are persistent" is not
>> really an excuse.
>
> You can generally make a persistent data structure with the same
> asymptotic bounds as the ephemeral structure, but the constant hidden
> inside the O() will generally be worse.

I say this as a goal.  If we're in a performance competition, we can't
say "well, it's okay that Haskell is slower because its data
structures can be used persistently".  Python's dictionaries can also,
by inserting explicit copies.  In this use case Python performs
better, and we should strive to perform as well as it does.
Persistence has no bearing on this, because the persistence is not
used.

I'm not saying it's always possible to perform just as well.  But
persistence *by itself* is not a valid argument for poor performance.

Luke
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] implementing python-style dictionary in Haskell

2008-11-18 Thread Ryan Ingram
On Tue, Nov 18, 2008 at 12:46 PM, Luke Palmer <[EMAIL PROTECTED]> wrote:
> But when these persistent data structures are used in a
> single-threaded way, why should we not hope for the performance to be
> comparable?

If you can guarantee single-threaded use, then you can just use ST and
implement the ephemeral structure, right?

> It may not be easy, but just saying "they are persistent" is not
> really an excuse.

You can generally make a persistent data structure with the same
asymptotic bounds as the ephemeral structure, but the constant hidden
inside the O() will generally be worse.
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] implementing python-style dictionary in Haskell

2008-11-18 Thread David Menendez
On Tue, Nov 18, 2008 at 3:46 PM, Luke Palmer <[EMAIL PROTECTED]> wrote:
> On Tue, Nov 18, 2008 at 10:37 AM, David Menendez <[EMAIL PROTECTED]> wrote:
>> This isn't really a fair comparison. Map, IntMap, and Trie are
>> persistent data structures, and Python dictionaries are ephemeral.
>> (That is, when you "add" a key to a Map, you actually create a new one
>> that shares structure with the old one, and both can be used in
>> subsequent code. In Python, you would have to copy the dictionary.)
>
> But when these persistent data structures are used in a
> single-threaded way, why should we not hope for the performance to be
> comparable?
>
> It may not be easy, but just saying "they are persistent" is not
> really an excuse.

I guess that depends on what you mean by "comparable". Chris Okasaki
demonstrated that, for some data structures, a persistent
implementation could be made with the same asymptotic bounds as an
ephemeral implementation, but I would still expect the persistent
version to be worse by a constant factor when used ephemerally.
Ephemeral data structures are naturally optimized for ephemeral use
cases. (I would also expect the reverse to be true.)

-- 
Dave Menendez <[EMAIL PROTECTED]>

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] implementing python-style dictionary in Haskell

2008-11-18 Thread David Menendez
On Tue, Nov 18, 2008 at 3:52 PM, Don Stewart <[EMAIL PROTECTED]> wrote:
> dave:
>> 2008/11/18 kenny lu <[EMAIL PROTECTED]>:
>> > The above number shows that my implementations of python style dictionary
>> > are space/time in-efficient as compared to python.
>> >
>> > Can some one point out what's wrong with my implementations?
>>
>> This isn't really a fair comparison. Map, IntMap, and Trie are
>> persistent data structures, and Python dictionaries are ephemeral.
>> (That is, when you "add" a key to a Map, you actually create a new one
>> that shares structure with the old one, and both can be used in
>> subsequent code. In Python, you would have to copy the dictionary.)
>>
>
> Strings, not ByteStrings. that's the difference.

Is that in response to what I wrote, or to the original question?

Speaking of ByteStrings and tries, has anyone implemented a Patricia
Trie for ByteStrings? I started putting one together a while back, but
I got distracted and never finished it.

-- 
Dave Menendez <[EMAIL PROTECTED]>

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


clojure's data structures (was: Re: [Haskell-cafe] implementing python-style dictionary in Haskell)

2008-11-18 Thread Evan Laforge
This actually brings up something I was wondering about lately.  I
recently stumbled across a language called clojure, which is a
lisp-like that runs on the JVM.  The interesting thing is that
mutations go through a transactional mutable reference system, and the
other datastructures are all immutable.  The author implements an
immutable hash map with a trie on each 5 bits of the hash, so each
node potentially has 32 children.  This means that lookups are
effectively O(1) and alterations have to copy a maximum of 7 chunks of
32 pointers.  Extendable "vectors" are implemented similarly.

The hash tables sound basically like an IntMap on the hash code,
except as far as I know, IntMap's patricia tree splits up the bit
space adaptively instead of being hardcoded on 5 bit segments.  I'm
not sure what the performance difference would be.  I haven't run any
tests, but the clojure author claims his vector and hash map are quite
fast.  I suppose the extendable vector corresponds to the current crop
of ByteString-like variants, so we sort of already have that, though
in a kind of chaotic and inchoate way.

I'm also curious about how far the generalized trie SoC project got...
 it should be more accessible now that we have a mainline release with
ATs, right?
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] implementing python-style dictionary in Haskell

2008-11-18 Thread Don Stewart
dave:
> 2008/11/18 kenny lu <[EMAIL PROTECTED]>:
> > Here is a comparison of memory usage
> >
> > Map : 345 MB
> > IntMap : 146 MB
> > Trie : 282 MB
> > Python : 94 MB
> >
> > Here is a comparison of execution time (on an intel dual core 2.0G)
> >
> > Map: 26 sec
> > IntMap: 9 sec
> > Trie: 12 sec
> > Python: 2.24 sec
> >
> >
> > The above number shows that my implementations of python style dictionary
> > are space/time in-efficient as compared to python.
> >
> > Can some one point out what's wrong with my implementations?
> 
> This isn't really a fair comparison. Map, IntMap, and Trie are
> persistent data structures, and Python dictionaries are ephemeral.
> (That is, when you "add" a key to a Map, you actually create a new one
> that shares structure with the old one, and both can be used in
> subsequent code. In Python, you would have to copy the dictionary.)
> 

Strings, not ByteStrings. that's the difference.

-- Don
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] implementing python-style dictionary in Haskell

2008-11-18 Thread Luke Palmer
On Tue, Nov 18, 2008 at 10:37 AM, David Menendez <[EMAIL PROTECTED]> wrote:
> This isn't really a fair comparison. Map, IntMap, and Trie are
> persistent data structures, and Python dictionaries are ephemeral.
> (That is, when you "add" a key to a Map, you actually create a new one
> that shares structure with the old one, and both can be used in
> subsequent code. In Python, you would have to copy the dictionary.)

But when these persistent data structures are used in a
single-threaded way, why should we not hope for the performance to be
comparable?

It may not be easy, but just saying "they are persistent" is not
really an excuse.

Luke
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] implementing python-style dictionary in Haskell

2008-11-18 Thread David Menendez
2008/11/18 kenny lu <[EMAIL PROTECTED]>:
> Here is a comparison of memory usage
>
> Map : 345 MB
> IntMap : 146 MB
> Trie : 282 MB
> Python : 94 MB
>
> Here is a comparison of execution time (on an intel dual core 2.0G)
>
> Map: 26 sec
> IntMap: 9 sec
> Trie: 12 sec
> Python: 2.24 sec
>
>
> The above number shows that my implementations of python style dictionary
> are space/time in-efficient as compared to python.
>
> Can some one point out what's wrong with my implementations?

This isn't really a fair comparison. Map, IntMap, and Trie are
persistent data structures, and Python dictionaries are ephemeral.
(That is, when you "add" a key to a Map, you actually create a new one
that shares structure with the old one, and both can be used in
subsequent code. In Python, you would have to copy the dictionary.)

-- 
Dave Menendez <[EMAIL PROTECTED]>

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] implementing python-style dictionary in Haskell

2008-11-18 Thread Don Stewart
Great. Assuming you're following the advice to use bytestrings, please
install the newest bytestring library version, here,

http://hackage.haskell.org/cgi-bin/hackage-scripts/package/bytestring

Data.Map or Data.IntMap with bytestrings should be quite efficient.
(or use a trie if more precision is needed)

-- Don

haskellmail:
>Dear Don,
>I am using GHC 6.8.1
> 
>Regards,
>Kenny
> 
>On Tue, Nov 18, 2008 at 11:33 PM, Don Stewart <[EMAIL PROTECTED]> wrote:
> 
>  Which version of GHC and which version of the Data.ByteString library?
>  There was an inlining bug related to Data.Map /Data.IntMap performance
>  fixed between the 6.8.x release and the current bytestring release.
> 
>  In testing, Data.Map with strict bytestring keys matched the python (C
>  implemented) dictionary, after I fixed the inlining for word lookups.
> 
>  You'll need to be using bytestring 0.9.1.x though.
> 
>  -- Don
> 
>  haskellmail:
>  >Dear all,
>  >
>  >I am trying to implement the python-style dictionary in Haskell.
>  >
>  >Python dictionary is a data structure that maps one key to one
>  value.
>  >For instance, a python dictionary
>  >d = {'a':1, 'b':2 }
>  >maps key 'a' to 1, 'b' to 2.
>  >Python dictionary allows for update. e.g. the statement
>  >d['a'] = 3
>  >changes the value pointed by 'a' from 1 to 3.
>  >
>  >Internally, python dictionary is implemented using hash table.
>  >
>  >My first attempt is to use Data.HashTable. However it was
>  immediately
>  >abandoned, as I realize the memory usage is unreasonably huge.
>  >
>  >== SECOND ATTEMPT ==
>  >My second attempt is to use Data.Map
>  >
>  >{-# OPTIONS_GHC -fglasgow-exts #-}
>  >module Main where
>  >
>  >import qualified Data.HashTable as HT
>  >import qualified Data.IntMap as IM
>  >import qualified Data.Map as DM
>  >import qualified Data.ByteString.Char8 as S
>  >import Data.Char
>  >
>  >-- the Dict type class
>  >class Dict d k v | d -> k v where
>  >empty :: d
>  >insert :: k -> v -> d -> d
>  >lookup :: k -> d -> Maybe v
>  >update :: k -> v -> d -> d
>  >
>  >-- Let's use string as key
>  >type Key = String
>  >
>  >-- insert key-value pairs into a dictionary
>  >fromList :: Dict d k a => [(k,a)] -> d
>  >fromList l =
>  >foldl (\d (key,val) -> insert key val d) empty l
>  >
>  >instance Dict (DM.Map S.ByteString a) Key a where
>  >empty = DM.empty
>  >insert key val dm =
>  >let packed_key = S.pack key
>  >in DM.insert packed_key val dm
>  >lookup key dm =
>  >let packed_key = S.pack key
>  >in DM.lookup packed_key dm
>  >update key val dm =
>  >let packed_key = S.pack key
>  >in DM.update (\x -> Just val) packed_key dm
>  >
>  >Which kinda works, however since Map is implemented using a
>  balanced tree,
>  >therefore,
>  >when as the dictionary grows, it takes a long time to insert new
>  key-value
>  >pair.
>  >
>  >== THIRD ATTEMPT ==
>  >My third attempt is to use Data.IntMap
>  >
>  >-- an implementation of Dict using IntMap
>  >instance Dict (IM.IntMap a) Key a where
>  >empty = IM.empty
>  >insert key val im =
>  >let int_key = fromIntegral (HT.hashString key)
>  >in IM.insert int_key val im
>  >lookup key im =
>  >let int_key = fromIntegral (HT.hashString key)
>  >in IM.lookup int_key im
>  >update key val im =
>  >let int_key = fromIntegral (HT.hashString key)
>  >in IM.update (\x -> Just val) int_key im
>  >
>  >This implementation is faster than the Map approach, however this
>  >implementation
>  >can't handle collision well, two keys which are hashed into the
>  same
>  >integer will overwrite each other.
>  >
>  >== FOURTH ATTEMPT ==
>  >
>  >My fourth implementation is to use Trie. The idea is to split a
>  string (a
>  >key) into
>  >a list of 4-character chunks. Each chunk can be mapped into a
>  32-bit
>  >integer without collision.
>  >We then insert the value with this list of chunks into the Trie.
>  >
>  >-- an implementation of Dict using Trie
>  >instance Dict (Trie a) Key a where
>  >empty = emptyTrie
>  >insert key val trie =
>  >let key_chain = chain key
>  >in insertTrie key_chain val trie
>  >lookup key trie =
>  >  

Re: [Haskell-cafe] implementing python-style dictionary in Haskell

2008-11-18 Thread kenny lu
Dear Don,
I am using GHC 6.8.1

Regards,
Kenny

On Tue, Nov 18, 2008 at 11:33 PM, Don Stewart <[EMAIL PROTECTED]> wrote:

> Which version of GHC and which version of the Data.ByteString library?
> There was an inlining bug related to Data.Map /Data.IntMap performance
> fixed between the 6.8.x release and the current bytestring release.
>
> In testing, Data.Map with strict bytestring keys matched the python (C
> implemented) dictionary, after I fixed the inlining for word lookups.
>
> You'll need to be using bytestring 0.9.1.x though.
>
> -- Don
>
> haskellmail:
> >Dear all,
> >
> >I am trying to implement the python-style dictionary in Haskell.
> >
> >Python dictionary is a data structure that maps one key to one value.
> >For instance, a python dictionary
> >d = {'a':1, 'b':2 }
> >maps key 'a' to 1, 'b' to 2.
> >Python dictionary allows for update. e.g. the statement
> >d['a'] = 3
> >changes the value pointed by 'a' from 1 to 3.
> >
> >Internally, python dictionary is implemented using hash table.
> >
> >My first attempt is to use Data.HashTable. However it was immediately
> >abandoned, as I realize the memory usage is unreasonably huge.
> >
> >== SECOND ATTEMPT ==
> >My second attempt is to use Data.Map
> >
> >{-# OPTIONS_GHC -fglasgow-exts #-}
> >module Main where
> >
> >import qualified Data.HashTable as HT
> >import qualified Data.IntMap as IM
> >import qualified Data.Map as DM
> >import qualified Data.ByteString.Char8 as S
> >import Data.Char
> >
> >-- the Dict type class
> >class Dict d k v | d -> k v where
> >empty :: d
> >insert :: k -> v -> d -> d
> >lookup :: k -> d -> Maybe v
> >update :: k -> v -> d -> d
> >
> >-- Let's use string as key
> >type Key = String
> >
> >-- insert key-value pairs into a dictionary
> >fromList :: Dict d k a => [(k,a)] -> d
> >fromList l =
> >foldl (\d (key,val) -> insert key val d) empty l
> >
> >instance Dict (DM.Map S.ByteString a) Key a where
> >empty = DM.empty
> >insert key val dm =
> >let packed_key = S.pack key
> >in DM.insert packed_key val dm
> >lookup key dm =
> >let packed_key = S.pack key
> >in DM.lookup packed_key dm
> >update key val dm =
> >let packed_key = S.pack key
> >in DM.update (\x -> Just val) packed_key dm
> >
> >Which kinda works, however since Map is implemented using a balanced
> tree,
> >therefore,
> >when as the dictionary grows, it takes a long time to insert new
> key-value
> >pair.
> >
> >== THIRD ATTEMPT ==
> >My third attempt is to use Data.IntMap
> >
> >-- an implementation of Dict using IntMap
> >instance Dict (IM.IntMap a) Key a where
> >empty = IM.empty
> >insert key val im =
> >let int_key = fromIntegral (HT.hashString key)
> >in IM.insert int_key val im
> >lookup key im =
> >let int_key = fromIntegral (HT.hashString key)
> >in IM.lookup int_key im
> >update key val im =
> >let int_key = fromIntegral (HT.hashString key)
> >in IM.update (\x -> Just val) int_key im
> >
> >This implementation is faster than the Map approach, however this
> >implementation
> >can't handle collision well, two keys which are hashed into the same
> >integer will overwrite each other.
> >
> >== FOURTH ATTEMPT ==
> >
> >My fourth implementation is to use Trie. The idea is to split a string
> (a
> >key) into
> >a list of 4-character chunks. Each chunk can be mapped into a 32-bit
> >integer without collision.
> >We then insert the value with this list of chunks into the Trie.
> >
> >-- an implementation of Dict using Trie
> >instance Dict (Trie a) Key a where
> >empty = emptyTrie
> >insert key val trie =
> >let key_chain = chain key
> >in insertTrie key_chain val trie
> >lookup key trie =
> >let key_chain = chain key
> >in lookupTrie key_chain trie
> >update key val trie =
> >let key_chain = chain key
> >in updateTrie key_chain val trie
> >
> >-- an auxillary function that "splits" string into small pieces,
> >-- 4 characters per piece, 4 chars = 32 bit
> >chain :: Key -> [Key]
> >chain k | length k > 4 = let (k',ks) = splitAt 4 k
> > in (k':chain ks)
> >| otherwise= [k]
> >
> >-- a collision-free hash function which turns four chars into Int32
> >safehash :: [Char] -> Int
> >safehash cs | length cs > 4 = error "safehash failed."
> >| otherwise =
> >sum [ (ord c)*(256^i)   | (c,i) <- zip cs [0..3] ]
> >
> >-- a trie datatype
> >data Trie a = Trie [a] (IM.IntMap (Trie a))
> >
> >-- the empty trie
> >emptyTrie = Trie [] (IM.empty)
> >
> >-- insert value into the trie
> >

Re: [Haskell-cafe] implementing python-style dictionary in Haskell

2008-11-18 Thread Don Stewart
Which version of GHC and which version of the Data.ByteString library?
There was an inlining bug related to Data.Map /Data.IntMap performance
fixed between the 6.8.x release and the current bytestring release.

In testing, Data.Map with strict bytestring keys matched the python (C
implemented) dictionary, after I fixed the inlining for word lookups.

You'll need to be using bytestring 0.9.1.x though.

-- Don

haskellmail:
>Dear all,
> 
>I am trying to implement the python-style dictionary in Haskell.
> 
>Python dictionary is a data structure that maps one key to one value.
>For instance, a python dictionary
>d = {'a':1, 'b':2 }
>maps key 'a' to 1, 'b' to 2.
>Python dictionary allows for update. e.g. the statement
>d['a'] = 3
>changes the value pointed by 'a' from 1 to 3.
> 
>Internally, python dictionary is implemented using hash table.
> 
>My first attempt is to use Data.HashTable. However it was immediately
>abandoned, as I realize the memory usage is unreasonably huge.
> 
>== SECOND ATTEMPT ==
>My second attempt is to use Data.Map
> 
>{-# OPTIONS_GHC -fglasgow-exts #-}
>module Main where
> 
>import qualified Data.HashTable as HT
>import qualified Data.IntMap as IM
>import qualified Data.Map as DM
>import qualified Data.ByteString.Char8 as S
>import Data.Char
> 
>-- the Dict type class
>class Dict d k v | d -> k v where
>empty :: d
>insert :: k -> v -> d -> d
>lookup :: k -> d -> Maybe v
>update :: k -> v -> d -> d
> 
>-- Let's use string as key
>type Key = String
> 
>-- insert key-value pairs into a dictionary
>fromList :: Dict d k a => [(k,a)] -> d
>fromList l =
>foldl (\d (key,val) -> insert key val d) empty l
> 
>instance Dict (DM.Map S.ByteString a) Key a where
>empty = DM.empty
>insert key val dm =
>let packed_key = S.pack key
>in DM.insert packed_key val dm
>lookup key dm =
>let packed_key = S.pack key
>in DM.lookup packed_key dm
>update key val dm =
>let packed_key = S.pack key
>in DM.update (\x -> Just val) packed_key dm
> 
>Which kinda works, however since Map is implemented using a balanced tree,
>therefore,
>when as the dictionary grows, it takes a long time to insert new key-value
>pair.
> 
>== THIRD ATTEMPT ==
>My third attempt is to use Data.IntMap
> 
>-- an implementation of Dict using IntMap
>instance Dict (IM.IntMap a) Key a where
>empty = IM.empty
>insert key val im =
>let int_key = fromIntegral (HT.hashString key)
>in IM.insert int_key val im
>lookup key im =
>let int_key = fromIntegral (HT.hashString key)
>in IM.lookup int_key im
>update key val im =
>let int_key = fromIntegral (HT.hashString key)
>in IM.update (\x -> Just val) int_key im
> 
>This implementation is faster than the Map approach, however this
>implementation
>can't handle collision well, two keys which are hashed into the same
>integer will overwrite each other.
> 
>== FOURTH ATTEMPT ==
> 
>My fourth implementation is to use Trie. The idea is to split a string (a
>key) into
>a list of 4-character chunks. Each chunk can be mapped into a 32-bit
>integer without collision.
>We then insert the value with this list of chunks into the Trie.
> 
>-- an implementation of Dict using Trie
>instance Dict (Trie a) Key a where
>empty = emptyTrie
>insert key val trie =
>let key_chain = chain key
>in insertTrie key_chain val trie
>lookup key trie =
>let key_chain = chain key
>in lookupTrie key_chain trie
>update key val trie =
>let key_chain = chain key
>in updateTrie key_chain val trie
> 
>-- an auxillary function that "splits" string into small pieces,
>-- 4 characters per piece, 4 chars = 32 bit
>chain :: Key -> [Key]
>chain k | length k > 4 = let (k',ks) = splitAt 4 k
> in (k':chain ks)
>| otherwise= [k]
> 
>-- a collision-free hash function which turns four chars into Int32
>safehash :: [Char] -> Int
>safehash cs | length cs > 4 = error "safehash failed."
>| otherwise =
>sum [ (ord c)*(256^i)   | (c,i) <- zip cs [0..3] ]
> 
>-- a trie datatype
>data Trie a = Trie [a] (IM.IntMap (Trie a))
> 
>-- the empty trie
>emptyTrie = Trie [] (IM.empty)
> 
>-- insert value into the trie
>insertTrie :: [String] -> a -> Trie a -> Trie a
>insertTrie [] i (Trie is maps) = Trie (i:is) maps
>insertTrie (word:words) i (Trie is maps) =
>let key = safehash word
>in case IM.lookup key maps of
> { Just trie -> let trie' = insertTrie words i trie
>maps' = IM.update (\x -> Just trie') key maps
>in Tr

Re: [Haskell-cafe] implementing python-style dictionary in Haskell

2008-11-18 Thread Alberto G. Corona
The most balanced case may be the insertion of a element in the middle of  a
list, but this is far worst than to insert an element in a particular branch
of a tree ( it needs an average of list-lenght/2 element creations while in
a tree  needs only  (average-branch-length)/2)

I refer to Maps, because Hashtables, in the IO monad, are mutable.  by the
way  let map2= map1 takes 0 bytes of memory And both do not share side
effects, while creating two copies of a hastable to avoid side effects
between them needs 2 * size.

2008/11/18 Tillmann Rendel <[EMAIL PROTECTED]>

> Hello Alberto,
>
> I cc this to haskell-cafe again.
>
> Alberto G. Corona wrote:
>
>> Not so much memory, because data is referentially transparent, the new Map
>> can point to whole subtrees of the old map that stay the same. is similar
>> than  when a new list is created by prefixing a new element from a old
>> list
>> ys= x:xs.  ys is not at all a fresh copy, but  x plus a pointer to the
>> head
>> of xs. this is the only new data that is needed to create ys.
>>
>
> You could just as well compare with appending a new element to the end of
> the list, which needs a complete copy of the list structure to be made. One
> has to look more closely to see which case it is here.
>
> More specifically, I do not see how this sharing of substructures could be
> employed in the implementation of hash tables, which rely on O(1) random
> access into arrays.
>
>  Tillmann
>
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] implementing python-style dictionary in Haskell

2008-11-18 Thread Tillmann Rendel

Hello Alberto,

I cc this to haskell-cafe again.

Alberto G. Corona wrote:

Not so much memory, because data is referentially transparent, the new Map
can point to whole subtrees of the old map that stay the same. is similar
than  when a new list is created by prefixing a new element from a old list
ys= x:xs.  ys is not at all a fresh copy, but  x plus a pointer to the head
of xs. this is the only new data that is needed to create ys.


You could just as well compare with appending a new element to the end 
of the list, which needs a complete copy of the list structure to be 
made. One has to look more closely to see which case it is here.


More specifically, I do not see how this sharing of substructures could 
be employed in the implementation of hash tables, which rely on O(1) 
random access into arrays.


  Tillmann
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] implementing python-style dictionary in Haskell

2008-11-18 Thread Krzysztof Skrzętnicki
2008/11/18 kenny lu <[EMAIL PROTECTED]>:
> Dear all,
>
> I am trying to implement the python-style dictionary in Haskell.
>
> Python dictionary is a data structure that maps one key to one value.
> For instance, a python dictionary
> d = {'a':1, 'b':2 }
> maps key 'a' to 1, 'b' to 2.
> Python dictionary allows for update. e.g. the statement
> d['a'] = 3
> changes the value pointed by 'a' from 1 to 3.
>
> Internally, python dictionary is implemented using hash table.
>
> My first attempt is to use Data.HashTable. However it was immediately
> abandoned, as I realize the memory usage is unreasonably huge.
>
> ...
>
> I tested all three implementations by building a dictionary of size 100.
> The result shows that the Map and the Trie approaches handle collision well,
> but
> the IntMap approach does not.
>
>
> Here is a comparison of memory usage
>
> Map : 345 MB
> IntMap : 146 MB
> Trie : 282 MB
> Python : 94 MB
>
> Here is a comparison of execution time (on an intel dual core 2.0G)
>
> Map: 26 sec
> IntMap: 9 sec
> Trie: 12 sec
> Python: 2.24 sec
>
>
> The above number shows that my implementations of python style dictionary
> are space/time in-efficient as compared to python.
>
> Can some one point out what's wrong with my implementations?
>

First of all, you use Strings. That's a very bad thing when you care
about memory restrictions. Fire up ghci type something like this:
> let aas = replicate (1024*1024*10) 'a'
> -- 22 Mb memory usage
> length aas
> 10485760
> -- 270 Mb memory usage
10 Mb string caused as much as 250 Mb increase in ghci's memory consumption.

My guess? Use hashtables with ByteStrings.
I rewrote part of your code. Results are quite promising.

Haskell:
121 Mb total memory in use

 INIT  time0.02s  (  0.00s elapsed)
 MUT   time0.84s  (  1.00s elapsed)
 GCtime1.97s  (  2.02s elapsed)
 EXIT  time0.00s  (  0.00s elapsed)
 Total time2.83s  (  3.02s elapsed)

 %GC time  69.6%  (66.8% elapsed)
Python:
$ time python dict.py
256

real0m2.278s
user0m2.233s
sys 0m0.078s

memory: 101 Mb (as reported by Windows' Task Manager).

The code:

--- cut ---

import qualified Data.ByteString.Lazy.Char8 as BS
import Data.Int
import Data.Bits

(...)

parse_a_line_BS :: BS.ByteString -> (BS.ByteString,Int)
parse_a_line_BS line = case BS.words line of
  [key,val] -> (key,(read . BS.unpack) val)
  _ -> error " parse error.  "


main :: IO ()
main = do
  dict <- HT.new (==) hashByteString
  indata <- (map parse_a_line_BS `fmap` BS.lines `fmap` BS.readFile "in.txt")
  mapM_ (\ (k,v) -> HT.insert dict k v) indata
  HT.lookup dict (BS.pack "key256") >>= \v -> case v of
Just vv -> putStrLn (show vv)
Nothing -> putStrLn
("Not found")

-- derived from Data.HashTable.hashString
hashByteString :: BS.ByteString -> Int32
hashByteString = BS.foldl' f golden
   where f m c = fromIntegral (ord c) * magic + hashInt32 m
 magic = 0xdeadbeef

hashInt32 :: Int32 -> Int32
hashInt32 x = mulHi x golden + x

mulHi :: Int32 -> Int32 -> Int32
mulHi a b = fromIntegral (r `shiftR` 32)
   where r :: Int64
 r = fromIntegral a * fromIntegral b

golden :: Int32
golden = 1013904242 -- = round ((sqrt 5 - 1) * 2^32) :: Int32

--- cut ---

I had to rewrite hashString to work for ByteStrings - basically it's
just using different foldl'.

All best

Christopher Skrzętnicki
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] implementing python-style dictionary in Haskell

2008-11-18 Thread Tillmann Rendel

kenny lu wrote:

Internally, python dictionary is implemented using hash table.

My first attempt is to use Data.HashTable. However it was immediately
abandoned, as I realize the memory usage is unreasonably huge.


Why should a Haskell hash table need more memory then a Python hash 
table? I've heard that Data.HashTable is bad, so maybe writing a good 
one could be an option.



Python dictionary allows for update. e.g. the statement
d['a'] = 3
changes the value pointed by 'a' from 1 to 3.


I understand "changes" in the sense of an destructive update: The hash 
table stays the same (in terms of "object identity"), but the content of 
the memory cell storing the value of d['a'] is changed in place. That 
means that the old hash table, with d['a'] == 1, doesn't exist anymore.



-- the Dict type class
class Dict d k v | d -> k v where
empty :: d
insert :: k -> v -> d -> d
lookup :: k -> d -> Maybe v
update :: k -> v -> d -> d


But here you want to create a new d, e.g. a whole new hash table, which 
contains mostly the same content, but one memory cell different. The old 
hash table still exists in memory. That is a totally different operation 
which is quite likely to need more memory.


  Tillmann
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] implementing python-style dictionary in Haskell

2008-11-18 Thread kenny lu
Hi Bulat,



> 1. why you think that your code should be faster? pythob
> implementation is probably written in C ince it's one of its core data
> structures
>

I am not hoping that my code should be faster, but at least not as slow as
what it gets.
Basically I am looking for an implementation which is close to the one in
python.


>
> 2. you can solve IntMap problem by storing list of values with the
> same hash in tree's nodes
>

Yeah, that would probably speed up the building time of the dictionary.
However, storing the list of values in the tree nodes requires storing their
original keys,
so that it maintains the one-key-to-one-value semantics. This would takes up
more space
compared to the Trie approach.

Regards,
Kenny
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] implementing python-style dictionary in Haskell

2008-11-18 Thread Claus Reinke

main :: IO ()
main = do { content <- readFile "in.txt"
 ; let -- change this following type annotation
   -- to change different type of the dictionary
   -- dict :: DM.Map S.ByteString Int
   -- dict :: IM.IntMap Int
   dict :: Trie Int
   dict = fromList (map parse_a_line  (lines content))
..
   where  parse_a_line :: String -> (Key,Int)
  parse_a_line line = case words line of
 [key,val] -> (key,read val)
 _ -> error " parse error.  "


Maps tend to be strict in their keys, but not in their values.
You might be storing a lot of thunks with unparsed Strings 
instead of plain Int values.


Something like this might make a difference wrt memory usage:

  [key,val] -> ((,) key) $! (read val)

Hth,
Claus


Here is a comparison of memory usage

Map : 345 MB
IntMap : 146 MB
Trie : 282 MB
Python : 94 MB


___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] implementing python-style dictionary in Haskell

2008-11-18 Thread kenny lu
No I do not consider unicode in these implementations.

On Tue, Nov 18, 2008 at 6:43 PM, Jason Dusek <[EMAIL PROTECTED]> wrote:

>  Did you use Unicode in Python?
>
> --
> _jsn
>
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] implementing python-style dictionary in Haskell

2008-11-18 Thread Bulat Ziganshin
Hello kenny,

Tuesday, November 18, 2008, 1:37:36 PM, you wrote:

> The above number shows that my implementations of python style
> dictionary  are space/time in-efficient as compared to python. 

thanks, interesting code

1. why you think that your code should be faster? pythob
implementation is probably written in C ince it's one of its core data
structures

2. you can solve IntMap problem by storing list of values with the
same hash in tree's nodes


-- 
Best regards,
 Bulatmailto:[EMAIL PROTECTED]

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] implementing python-style dictionary in Haskell

2008-11-18 Thread Jason Dusek
  Did you use Unicode in Python?

--
_jsn
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe