Great. Assuming you're following the advice to use bytestrings, please install the newest bytestring library version, here,
http://hackage.haskell.org/cgi-bin/hackage-scripts/package/bytestring Data.Map or Data.IntMap with bytestrings should be quite efficient. (or use a trie if more precision is needed) -- Don haskellmail: > Dear Don, > I am using GHC 6.8.1 > > Regards, > Kenny > > On Tue, Nov 18, 2008 at 11:33 PM, Don Stewart <[EMAIL PROTECTED]> wrote: > > Which version of GHC and which version of the Data.ByteString library? > There was an inlining bug related to Data.Map /Data.IntMap performance > fixed between the 6.8.x release and the current bytestring release. > > In testing, Data.Map with strict bytestring keys matched the python (C > implemented) dictionary, after I fixed the inlining for word lookups. > > You'll need to be using bytestring 0.9.1.x though. > > -- Don > > haskellmail: > > Dear all, > > > > I am trying to implement the python-style dictionary in Haskell. > > > > Python dictionary is a data structure that maps one key to one > value. > > For instance, a python dictionary > > d = {'a':1, 'b':2 } > > maps key 'a' to 1, 'b' to 2. > > Python dictionary allows for update. e.g. the statement > > d['a'] = 3 > > changes the value pointed by 'a' from 1 to 3. > > > > Internally, python dictionary is implemented using hash table. > > > > My first attempt is to use Data.HashTable. However it was > immediately > > abandoned, as I realize the memory usage is unreasonably huge. > > > > == SECOND ATTEMPT == > > My second attempt is to use Data.Map > > > > {-# OPTIONS_GHC -fglasgow-exts #-} > > module Main where > > > > import qualified Data.HashTable as HT > > import qualified Data.IntMap as IM > > import qualified Data.Map as DM > > import qualified Data.ByteString.Char8 as S > > import Data.Char > > > > -- the Dict type class > > class Dict d k v | d -> k v where > > empty :: d > > insert :: k -> v -> d -> d > > lookup :: k -> d -> Maybe v > > update :: k -> v -> d -> d > > > > -- Let's use string as key > > type Key = String > > > > -- insert key-value pairs into a dictionary > > fromList :: Dict d k a => [(k,a)] -> d > > fromList l = > > foldl (\d (key,val) -> insert key val d) empty l > > > > instance Dict (DM.Map S.ByteString a) Key a where > > empty = DM.empty > > insert key val dm = > > let packed_key = S.pack key > > in DM.insert packed_key val dm > > lookup key dm = > > let packed_key = S.pack key > > in DM.lookup packed_key dm > > update key val dm = > > let packed_key = S.pack key > > in DM.update (\x -> Just val) packed_key dm > > > > Which kinda works, however since Map is implemented using a > balanced tree, > > therefore, > > when as the dictionary grows, it takes a long time to insert new > key-value > > pair. > > > > == THIRD ATTEMPT == > > My third attempt is to use Data.IntMap > > > > -- an implementation of Dict using IntMap > > instance Dict (IM.IntMap a) Key a where > > empty = IM.empty > > insert key val im = > > let int_key = fromIntegral (HT.hashString key) > > in IM.insert int_key val im > > lookup key im = > > let int_key = fromIntegral (HT.hashString key) > > in IM.lookup int_key im > > update key val im = > > let int_key = fromIntegral (HT.hashString key) > > in IM.update (\x -> Just val) int_key im > > > > This implementation is faster than the Map approach, however this > > implementation > > can't handle collision well, two keys which are hashed into the > same > > integer will overwrite each other. > > > > == FOURTH ATTEMPT == > > > > My fourth implementation is to use Trie. The idea is to split a > string (a > > key) into > > a list of 4-character chunks. Each chunk can be mapped into a > 32-bit > > integer without collision. > > We then insert the value with this list of chunks into the Trie. > > > > -- an implementation of Dict using Trie > > instance Dict (Trie a) Key a where > > empty = emptyTrie > > insert key val trie = > > let key_chain = chain key > > in insertTrie key_chain val trie > > lookup key trie = > > let key_chain = chain key > > in lookupTrie key_chain trie > > update key val trie = > > let key_chain = chain key > > in updateTrie key_chain val trie > > > > -- an auxillary function that "splits" string into small pieces, > > -- 4 characters per piece, 4 chars = 32 bit > > chain :: Key -> [Key] > > chain k | length k > 4 = let (k',ks) = splitAt 4 k > > in (k':chain ks) > > | otherwise = [k] > > > > -- a collision-free hash function which turns four chars into Int32 > > safehash :: [Char] -> Int > > safehash cs | length cs > 4 = error "safehash failed." > > | otherwise = > > sum [ (ord c)*(256^i) | (c,i) <- zip cs [0..3] ] > > > > -- a trie datatype > > data Trie a = Trie [a] (IM.IntMap (Trie a)) > > > > -- the empty trie > > emptyTrie = Trie [] (IM.empty) > > > > -- insert value into the trie > > insertTrie :: [String] -> a -> Trie a -> Trie a > > insertTrie [] i (Trie is maps) = Trie (i:is) maps > > insertTrie (word:words) i (Trie is maps) = > > let key = safehash word > > in case IM.lookup key maps of > > { Just trie -> let trie' = insertTrie words i trie > > maps' = IM.update (\x -> Just trie') key maps > > in Trie is maps' > > ; Nothing -> let trie = emptyTrie > > trie' = insertTrie words i trie > > maps' = IM.insert key trie' maps > > in Trie is maps' > > } > > > > -- lookup value from the trie > > lookupTrie :: [String] -> Trie a -> Maybe a > > lookupTrie [] (Trie vs _) = > > case vs of > > [] -> Nothing > > (x:_) -> Just x > > lookupTrie (word:words) (Trie is maps) = > > let key = safehash word > > in case IM.lookup key maps of > > Just trie -> lookupTrie words trie > > Nothing -> Nothing > > > > -- update the trie with the given value. > > updateTrie :: [String] -> a -> Trie a -> Trie a > > -- we only update the first value and leave the rest unchanged. > > updateTrie [] y (Trie (x:xs) maps) = Trie (y:xs) maps > > updateTrie (word:words) v (Trie is maps) = > > let key = safehash word > > in case IM.lookup key maps of > > Just trie -> let trie' = updateTrie words v trie > > maps' = IM.update (\x -> Just trie') key maps > > in Trie is maps' > > Nothing -> Trie is maps > > > > == BENCH MARK == > > > > I have a main function which builds a dictionary from a text file. > > Each line of the file is a key-value pair separated by a space. > > > > e.g. > > > > key1 1 > > key2 2 > > ... > > > > main :: IO () > > main = do { content <- readFile "in.txt" > > ; let -- change this following type annotation > > -- to change different type of the dictionary > > -- dict :: DM.Map S.ByteString Int > > -- dict :: IM.IntMap Int > > dict :: Trie Int > > dict = fromList (map parse_a_line (lines content)) > > ; case Main.lookup "key256" dict of > > { Just v -> putStrLn (show v) > > ; Nothing -> putStrLn "Not found" > > } > > -- read a line here so that we can pause the program > > -- and look at the memory usage. > > ; v <- readLn > > ; putStrLn v > > } > > where parse_a_line :: String -> (Key,Int) > > parse_a_line line = case words line of > > [key,val] -> (key,read val) > > _ -> error " parse error. " > > > > I tested all three implementations by building a dictionary of size > > 1000000. > > The result shows that the Map and the Trie approaches handle > collision > > well, but > > the IntMap approach does not. > > > > Here is a comparison of memory usage > > > > Map : 345 MB > > IntMap : 146 MB > > Trie : 282 MB > > Python : 94 MB > > > > Here is a comparison of execution time (on an intel dual core 2.0G) > > > > Map: 26 sec > > IntMap: 9 sec > > Trie: 12 sec > > Python: 2.24 sec > > > > The above number shows that my implementations of python style > dictionary > > are space/time in-efficient as compared to python. > > > > Can some one point out what's wrong with my implementations? > > > > I've attached my code in the tgz file. > > > > Cheers, > > Kenny > > > _______________________________________________ > > Haskell-Cafe mailing list > > [EMAIL PROTECTED] > > [3]http://www.haskell.org/mailman/listinfo/haskell-cafe > > References > > Visible links > 1. mailto:[EMAIL PROTECTED] > 2. mailto:Haskell-Cafe@haskell.org > 3. http://www.haskell.org/mailman/listinfo/haskell-cafe _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe