Which version of GHC and which version of the Data.ByteString library? There was an inlining bug related to Data.Map /Data.IntMap performance fixed between the 6.8.x release and the current bytestring release.
In testing, Data.Map with strict bytestring keys matched the python (C implemented) dictionary, after I fixed the inlining for word lookups. You'll need to be using bytestring 0.9.1.x though. -- Don haskellmail: > Dear all, > > I am trying to implement the python-style dictionary in Haskell. > > Python dictionary is a data structure that maps one key to one value. > For instance, a python dictionary > d = {'a':1, 'b':2 } > maps key 'a' to 1, 'b' to 2. > Python dictionary allows for update. e.g. the statement > d['a'] = 3 > changes the value pointed by 'a' from 1 to 3. > > Internally, python dictionary is implemented using hash table. > > My first attempt is to use Data.HashTable. However it was immediately > abandoned, as I realize the memory usage is unreasonably huge. > > == SECOND ATTEMPT == > My second attempt is to use Data.Map > > {-# OPTIONS_GHC -fglasgow-exts #-} > module Main where > > import qualified Data.HashTable as HT > import qualified Data.IntMap as IM > import qualified Data.Map as DM > import qualified Data.ByteString.Char8 as S > import Data.Char > > -- the Dict type class > class Dict d k v | d -> k v where > empty :: d > insert :: k -> v -> d -> d > lookup :: k -> d -> Maybe v > update :: k -> v -> d -> d > > -- Let's use string as key > type Key = String > > -- insert key-value pairs into a dictionary > fromList :: Dict d k a => [(k,a)] -> d > fromList l = > foldl (\d (key,val) -> insert key val d) empty l > > instance Dict (DM.Map S.ByteString a) Key a where > empty = DM.empty > insert key val dm = > let packed_key = S.pack key > in DM.insert packed_key val dm > lookup key dm = > let packed_key = S.pack key > in DM.lookup packed_key dm > update key val dm = > let packed_key = S.pack key > in DM.update (\x -> Just val) packed_key dm > > Which kinda works, however since Map is implemented using a balanced tree, > therefore, > when as the dictionary grows, it takes a long time to insert new key-value > pair. > > == THIRD ATTEMPT == > My third attempt is to use Data.IntMap > > -- an implementation of Dict using IntMap > instance Dict (IM.IntMap a) Key a where > empty = IM.empty > insert key val im = > let int_key = fromIntegral (HT.hashString key) > in IM.insert int_key val im > lookup key im = > let int_key = fromIntegral (HT.hashString key) > in IM.lookup int_key im > update key val im = > let int_key = fromIntegral (HT.hashString key) > in IM.update (\x -> Just val) int_key im > > This implementation is faster than the Map approach, however this > implementation > can't handle collision well, two keys which are hashed into the same > integer will overwrite each other. > > == FOURTH ATTEMPT == > > My fourth implementation is to use Trie. The idea is to split a string (a > key) into > a list of 4-character chunks. Each chunk can be mapped into a 32-bit > integer without collision. > We then insert the value with this list of chunks into the Trie. > > -- an implementation of Dict using Trie > instance Dict (Trie a) Key a where > empty = emptyTrie > insert key val trie = > let key_chain = chain key > in insertTrie key_chain val trie > lookup key trie = > let key_chain = chain key > in lookupTrie key_chain trie > update key val trie = > let key_chain = chain key > in updateTrie key_chain val trie > > -- an auxillary function that "splits" string into small pieces, > -- 4 characters per piece, 4 chars = 32 bit > chain :: Key -> [Key] > chain k | length k > 4 = let (k',ks) = splitAt 4 k > in (k':chain ks) > | otherwise = [k] > > -- a collision-free hash function which turns four chars into Int32 > safehash :: [Char] -> Int > safehash cs | length cs > 4 = error "safehash failed." > | otherwise = > sum [ (ord c)*(256^i) | (c,i) <- zip cs [0..3] ] > > -- a trie datatype > data Trie a = Trie [a] (IM.IntMap (Trie a)) > > -- the empty trie > emptyTrie = Trie [] (IM.empty) > > -- insert value into the trie > insertTrie :: [String] -> a -> Trie a -> Trie a > insertTrie [] i (Trie is maps) = Trie (i:is) maps > insertTrie (word:words) i (Trie is maps) = > let key = safehash word > in case IM.lookup key maps of > { Just trie -> let trie' = insertTrie words i trie > maps' = IM.update (\x -> Just trie') key maps > in Trie is maps' > ; Nothing -> let trie = emptyTrie > trie' = insertTrie words i trie > maps' = IM.insert key trie' maps > in Trie is maps' > } > > -- lookup value from the trie > lookupTrie :: [String] -> Trie a -> Maybe a > lookupTrie [] (Trie vs _) = > case vs of > [] -> Nothing > (x:_) -> Just x > lookupTrie (word:words) (Trie is maps) = > let key = safehash word > in case IM.lookup key maps of > Just trie -> lookupTrie words trie > Nothing -> Nothing > > -- update the trie with the given value. > updateTrie :: [String] -> a -> Trie a -> Trie a > -- we only update the first value and leave the rest unchanged. > updateTrie [] y (Trie (x:xs) maps) = Trie (y:xs) maps > updateTrie (word:words) v (Trie is maps) = > let key = safehash word > in case IM.lookup key maps of > Just trie -> let trie' = updateTrie words v trie > maps' = IM.update (\x -> Just trie') key maps > in Trie is maps' > Nothing -> Trie is maps > > == BENCH MARK == > > I have a main function which builds a dictionary from a text file. > Each line of the file is a key-value pair separated by a space. > > e.g. > > key1 1 > key2 2 > ... > > main :: IO () > main = do { content <- readFile "in.txt" > ; let -- change this following type annotation > -- to change different type of the dictionary > -- dict :: DM.Map S.ByteString Int > -- dict :: IM.IntMap Int > dict :: Trie Int > dict = fromList (map parse_a_line (lines content)) > ; case Main.lookup "key256" dict of > { Just v -> putStrLn (show v) > ; Nothing -> putStrLn "Not found" > } > -- read a line here so that we can pause the program > -- and look at the memory usage. > ; v <- readLn > ; putStrLn v > } > where parse_a_line :: String -> (Key,Int) > parse_a_line line = case words line of > [key,val] -> (key,read val) > _ -> error " parse error. " > > I tested all three implementations by building a dictionary of size > 1000000. > The result shows that the Map and the Trie approaches handle collision > well, but > the IntMap approach does not. > > Here is a comparison of memory usage > > Map : 345 MB > IntMap : 146 MB > Trie : 282 MB > Python : 94 MB > > Here is a comparison of execution time (on an intel dual core 2.0G) > > Map: 26 sec > IntMap: 9 sec > Trie: 12 sec > Python: 2.24 sec > > The above number shows that my implementations of python style dictionary > are space/time in-efficient as compared to python. > > Can some one point out what's wrong with my implementations? > > I've attached my code in the tgz file. > > Cheers, > Kenny > _______________________________________________ > Haskell-Cafe mailing list > Haskell-Cafe@haskell.org > http://www.haskell.org/mailman/listinfo/haskell-cafe _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe