Dear all, I am trying to implement the python-style dictionary in Haskell.
Python dictionary is a data structure that maps one key to one value. For instance, a python dictionary d = {'a':1, 'b':2 } maps key 'a' to 1, 'b' to 2. Python dictionary allows for update. e.g. the statement d['a'] = 3 changes the value pointed by 'a' from 1 to 3. Internally, python dictionary is implemented using hash table. My first attempt is to use Data.HashTable. However it was immediately abandoned, as I realize the memory usage is unreasonably huge. == SECOND ATTEMPT == My second attempt is to use Data.Map {-# OPTIONS_GHC -fglasgow-exts #-} module Main where import qualified Data.HashTable as HT import qualified Data.IntMap as IM import qualified Data.Map as DM import qualified Data.ByteString.Char8 as S import Data.Char -- the Dict type class class Dict d k v | d -> k v where empty :: d insert :: k -> v -> d -> d lookup :: k -> d -> Maybe v update :: k -> v -> d -> d -- Let's use string as key type Key = String -- insert key-value pairs into a dictionary fromList :: Dict d k a => [(k,a)] -> d fromList l = foldl (\d (key,val) -> insert key val d) empty l instance Dict (DM.Map S.ByteString a) Key a where empty = DM.empty insert key val dm = let packed_key = S.pack key in DM.insert packed_key val dm lookup key dm = let packed_key = S.pack key in DM.lookup packed_key dm update key val dm = let packed_key = S.pack key in DM.update (\x -> Just val) packed_key dm Which kinda works, however since Map is implemented using a balanced tree, therefore, when as the dictionary grows, it takes a long time to insert new key-value pair. == THIRD ATTEMPT == My third attempt is to use Data.IntMap -- an implementation of Dict using IntMap instance Dict (IM.IntMap a) Key a where empty = IM.empty insert key val im = let int_key = fromIntegral (HT.hashString key) in IM.insert int_key val im lookup key im = let int_key = fromIntegral (HT.hashString key) in IM.lookup int_key im update key val im = let int_key = fromIntegral (HT.hashString key) in IM.update (\x -> Just val) int_key im This implementation is faster than the Map approach, however this implementation can't handle collision well, two keys which are hashed into the same integer will overwrite each other. == FOURTH ATTEMPT == My fourth implementation is to use Trie. The idea is to split a string (a key) into a list of 4-character chunks. Each chunk can be mapped into a 32-bit integer without collision. We then insert the value with this list of chunks into the Trie. -- an implementation of Dict using Trie instance Dict (Trie a) Key a where empty = emptyTrie insert key val trie = let key_chain = chain key in insertTrie key_chain val trie lookup key trie = let key_chain = chain key in lookupTrie key_chain trie update key val trie = let key_chain = chain key in updateTrie key_chain val trie -- an auxillary function that "splits" string into small pieces, -- 4 characters per piece, 4 chars = 32 bit chain :: Key -> [Key] chain k | length k > 4 = let (k',ks) = splitAt 4 k in (k':chain ks) | otherwise = [k] -- a collision-free hash function which turns four chars into Int32 safehash :: [Char] -> Int safehash cs | length cs > 4 = error "safehash failed." | otherwise = sum [ (ord c)*(256^i) | (c,i) <- zip cs [0..3] ] -- a trie datatype data Trie a = Trie [a] (IM.IntMap (Trie a)) -- the empty trie emptyTrie = Trie [] (IM.empty) -- insert value into the trie insertTrie :: [String] -> a -> Trie a -> Trie a insertTrie [] i (Trie is maps) = Trie (i:is) maps insertTrie (word:words) i (Trie is maps) = let key = safehash word in case IM.lookup key maps of { Just trie -> let trie' = insertTrie words i trie maps' = IM.update (\x -> Just trie') key maps in Trie is maps' ; Nothing -> let trie = emptyTrie trie' = insertTrie words i trie maps' = IM.insert key trie' maps in Trie is maps' } -- lookup value from the trie lookupTrie :: [String] -> Trie a -> Maybe a lookupTrie [] (Trie vs _) = case vs of [] -> Nothing (x:_) -> Just x lookupTrie (word:words) (Trie is maps) = let key = safehash word in case IM.lookup key maps of Just trie -> lookupTrie words trie Nothing -> Nothing -- update the trie with the given value. updateTrie :: [String] -> a -> Trie a -> Trie a -- we only update the first value and leave the rest unchanged. updateTrie [] y (Trie (x:xs) maps) = Trie (y:xs) maps updateTrie (word:words) v (Trie is maps) = let key = safehash word in case IM.lookup key maps of Just trie -> let trie' = updateTrie words v trie maps' = IM.update (\x -> Just trie') key maps in Trie is maps' Nothing -> Trie is maps == BENCH MARK == I have a main function which builds a dictionary from a text file. Each line of the file is a key-value pair separated by a space. e.g. key1 1 key2 2 ... main :: IO () main = do { content <- readFile "in.txt" ; let -- change this following type annotation -- to change different type of the dictionary -- dict :: DM.Map S.ByteString Int -- dict :: IM.IntMap Int dict :: Trie Int dict = fromList (map parse_a_line (lines content)) ; case Main.lookup "key256" dict of { Just v -> putStrLn (show v) ; Nothing -> putStrLn "Not found" } -- read a line here so that we can pause the program -- and look at the memory usage. ; v <- readLn ; putStrLn v } where parse_a_line :: String -> (Key,Int) parse_a_line line = case words line of [key,val] -> (key,read val) _ -> error " parse error. " I tested all three implementations by building a dictionary of size 1000000. The result shows that the Map and the Trie approaches handle collision well, but the IntMap approach does not. Here is a comparison of memory usage Map : 345 MB IntMap : 146 MB Trie : 282 MB Python : 94 MB Here is a comparison of execution time (on an intel dual core 2.0G) Map: 26 sec IntMap: 9 sec Trie: 12 sec Python: 2.24 sec The above number shows that my implementations of python style dictionary are space/time in-efficient as compared to python. Can some one point out what's wrong with my implementations? I've attached my code in the tgz file. Cheers, Kenny
dict.tgz
Description: GNU Zip compressed data
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe