Re: Pickling a finite map (Binary + zlib) [was: [Haskell-cafe] Data.Binary poor read performance]
Felipe Lessa wrote:
> On Tue, Feb 24, 2009 at 4:59 AM, Don Stewart wrote:
> > Looks like the Map reading/showing via association lists could do with
> > further work.
> >
> > Anyone want to dig around in the Map instance? (There's also some patches
> > for an alternative lazy Map serialisation, if people are keen to load
> > maps -- happstack devs?).
>
> From binary-0.5:
>
>     instance (Ord k, Binary k, Binary e) => Binary (Map.Map k e) where
>         put m = put (Map.size m) >> mapM_ put (Map.toAscList m)
>         get   = liftM Map.fromDistinctAscList get
>
>     instance Binary a => Binary [a] where
>         put l = put (length l) >> mapM_ put l
>         get   = do n <- get :: Get Int
>                    replicateM n get
>
> Can't get better, I think.

We can improve it slightly (about 20% runtime in dons' example [*]):

    instance (Ord k, Binary k, Binary e) => Binary (Map.Map k e) where
        get = liftM (Map.fromDistinctAscList . map strictValue) get
          where strictValue (k,v) = (v `seq` k, v)

The point is that Data.Map.Map is strict in the keys, but not in the values
of the map. For deserialisation this means the values will be thunks that
hang on to the Data.Binary buffer.

> Now, from containers-0.2.0.0:
>
>     fromDistinctAscList :: [(k,a)] -> Map k a
>     fromDistinctAscList xs
>       = build const (length xs) xs
>       where
>         -- 1) use continuations so that we use heap space instead of stack space.
>         -- 2) special case for n==5 to build bushier trees.
>         build c 0 xs' = c Tip xs'
>         build c 5 xs' = case xs' of
>           ((k1,x1):(k2,x2):(k3,x3):(k4,x4):(k5,x5):xx)
>               -> c (bin k4 x4 (bin k2 x2 (singleton k1 x1) (singleton k3 x3))
>                               (singleton k5 x5)) xx
>           _ -> error "fromDistinctAscList build"
>         build c n xs' = seq nr $ build (buildR nr c) nl xs'
>           where
>             nl = n `div` 2
>             nr = n - nl - 1
>
>         buildR n c l ((k,x):ys) = build (buildB l k x c) n ys
>         buildR _ _ _ []         = error "fromDistinctAscList buildR []"
>         buildB l k x c r zs     = c (bin k x l r) zs
>
> The builds seem fine, but we spot a (length xs) at the beginning.
> Maybe this is the culprit? We already know the size of the map (it was
> serialized), so it is just a matter of exporting
>
>     fromDistinctAscSizedList :: Int -> [(k, a)] -> Map k a

Eliminating the 'length' call helps, too, improving runtime by about
another 5%. The result is still a factor of 1.7 slower than reading the
list of key/value pairs.

Bertram

[*] Notes on timings:
1) I used `rnf` for all timings, as in my previous mail.
2) I noticed that in my previous measurements, the GC time for the
   Data.Map tests was excessively large (70% and more), so I used
   +RTS -H32M this time. This resulted in a significant runtime
   improvement of about 30%.
3) Do your own measurements! Some code to play with is available here:
   http://int-e.home.tlink.de/haskell/MapTest.hs
   http://int-e.home.tlink.de/haskell/Map.hs
Re: Pickling a finite map (Binary + zlib) [was: [Haskell-cafe] Data.Binary poor read performance]
Hello,

On Tue, Feb 24, 2009 at 2:36 AM, Don Stewart wrote:
> This idea was the motivation for the new Seq instance, which uses
> internals to build quickly.
>
> Encoding to disk, the dictionary,
>
>     $ time ./binary /usr/share/dict/cracklib-small
>     "done"
>     ./binary /usr/share/dict/cracklib-small  0.07s user 0.01s system 94% cpu 0.088 total
>
> Decoding,
>
>     $ time ./binary dict.gz
>     52848
>     "done"
>     ./binary dict.gz  0.07s user 0.01s system 97% cpu 0.079 total
>
>     instance (Binary e) => Binary (Seq.Seq e) where
>         put s = put (Seq.length s) >> Fold.mapM_ put s
>         get = do n <- get :: Get Int
>                  rep Seq.empty n get
>           where rep xs 0 _ = return $! xs
>                 rep xs n g = xs `seq` n `seq` do
>                                x <- g
>                                rep (xs Seq.|> x) (n-1) g
>
> Just a lot better. :)
>
> So ... Data.Map, we're looking at you!

Indeed, that was the motivation for writing the patch for Seq. [Ross,
thank you again for the help.] I had performance issues with lists, but
noticed that switching to Sequence wasn't helping at all. This new
definition takes advantage of the features that Seq has and List doesn't.

Regarding Map, I like Felipe's idea of having a separate Internal. I know
that Lemmih has a few other implementations of Map (compactMap,
BerkeleyDB). If I remember correctly, he made BerkeleyDB an instance of
Binary. That can probably give us some insight too.

Paulo
Re: Pickling a finite map (Binary + zlib) [was: [Haskell-cafe] Data.Binary poor read performance]
Don Stewart wrote:
> dons:
[...]
> Just serialising straight lists of pairs,
[...]
> And reading them back in,
>
>     main = do
>         [f] <- getArgs
>         m <- decode `fmap` L.readFile f
>         print (length (m :: [(B.ByteString,Int)]))
>         print "done"

Well, you don't actually read the whole list here, just its length:

    instance Binary a => Binary [a] where
        put l = put (length l) >> mapM_ put l
        get   = do n <- get :: Get Int
                   replicateM n get

To demonstrate, this works:

    main = do
        L.writeFile "v" (encode (42 :: Int))
        m <- decode `fmap` L.readFile "v"
        print (length (m :: [Int]))

So instead, we should try something like this:

    import Control.Parallel.Strategies

    instance NFData B.ByteString where
        rnf bs = bs `seq` ()

    main = do
        [f] <- getArgs
        m <- decode `fmap` L.readFile f
        print (rnf m `seq` length (m :: [(B.ByteString,Int)]))

My timings:

    reading list, without rnf:  0.04s
    with rnf:                   0.16s
    reading a Data.Map:         0.52s
    with rnf:                   0.62s

Bertram
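For the Data.Map rows in those timings, the decoded Map also needs to be
forced. A minimal sketch of an NFData instance, assuming the same NFData
class from Control.Parallel.Strategies imported above, and assuming that
module does not already provide a Map instance (this sketch is illustrative,
not code from the thread):

    -- Force a decoded Map by forcing its association list; Data.Map is
    -- already strict in the keys, so this mainly forces the values and
    -- the spine of the tree.
    instance (NFData k, NFData a) => NFData (M.Map k a) where
        rnf = rnf . M.toList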
Re: Re[2]: Pickling a finite map (Binary + zlib) [was: [Haskell-cafe] Data.Binary poor read performance]
> btw, i always thought that there should be a way to overcome any export
> lists and go directly to module internals. limiting export is the way to
> protect the programmer from errors, not a security feature, and it should
> be left to the programmer to decide when he doesn't need it. compilers
> should just be able to check whether some module/library is imported
> abstractly or with internals too. this will render useless all those
> .Internals modules that now we need to add everywhere

You're not alone!-) This has been called "Open Implementation" (OI, a
pre-cursor of aspect-oriented programming):

    http://www2.parc.com/csl/groups/sda/projects/oi/

They argue for an explicit auxiliary interface instead of full access to
module internals. Since these same folks worked on meta-object protocols
as well (at the meta-level, the boundaries can be bypassed entirely), that
suggestion probably comes from experience. They do allow for the auxiliary
interface to use meta-programming style features, though that depends on
the language in question (in Haskell, type classes or type functions might
be used instead, but rewrite rules and Template Haskell are also available).

Open Implementation Design Guidelines

    http://www2.parc.com/csl/groups/sda/publications/papers/Kiczales-ICSE97/

is a short paper discussing a Set API/Open Implementation example.

> I agree in principle, but GHC also uses that knowledge to optimize the
> code better - if a function is exported it has to produce the most
> polymorphic possible code for its type, if it isn't it can specialize
> better... that sort of thing.

That refers to optimization in the provider module. As the OI people
argued, optimization in the client modules also needs to be taken into
account. If the default one-size-fits-all-uses implementation behind the
default API doesn't work well enough, there'll be a proliferation of
library variants. If there is a way to fine-tune the implementation via an
auxiliary API, much of that can be avoided.

In other words, instead of half a dozen Maps and a dozen Array variants,
there'd just be one of each, but with auxiliary interfaces that would allow
client code to choose and tune the most suitable implementations behind the
main interfaces. It isn't magic, though: if, say, even the auxiliary API
doesn't allow you to say "thanks, but I know the length already", you're
still stuck. But it does help to think about designing two APIs (the
default functionality one and the auxiliary fine-tuning one) instead of
offering only the choice of one API or full access to internals.

Claus
Export restrictions :) Re[4]: Pickling a finite map (Binary + zlib) [was: [Haskell-cafe] Data.Binary poor read performance]
Hello Svein,

Tuesday, February 24, 2009, 3:47:44 PM, you wrote:

>> btw, i always thought that there should be a way to overcome any export
>> lists and go directly to module internals. limiting export is the way to
>> protect the programmer from errors, not a security feature, and it
>> should be left to the programmer to decide when he doesn't need it.
>> compilers should just be able to check whether some module/library is
>> imported abstractly or with internals too. this will render useless all
>> those .Internals modules that now we need to add everywhere
>>
> I agree in principle, but GHC also uses that knowledge to optimize the
> code better - if a function is exported it has to produce the most
> polymorphic possible code for its type, if it isn't it can specialize
> better... that sort of thing.
> So it's not for security purposes, it's for technical reasons; you
> can't override the export list externally because the information
> you'd need to use the functions simply doesn't exist.

well, the obvious answer is that ghc should optimize according to the
export specs AND add the original function definition to the .o file

-- 
Best regards,
 Bulat                          mailto:bulat.zigans...@gmail.com
Re: Re[2]: Pickling a finite map (Binary + zlib) [was: [Haskell-cafe] Data.Binary poor read performance]
2009/2/24 Bulat Ziganshin:
> Hello Felipe,
>
> Tuesday, February 24, 2009, 11:24:19 AM, you wrote:
>
>> Too bad 'Map' is exported as an abstract data type and it's not
>> straightforward to test this conjecture. Any ideas?
>
> just make a copy of its implementation to test
>
> btw, i always thought that there should be a way to overcome any export
> lists and go directly to module internals. limiting export is the way to
> protect the programmer from errors, not a security feature, and it
> should be left to the programmer to decide when he doesn't need it.
> compilers should just be able to check whether some module/library is
> imported abstractly or with internals too. this will render useless all
> those .Internals modules that now we need to add everywhere
>
I agree in principle, but GHC also uses that knowledge to optimize the
code better - if a function is exported it has to produce the most
polymorphic possible code for its type, if it isn't it can specialize
better... that sort of thing.

So it's not for security purposes, it's for technical reasons; you can't
override the export list externally because the information you'd need to
use the functions simply doesn't exist.
Re[2]: Pickling a finite map (Binary + zlib) [was: [Haskell-cafe] Data.Binary poor read performance]
Hello Felipe,

Tuesday, February 24, 2009, 11:24:19 AM, you wrote:

> Too bad 'Map' is exported as an abstract data type and it's not
> straightforward to test this conjecture. Any ideas?

just make a copy of its implementation to test

btw, i always thought that there should be a way to overcome any export
lists and go directly to module internals. limiting export is the way to
protect the programmer from errors, not a security feature, and it should
be left to the programmer to decide when he doesn't need it. compilers
should just be able to check whether some module/library is imported
abstractly or with internals too. this will render useless all those
.Internals modules that now we need to add everywhere

-- 
Best regards,
 Bulat                          mailto:bulat.zigans...@gmail.com
Re: Pickling a finite map (Binary + zlib) [was: [Haskell-cafe] Data.Binary poor read performance]
On Tue, Feb 24, 2009 at 5:36 AM, Don Stewart wrote:
> This idea was the motivation for the new Seq instance, which uses
> internals to build quickly.

The problem is that fromDistinctAscList is the best we can do right now.
We don't have something like (|>) that runs in O(1) time, and trying to
insert each element would give O(n log n) instead of O(n). In fact, we
don't even know if length is the culprit, although I'm highly suspicious
of it.

Maybe there should be a Data.Map.Internal, like Data.ByteString.Internal,
so we can mess with the datatypes directly but without strong API
compatibility guarantees?

-- 
Felipe.
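To make the asymptotic point above concrete, here is a minimal sketch of
the insert-based decoder that the message rules out. The name
getMapByInsert is made up for illustration; it is not part of binary or
containers:

    -- Decode n pairs and build the Map by repeated insertion: O(n log n),
    -- versus the O(n) that fromDistinctAscList gets from sorted input.
    getMapByInsert :: (Ord k, Binary k, Binary e) => Get (Map.Map k e)
    getMapByInsert = do
        n <- get :: Get Int
        let go m 0 = return m
            go m i = do (k, v) <- get
                        let m' = Map.insert k v m
                        m' `seq` go m' (i - 1)
        go Map.empty n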
Re: Pickling a finite map (Binary + zlib) [was: [Haskell-cafe] Data.Binary poor read performance]
felipe.lessa:
> On Tue, Feb 24, 2009 at 4:59 AM, Don Stewart wrote:
> > Looks like the Map reading/showing via association lists could do with
> > further work.
> >
> > Anyone want to dig around in the Map instance? (There's also some
> > patches for an alternative lazy Map serialisation, if people are keen
> > to load maps -- happstack devs?).
>
> From binary-0.5:
>
>     instance (Ord k, Binary k, Binary e) => Binary (Map.Map k e) where
>         put m = put (Map.size m) >> mapM_ put (Map.toAscList m)
>         get   = liftM Map.fromDistinctAscList get
>
>     instance Binary a => Binary [a] where
>         put l = put (length l) >> mapM_ put l
>         get   = do n <- get :: Get Int
>                    replicateM n get
>
> Can't get better, I think. Now, from containers-0.2.0.0:
>
>     fromDistinctAscList :: [(k,a)] -> Map k a
>     fromDistinctAscList xs
>       = build const (length xs) xs
>       where
>         -- 1) use continuations so that we use heap space instead of stack space.
>         -- 2) special case for n==5 to build bushier trees.
>         build c 0 xs' = c Tip xs'
>         build c 5 xs' = case xs' of
>           ((k1,x1):(k2,x2):(k3,x3):(k4,x4):(k5,x5):xx)
>               -> c (bin k4 x4 (bin k2 x2 (singleton k1 x1) (singleton k3 x3))
>                               (singleton k5 x5)) xx
>           _ -> error "fromDistinctAscList build"
>         build c n xs' = seq nr $ build (buildR nr c) nl xs'
>           where
>             nl = n `div` 2
>             nr = n - nl - 1
>
>         buildR n c l ((k,x):ys) = build (buildB l k x c) n ys
>         buildR _ _ _ []         = error "fromDistinctAscList buildR []"
>         buildB l k x c r zs     = c (bin k x l r) zs
>
> The builds seem fine, but we spot a (length xs) at the beginning.
> Maybe this is the culprit? We already know the size of the map (it was
> serialized), so it is just a matter of exporting
>
>     fromDistinctAscSizedList :: Int -> [(k, a)] -> Map k a
>
> Too bad 'Map' is exported as an abstract data type and it's not
> straightforward to test this conjecture. Any ideas?

This idea was the motivation for the new Seq instance, which uses
internals to build quickly.

Encoding to disk, the dictionary,

    $ time ./binary /usr/share/dict/cracklib-small
    "done"
    ./binary /usr/share/dict/cracklib-small  0.07s user 0.01s system 94% cpu 0.088 total

Decoding,

    $ time ./binary dict.gz
    52848
    "done"
    ./binary dict.gz  0.07s user 0.01s system 97% cpu 0.079 total

    instance (Binary e) => Binary (Seq.Seq e) where
        put s = put (Seq.length s) >> Fold.mapM_ put s
        get = do n <- get :: Get Int
                 rep Seq.empty n get
          where rep xs 0 _ = return $! xs
                rep xs n g = xs `seq` n `seq` do
                               x <- g
                               rep (xs Seq.|> x) (n-1) g

Just a lot better. :)

So ... Data.Map, we're looking at you!

-- Don
Re: Pickling a finite map (Binary + zlib) [was: [Haskell-cafe] Data.Binary poor read performance]
On Tue, Feb 24, 2009 at 4:59 AM, Don Stewart wrote:
> Looks like the Map reading/showing via association lists could do with
> further work.
>
> Anyone want to dig around in the Map instance? (There's also some patches
> for an alternative lazy Map serialisation, if people are keen to load
> maps -- happstack devs?).

From binary-0.5:

    instance (Ord k, Binary k, Binary e) => Binary (Map.Map k e) where
        put m = put (Map.size m) >> mapM_ put (Map.toAscList m)
        get   = liftM Map.fromDistinctAscList get

    instance Binary a => Binary [a] where
        put l = put (length l) >> mapM_ put l
        get   = do n <- get :: Get Int
                   replicateM n get

Can't get better, I think. Now, from containers-0.2.0.0:

    fromDistinctAscList :: [(k,a)] -> Map k a
    fromDistinctAscList xs
      = build const (length xs) xs
      where
        -- 1) use continuations so that we use heap space instead of stack space.
        -- 2) special case for n==5 to build bushier trees.
        build c 0 xs' = c Tip xs'
        build c 5 xs' = case xs' of
          ((k1,x1):(k2,x2):(k3,x3):(k4,x4):(k5,x5):xx)
              -> c (bin k4 x4 (bin k2 x2 (singleton k1 x1) (singleton k3 x3))
                              (singleton k5 x5)) xx
          _ -> error "fromDistinctAscList build"
        build c n xs' = seq nr $ build (buildR nr c) nl xs'
          where
            nl = n `div` 2
            nr = n - nl - 1

        buildR n c l ((k,x):ys) = build (buildB l k x c) n ys
        buildR _ _ _ []         = error "fromDistinctAscList buildR []"
        buildB l k x c r zs     = c (bin k x l r) zs

The builds seem fine, but we spot a (length xs) at the beginning.
Maybe this is the culprit? We already know the size of the map (it was
serialized), so it is just a matter of exporting

    fromDistinctAscSizedList :: Int -> [(k, a)] -> Map k a

Too bad 'Map' is exported as an abstract data type and it's not
straightforward to test this conjecture. Any ideas?

-- 
Felipe.
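To make the idea concrete, a rough sketch of how the Binary instance could
use such an export. fromDistinctAscSizedList is hypothetical here; it does
not exist in containers, so this only works if the proposed function were
added (or the Map internals copied, as Bulat suggests elsewhere in the
thread):

    -- Reuse the serialised size prefix for tree building instead of
    -- recomputing it with 'length'; relies on the hypothetical
    -- Map.fromDistinctAscSizedList export proposed above.
    instance (Ord k, Binary k, Binary e) => Binary (Map.Map k e) where
        put m = put (Map.size m) >> mapM_ put (Map.toAscList m)
        get   = do n  <- get :: Get Int
                   xs <- replicateM n get
                   return (Map.fromDistinctAscSizedList n xs)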
Re: Pickling a finite map (Binary + zlib) [was: [Haskell-cafe] Data.Binary poor read performance]
dons:
> wren:
> > Neil Mitchell wrote:
> >> 2) The storage for String seems to be raw strings, which is nice.
> >> Would I get a substantial speedup by moving to bytestrings instead of
> >> strings? If I hashed the strings and stored common ones in a hash
> >> table is it likely to be a big win?
> >
> > Bytestrings should help. The big wins in this application are likely to
> > be cache issues, though the improved memory/GC overhead is nice too.
>
> Here's a quick demo using Data.Binary directly.
> Now, let's read back in and decode it back to a Map
>
>     main = do
>         [f] <- getArgs
>         m <- decodeFile f
>         print (M.size (m :: M.Map B.ByteString Int))
>         print "done"
>
> Easy enough:
>
>     $ time ./A dict +RTS -K20M
>     52848
>     "done"
>     ./A dict +RTS -K20M  1.51s user 0.06s system 99% cpu 1.582 total
>
> Compressed dictionary is much smaller. Let's load it back in and
> unpickle it:
>
>     main = do
>         [f] <- getArgs
>         m <- (decode . decompress) `fmap` L.readFile f
>         print (M.size (m :: M.Map B.ByteString Int))
>         print "done"
>
> Also cute. But how does it run:
>
>     $ time ./A dict.gz
>     52848
>     "done"
>     ./A dict.gz  0.28s user 0.03s system 98% cpu 0.310 total
>
> Interesting. So extracting the Map from a compressed bytestring in
> memory is a fair bit faster than loading it directly, uncompressed
> from disk.

Note that the difference, as Duncan and Bulat pointed out, is a bit
surprising. Perhaps the Map instance is a bit weird? We already know that
bytestring IO is fine.

Just serialising straight lists of pairs,

    import Data.Binary
    import Data.List
    import qualified Data.ByteString.Char8 as B
    import qualified Data.ByteString.Lazy as L
    import System.Environment
    import qualified Data.Map as M
    import Codec.Compression.GZip

    main = do
        [f] <- getArgs
        s <- B.readFile f
        let m = [ (head n, length n) | n <- (group . B.lines $ s) ]
        L.writeFile "dict.gz" . encode $ m
        print "done"

    $ time ./binary /usr/share/dict/cracklib-small
    "done"
    ./binary /usr/share/dict/cracklib-small  0.13s user 0.04s system 99% cpu 0.173 total

    $ du -hs dict
    1.3M    dict

And reading them back in,

    main = do
        [f] <- getArgs
        m <- decode `fmap` L.readFile f
        print (length (m :: [(B.ByteString,Int)]))
        print "done"

    $ time ./binary dict
    52848
    "done"
    ./binary dict  0.04s user 0.01s system 99% cpu 0.047 total

Is fast. So there's some complication in the Map serialisation.

Adding in zlib, to check,

    main = do
        [f] <- getArgs
        s <- B.readFile f
        let m = [ (head n, length n) | n <- (group . B.lines $ s) ]
        L.writeFile "dict.gz" . compress . encode $ m
        print "done"

    $ time ./binary /usr/share/dict/cracklib-small
    "done"
    ./binary /usr/share/dict/cracklib-small  0.25s user 0.03s system 100% cpu 0.277 total

Compression takes longer, as expected, and reading it back in,

    main = do
        [f] <- getArgs
        m <- (decode . decompress) `fmap` L.readFile f
        print (length (m :: [(B.ByteString,Int)]))
        print "done"

    $ time ./binary dict.gz
    52848
    "done"
    ./binary dict.gz  0.03s user 0.01s system 98% cpu 0.040 total

About the same.

Looks like the Map reading/showing via association lists could do with
further work.

Anyone want to dig around in the Map instance? (There's also some patches
for an alternative lazy Map serialisation, if people are keen to load
maps -- happstack devs?).

-- Don
Re[2]: Pickling a finite map (Binary + zlib) [was: [Haskell-cafe] Data.Binary poor read performance]
Hello Duncan,

Tuesday, February 24, 2009, 5:13:05 AM, you wrote:

> That's actually rather surprising. The system time is negligible and the

yes, this looks highly suspicious. we use 1.5 seconds for a 1.3 mb file,
and 0.3 seconds for a 0.3 mb file, so it looks like readFile monopolizes
the cpu here, being able to read only 1 mb per second. decoding into a
list of strings (instead of building the map) may help to check this
assumption

-- 
Best regards,
 Bulat                          mailto:bulat.zigans...@gmail.com
Re: Pickling a finite map (Binary + zlib) [was: [Haskell-cafe] Data.Binary poor read performance]
On Mon, 2009-02-23 at 17:03 -0800, Don Stewart wrote:
> Here's a quick demo using Data.Binary directly.
[...]
>     $ time ./A dict +RTS -K20M
>     52848
>     "done"
>     ./A dict +RTS -K20M  1.51s user 0.06s system 99% cpu 1.582 total
>
> Ok. So 1.5s to decode a 1.3M Map. There may be better ways to build the
> Map since we know the input will be sorted, but the Data.Binary instance
> can't do that.
[...]
>     $ time ./A dict.gz
>     52848
>     "done"
>     ./A dict.gz  0.28s user 0.03s system 98% cpu 0.310 total
>
> Interesting. So extracting the Map from a compressed bytestring in
> memory is a fair bit faster than loading it directly, uncompressed
> from disk.

That's actually rather surprising. The system time is negligible and the
difference between total and user time does not leave much for time
wasted doing i/o. So that's a real difference in user time.

So what is going on? We're doing the same amount of binary decoding in
each, right? We're also allocating the same number of buffers, in fact
slightly more in the case that uses compression. The time taken to cat a
meg through a Handle using lazy bytestring is nothing. So where is all
that time going?

Duncan
Pickling a finite map (Binary + zlib) [was: [Haskell-cafe] Data.Binary poor read performance]
wren:
> Neil Mitchell wrote:
>> 2) The storage for String seems to be raw strings, which is nice.
>> Would I get a substantial speedup by moving to bytestrings instead of
>> strings? If I hashed the strings and stored common ones in a hash
>> table is it likely to be a big win?
>
> Bytestrings should help. The big wins in this application are likely to
> be cache issues, though the improved memory/GC overhead is nice too.

Here's a quick demo using Data.Binary directly.

First, let's read in the dictionary file, and build a big, worst-case
finite map of words to their occurrence (always 1).

    import Data.Binary
    import Data.List
    import qualified Data.ByteString.Char8 as B
    import System.Environment
    import qualified Data.Map as M

    main = do
        [f] <- getArgs
        s <- B.readFile f
        let m = M.fromList [ (head n, length n) | n <- (group . B.lines $ s) ]
        encodeFile "dict" m
        print "done"

So that writes a "dict" file with a binary encoded Map ByteString Int.
Using ghc -O2 --make for everything.

    $ time ./A /usr/share/dict/cracklib-small
    "done"
    ./A /usr/share/dict/cracklib-small  0.28s user 0.03s system 94% cpu 0.331 total

Yields a dictionary file Map:

    $ du -hs dict
    1.3M    dict

Now, let's read back in and decode it back to a Map

    main = do
        [f] <- getArgs
        m <- decodeFile f
        print (M.size (m :: M.Map B.ByteString Int))
        print "done"

Easy enough:

    $ time ./A dict +RTS -K20M
    52848
    "done"
    ./A dict +RTS -K20M  1.51s user 0.06s system 99% cpu 1.582 total

Ok. So 1.5s to decode a 1.3M Map. There may be better ways to build the
Map since we know the input will be sorted, but the Data.Binary instance
can't do that.

Since decode/encode are nice pure functions on lazy bytestrings, we can
try a trick of compressing/decompressing the dictionary in memory.

Compressing the dictionary:

    import Data.Binary
    import Data.List
    import qualified Data.ByteString.Char8 as B
    import qualified Data.ByteString.Lazy as L
    import System.Environment
    import qualified Data.Map as M
    import Codec.Compression.GZip

    main = do
        [f] <- getArgs
        s <- B.readFile f
        let m = M.fromList [ (head n, length n) | n <- (group . B.lines $ s) ]
        L.writeFile "dict.gz" . compress . encode $ m
        print "done"

Pretty cool, imo, is "compress . encode":

    $ time ./A /usr/share/dict/cracklib-small
    "done"
    ./A /usr/share/dict/cracklib-small  0.38s user 0.02s system 85% cpu 0.470 total

Ok. So building a compressed dictionary takes only a bit longer than an
uncompressed one (zlib is fast),

    $ du -hs dict.gz
    216K    dict.gz

Compressed dictionary is much smaller. Let's load it back in and unpickle it:

    main = do
        [f] <- getArgs
        m <- (decode . decompress) `fmap` L.readFile f
        print (M.size (m :: M.Map B.ByteString Int))
        print "done"

Also cute. But how does it run:

    $ time ./A dict.gz
    52848
    "done"
    ./A dict.gz  0.28s user 0.03s system 98% cpu 0.310 total

Interesting. So extracting the Map from a compressed bytestring in memory
is a fair bit faster than loading it directly, uncompressed from disk.

Neil, does that give you some ballpark figures to work with?

-- Don