Re: Pickling a finite map (Binary + zlib) [was: [Haskell-cafe] Data.Binary poor read performance]

2009-02-24 Thread Bertram Felgenhauer
Felipe Lessa wrote:
> On Tue, Feb 24, 2009 at 4:59 AM, Don Stewart  wrote:
> > Looks like the Map reading/showing via association lists could do with
> > further work.
> >
> > Anyone want to dig around in the Map instance? (There's also some patches 
> > for
> > an alternative lazy Map serialisation, if people are keen to load maps -- 
> > happstack devs?).
> 
> >>From binary-0.5:
> 
> instance (Ord k, Binary k, Binary e) => Binary (Map.Map k e) where
> put m = put (Map.size m) >> mapM_ put (Map.toAscList m)
> get   = liftM Map.fromDistinctAscList get
> 
> instance Binary a => Binary [a] where
> put l  = put (length l) >> mapM_ put l
> get= do n <- get :: Get Int
> replicateM n get
> 
> Can't get better, I think.

We can improve it slightly (about 20% runtime in dons example [*]):

   instance (Ord k, Binary k, Binary e) => Binary (Map.Map k e) where
   get = liftM (Map.fromDistinctAscList . map strictValue) get where
  strictValue (k,v) = (v `seq` k, v)

The point is that Data.Map.Map is strict in the keys, but not in the
values of the map. In the case of deserialisation this means the values
will be thunks that hang on to the Daya.Binary buffer.

> Now, from containers-0.2.0.0:
> 
> fromDistinctAscList :: [(k,a)] -> Map k a
> fromDistinctAscList xs
>   = build const (length xs) xs
>   where
> -- 1) use continutations so that we use heap space instead of stack space.
> -- 2) special case for n==5 to build bushier trees.
> build c 0 xs'  = c Tip xs'
> build c 5 xs'  = case xs' of
>((k1,x1):(k2,x2):(k3,x3):(k4,x4):(k5,x5):xx)
> -> c (bin k4 x4 (bin k2 x2 (singleton k1
> x1) (singleton k3 x3)) (singleton k5 x5)) xx
>_ -> error "fromDistinctAscList build"
> build c n xs'  = seq nr $ build (buildR nr c) nl xs'
>where
>  nl = n `div` 2
>  nr = n - nl - 1
> 
> buildR n c l ((k,x):ys) = build (buildB l k x c) n ys
> buildR _ _ _ [] = error "fromDistinctAscList buildR []"
> buildB l k x c r zs = c (bin k x l r) zs
> 
> 
> The builds seem fine, but we spot a (length xs) on the beginning.
> Maybe this is the culprit? We already know the size of the map (it was
> serialized), so it is just a matter of exporting
> 
> fromDistinctAscSizedList :: Int -> [(k, a)] -> Map k a

Eliminating the 'length' call helps, too, improving runtime by
another about 5%.

The result is still a factor of 1.7 slower than reading the list of
key/value pairs.

Bertram

[*] Notes on timings: 
1) I used `rnf` for all timings, as in my previous mail.
2) I noticed that in my previous measurements, the GC time for the
   Data.Map tests was excessively large (70% and more), so I used
   +RTS -H32M this time. This resulted in a significant runtime
   improvement of about 30%.
3) Do your own measurements! Some code to play with is available here:
   http://int-e.home.tlink.de/haskell/MapTest.hs
   http://int-e.home.tlink.de/haskell/Map.hs
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: Pickling a finite map (Binary + zlib) [was: [Haskell-cafe] Data.Binary poor read performance]

2009-02-24 Thread Paulo Tanimoto
Hello,

On Tue, Feb 24, 2009 at 2:36 AM, Don Stewart  wrote:
> This idea was the motivation for the new Seq instance, which uses
> internals to build quickly.
>
>    Encoding to disk, the dictionary,
>
>        $ time ./binary /usr/share/dict/cracklib-small
>        "done"
>        ./binary /usr/share/dict/cracklib-small  0.07s user 0.01s system 94% 
> cpu 0.088 total
>
>    Decoding,
>        $ time ./binary dict.gz
>        52848
>        "done"
>        ./binary dict.gz  0.07s user 0.01s system 97% cpu 0.079 total
>
>    instance (Binary e) => Binary (Seq.Seq e) where
>        put s = put (Seq.length s) >> Fold.mapM_ put s
>        get = do n <- get :: Get Int
>                 rep Seq.empty n get
>          where rep xs 0 _ = return $! xs
>                rep xs n g = xs `seq` n `seq` do
>                               x <- g
>                               rep (xs Seq.|> x) (n-1) g
>
>
> Just a lot better. :)
>
> So ... Data.Map, we're looking at you!

Indeed, that was the motivation for writing the patch for Seq.  [Ross,
thank you again for the help.]  I had performance issues with lists,
but noticed that switching to Sequence wasn't helping at all.  This
new definition takes advantage of the features that Seq has and List
doesn't.

Regarding Map, I like Felipe's idea of having a separate Internal.

I know that Lemmih has a few other implementations of Map (compactMap,
BerkeleyDB).  If I remember correctly, he made BerkeleyDB an instance
of Binary.  That can probably give us some insight too.

Paulo
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: Pickling a finite map (Binary + zlib) [was: [Haskell-cafe] Data.Binary poor read performance]

2009-02-24 Thread Bertram Felgenhauer
Don Stewart wrote:
> dons:
[...]
> Just serialising straight lists of pairs,
[...]
> And reading them back in,
> 
> main = do
> [f] <- getArgs
> m <- decode `fmap` L.readFile f
> print (length (m :: [(B.ByteString,Int)]))
> print "done"

Well, you don't actually read the whole list here, just its length:

instance Binary a => Binary [a] where
put l  = put (length l) >> mapM_ put l
get= do n <- get :: Get Int
replicateM n get

To demonstrate, this works:

main = do
L.writeFile "v" (encode (42 :: Int))
m <- decode `fmap` L.readFile "v"
print (length (m :: [Int]))

So instead, we should try something like this:

import Control.Parallel.Strategies

instance NFData B.ByteString where
rnf bs = bs `seq` ()

main = do
[f] <- getArgs
m <- decode `fmap` L.readFile f
print (rnf m `seq` length (m :: [(B.ByteString,Int)]))

My timings:

reading list, without rnf:
0.04s
with rnf:
0.16s
reading a Data.Map:
0.52s
with rnf:
0.62s

Bertram
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: Re[2]: Pickling a finite map (Binary + zlib) [was: [Haskell-cafe]Data.Binary poor read performance]

2009-02-24 Thread Claus Reinke

btw, i always thought that it should be a way to overcome any export
lists and go directly to module internals. limiting export is the way
to protect programmer from errors, not security feature, and it should
be left to programmer to decide when he don't need it. compilers
should just be able to check whether some module/library imported
abstractly or with internals too. this will render useless all those
.Internals modules that now we need to add everywhere


You're not alone!-) This has been called "Open Implementation" 
(OI, a pre-cursor of aspect-oriented programming):


   http://www2.parc.com/csl/groups/sda/projects/oi/

They argue for an explicit auxiliary interface instead of full access 
to module internals. Since these same folks worked on meta-object 
protocols as well (at the meta-level, the boundaries can be bypassed 
entirely), that suggestion probably comes from experience. They do 
allow for the auxiliary interface to use meta-programming style 
features, though that depends on the language in question (in 
Haskell, type classes or type functions might be used instead,

but rewrite rules and Template Haskell are also available).

   Open Implementation Design Guidelines 
   http://www2.parc.com/csl/groups/sda/publications/papers/Kiczales-ICSE97/


is a short paper discussing a Set API/Open Implementation example.


I agree in principle, but GHC also uses that knowledge to optimize the
code better - if a function is exported it has to produce the most
polymorphic possible code for its type, if it isn't it can specialize
better... that sort of thing.


That refers to optimization in the provider module. As the OI
people argued, optimization in the client modules also needs to
be taken into account. If the default one-size-fits-all-uses
implementation behind the default API doesn't work well 
enough, there'll be a proliferation of library variants. If there 
is a way to fine-tune the implementation via an auxiliary API,
much of that can be avoided. 

In other words, instead of half a dozen Maps and a dozen 
Array variants, there'd just be one of each, but with auxiliary 
interfaces that would allow client code to choose and tune 
the most suitable implementations behind the main interfaces.


It isn't magic, though: if, say, even the auxiliary API doesn't
allow you to say "thanks, but I know the length already", you're
still stuck. But it does help to think about designing 2 APIs
(the default functionality one and the auxiliary fine-tuning one)
instead of offering only the choice of 1 API or full access to 
internals.


Claus

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Export restrictions :) Re[4]: Pickling a finite map (Binary + zlib) [was: [Haskell-cafe] Data.Binary poor read performance]

2009-02-24 Thread Bulat Ziganshin
Hello Svein,

Tuesday, February 24, 2009, 3:47:44 PM, you wrote:

>> btw, i always thought that it should be a way to overcome any export
>> lists and go directly to module internals. limiting export is the way
>> to protect programmer from errors, not security feature, and it should
>> be left to programmer to decide when he don't need it. compilers
>> should just be able to check whether some module/library imported
>> abstractly or with internals too. this will render useless all those
>> .Internals modules that now we need to add everywhere
>>
> I agree in principle, but GHC also uses that knowledge to optimize the
> code better - if a function is exported it has to produce the most
> polymorphic possible code for its type, if it isn't it can specialize
> better... that sort of thing.

> So it's not for security purposes, it's for technical reasons; you
> can't override the export list externally because the information
> you'd need to use the functions simply doesn't exist.

well, obvious answer is that ghc should optimize according to export
specs AND add original function definition to the .o file

-- 
Best regards,
 Bulatmailto:bulat.zigans...@gmail.com

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: Re[2]: Pickling a finite map (Binary + zlib) [was: [Haskell-cafe] Data.Binary poor read performance]

2009-02-24 Thread Svein Ove Aas
2009/2/24 Bulat Ziganshin :
> Hello Felipe,
>
> Tuesday, February 24, 2009, 11:24:19 AM, you wrote:
>
>> Too bad 'Map' is exported as an abstract data type and it's not
>> straighforward to test this conjecture. Any ideas?
>
> just make a copy of its implementation to test
>
> btw, i always thought that it should be a way to overcome any export
> lists and go directly to module internals. limiting export is the way
> to protect programmer from errors, not security feature, and it should
> be left to programmer to decide when he don't need it. compilers
> should just be able to check whether some module/library imported
> abstractly or with internals too. this will render useless all those
> .Internals modules that now we need to add everywhere
>
I agree in principle, but GHC also uses that knowledge to optimize the
code better - if a function is exported it has to produce the most
polymorphic possible code for its type, if it isn't it can specialize
better... that sort of thing.

So it's not for security purposes, it's for technical reasons; you
can't override the export list externally because the information
you'd need to use the functions simply doesn't exist.
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re[2]: Pickling a finite map (Binary + zlib) [was: [Haskell-cafe] Data.Binary poor read performance]

2009-02-24 Thread Bulat Ziganshin
Hello Felipe,

Tuesday, February 24, 2009, 11:24:19 AM, you wrote:

> Too bad 'Map' is exported as an abstract data type and it's not
> straighforward to test this conjecture. Any ideas?

just make a copy of its implementation to test

btw, i always thought that it should be a way to overcome any export
lists and go directly to module internals. limiting export is the way
to protect programmer from errors, not security feature, and it should
be left to programmer to decide when he don't need it. compilers
should just be able to check whether some module/library imported
abstractly or with internals too. this will render useless all those
.Internals modules that now we need to add everywhere


-- 
Best regards,
 Bulatmailto:bulat.zigans...@gmail.com

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: Pickling a finite map (Binary + zlib) [was: [Haskell-cafe] Data.Binary poor read performance]

2009-02-24 Thread Felipe Lessa
On Tue, Feb 24, 2009 at 5:36 AM, Don Stewart  wrote:
> This idea was the motivation for the new Seq instance, which uses
> internals to build quickly.

The problem is that fromDistinctAscList is the best we can do right
now. We don't have something like (|>) that runs in O(1) time, and
trying to insert each element would give O(n log n) instead of O(n).
In fact, we don't even know if length is the culprit, although I'm
highly suspicious of it.

Maybe there should be Data.Map.Internal like Data.ByteString.Internal
so we can mess with the datatypes directly but without strong API
compatibility guarantees?

-- 
Felipe.
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: Pickling a finite map (Binary + zlib) [was: [Haskell-cafe] Data.Binary poor read performance]

2009-02-24 Thread Don Stewart
felipe.lessa:
> On Tue, Feb 24, 2009 at 4:59 AM, Don Stewart  wrote:
> > Looks like the Map reading/showing via association lists could do with
> > further work.
> >
> > Anyone want to dig around in the Map instance? (There's also some patches 
> > for
> > an alternative lazy Map serialisation, if people are keen to load maps -- 
> > happstack devs?).
> 
> From binary-0.5:
> 
> instance (Ord k, Binary k, Binary e) => Binary (Map.Map k e) where
> put m = put (Map.size m) >> mapM_ put (Map.toAscList m)
> get   = liftM Map.fromDistinctAscList get
> 
> instance Binary a => Binary [a] where
> put l  = put (length l) >> mapM_ put l
> get= do n <- get :: Get Int
> replicateM n get
> 
> 
> 
> Can't get better, I think. Now, from containers-0.2.0.0:
> 
> fromDistinctAscList :: [(k,a)] -> Map k a
> fromDistinctAscList xs
>   = build const (length xs) xs
>   where
> -- 1) use continutations so that we use heap space instead of stack space.
> -- 2) special case for n==5 to build bushier trees.
> build c 0 xs'  = c Tip xs'
> build c 5 xs'  = case xs' of
>((k1,x1):(k2,x2):(k3,x3):(k4,x4):(k5,x5):xx)
> -> c (bin k4 x4 (bin k2 x2 (singleton k1
> x1) (singleton k3 x3)) (singleton k5 x5)) xx
>_ -> error "fromDistinctAscList build"
> build c n xs'  = seq nr $ build (buildR nr c) nl xs'
>where
>  nl = n `div` 2
>  nr = n - nl - 1
> 
> buildR n c l ((k,x):ys) = build (buildB l k x c) n ys
> buildR _ _ _ [] = error "fromDistinctAscList buildR []"
> buildB l k x c r zs = c (bin k x l r) zs
> 
> 
> The builds seem fine, but we spot a (length xs) on the beginning.
> Maybe this is the culprit? We already know the size of the map (it was
> serialized), so it is just a matter of exporting
> 
> fromDistinctAscSizedList :: Int -> [(k, a)] -> Map k a
> 
> Too bad 'Map' is exported as an abstract data type and it's not
> straighforward to test this conjecture. Any ideas?
> 

This idea was the motivation for the new Seq instance, which uses
internals to build quickly.

Encoding to disk, the dictionary,

$ time ./binary /usr/share/dict/cracklib-small
"done"
./binary /usr/share/dict/cracklib-small  0.07s user 0.01s system 94% 
cpu 0.088 total

Decoding,
$ time ./binary dict.gz
52848
"done"
./binary dict.gz  0.07s user 0.01s system 97% cpu 0.079 total

instance (Binary e) => Binary (Seq.Seq e) where
put s = put (Seq.length s) >> Fold.mapM_ put s
get = do n <- get :: Get Int
 rep Seq.empty n get
  where rep xs 0 _ = return $! xs
rep xs n g = xs `seq` n `seq` do
   x <- g
   rep (xs Seq.|> x) (n-1) g


Just a lot better. :)

So ... Data.Map, we're looking at you!

-- Don
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: Pickling a finite map (Binary + zlib) [was: [Haskell-cafe] Data.Binary poor read performance]

2009-02-24 Thread Felipe Lessa
On Tue, Feb 24, 2009 at 4:59 AM, Don Stewart  wrote:
> Looks like the Map reading/showing via association lists could do with
> further work.
>
> Anyone want to dig around in the Map instance? (There's also some patches for
> an alternative lazy Map serialisation, if people are keen to load maps -- 
> happstack devs?).

>From binary-0.5:

instance (Ord k, Binary k, Binary e) => Binary (Map.Map k e) where
put m = put (Map.size m) >> mapM_ put (Map.toAscList m)
get   = liftM Map.fromDistinctAscList get

instance Binary a => Binary [a] where
put l  = put (length l) >> mapM_ put l
get= do n <- get :: Get Int
replicateM n get



Can't get better, I think. Now, from containers-0.2.0.0:

fromDistinctAscList :: [(k,a)] -> Map k a
fromDistinctAscList xs
  = build const (length xs) xs
  where
-- 1) use continutations so that we use heap space instead of stack space.
-- 2) special case for n==5 to build bushier trees.
build c 0 xs'  = c Tip xs'
build c 5 xs'  = case xs' of
   ((k1,x1):(k2,x2):(k3,x3):(k4,x4):(k5,x5):xx)
-> c (bin k4 x4 (bin k2 x2 (singleton k1
x1) (singleton k3 x3)) (singleton k5 x5)) xx
   _ -> error "fromDistinctAscList build"
build c n xs'  = seq nr $ build (buildR nr c) nl xs'
   where
 nl = n `div` 2
 nr = n - nl - 1

buildR n c l ((k,x):ys) = build (buildB l k x c) n ys
buildR _ _ _ [] = error "fromDistinctAscList buildR []"
buildB l k x c r zs = c (bin k x l r) zs


The builds seem fine, but we spot a (length xs) on the beginning.
Maybe this is the culprit? We already know the size of the map (it was
serialized), so it is just a matter of exporting

fromDistinctAscSizedList :: Int -> [(k, a)] -> Map k a

Too bad 'Map' is exported as an abstract data type and it's not
straighforward to test this conjecture. Any ideas?

-- 
Felipe.
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: Pickling a finite map (Binary + zlib) [was: [Haskell-cafe] Data.Binary poor read performance]

2009-02-24 Thread Don Stewart
dons:
> wren:
> > Neil Mitchell wrote:
> >> 2) The storage for String seems to be raw strings, which is nice.
> >> Would I get a substantial speedup by moving to bytestrings instead of
> >> strings? If I hashed the strings and stored common ones in a hash
> >> table is it likely to be a big win?
> >
> > Bytestrings should help. The big wins in this application are likely to  
> > be cache issues, though the improved memory/GC overhead is nice too.
> >
> 
> Here's a quick demo using Data.Binary directly.
> 
> Now, let's read back in and decode it back to a Map 
> 
> main = do
> [f] <- getArgs
> m   <- decodeFile f
> print (M.size (m :: M.Map B.ByteString Int))
> print "done"
> 
> Easy enough:
> 
> $ time ./A dict +RTS -K20M
> 52848
> "done"
> ./A dict +RTS -K20M  1.51s user 0.06s system 99% cpu 1.582 total


  
> Compressed dictionary is much smaller. Let's load it back in and unpickle it:
> 
> main = do
> [f] <- getArgs
> m <- (decode . decompress) `fmap` L.readFile f
> print (M.size (m :: M.Map B.ByteString Int))
> print "done"
> 
> Also cute. But how does it run:
> 
> $ time ./A dict.gz
> 52848
> "done"
> ./A dict.gz  0.28s user 0.03s system 98% cpu 0.310 total
> 
> Interesting. So extracting the Map from a compressed bytestring in memory is
> a fair bit faster than loading it  directly, uncompressed from disk.
> 


Note the difference, as Duncan and Bulat pointed out, is a bit
surprising. Perhaps the Map instance is a bit weird? We already know
that bytestring IO is fine.

Just serialising straight lists of pairs,

import Data.Binary
import Data.List
import qualified Data.ByteString.Char8 as B
import qualified Data.ByteString.Lazy  as L
import System.Environment
import qualified Data.Map as M
import Codec.Compression.GZip

main = do
[f] <- getArgs
s   <- B.readFile f
let m = [ (head n, length n) | n <- (group . B.lines $ s) ]
L.writeFile "dict.gz" . encode $ m
print "done"

$ time ./binary /usr/share/dict/cracklib-small
"done"
./binary /usr/share/dict/cracklib-small  0.13s user 0.04s system 99% cpu
0.173 total

$ du -hs dict 
1.3Mdict

And reading them back in,

main = do
[f] <- getArgs
m <- decode `fmap` L.readFile f
print (length (m :: [(B.ByteString,Int)]))
print "done"

$ time ./binary dict
52848
"done"
./binary dict  0.04s user 0.01s system 99% cpu 0.047 total

Is fast. So there's some complication in the Map serialisation. Adding in zlib,
to check,

main = do
[f] <- getArgs
s   <- B.readFile f
let m = [ (head n, length n) | n <- (group . B.lines $ s) ]
L.writeFile "dict.gz" . compress . encode $ m
print "done"

$ time ./binary /usr/share/dict/cracklib-small 
"done"
./binary /usr/share/dict/cracklib-small  0.25s user 0.03s system
100% cpu 0.277 total

Compression takes longer, as expected, and reading it back in,

main = do
[f] <- getArgs
m <- (decode . decompress) `fmap` L.readFile f
print (length (m :: [(B.ByteString,Int)]))
print "done"

$ time ./binary dict.gz
52848
"done"
./binary dict.gz  0.03s user 0.01s system 98% cpu 0.040 total

About the same.

Looks like the Map reading/showing via association lists could do with
further work. 

Anyone want to dig around in the Map instance? (There's also some patches for
an alternative lazy Map serialisation, if people are keen to load maps -- 
happstack devs?).

-- Don
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re[2]: Pickling a finite map (Binary + zlib) [was: [Haskell-cafe] Data.Binary poor read performance]

2009-02-23 Thread Bulat Ziganshin
Hello Duncan,

Tuesday, February 24, 2009, 5:13:05 AM, you wrote:

> That's actually rather surprising. The system time is negligible and the

yes, this looks highly suspicious. we use 1.5 seconds for 1.3 mb file,
and 0.3 seconds for 0.3 mb file, so it looks like readFile monopolize
cpu here, being able to read only 1mb per second. decoding into list
of strings (instead of building map) may help to check this assumption


-- 
Best regards,
 Bulatmailto:bulat.zigans...@gmail.com

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: Pickling a finite map (Binary + zlib) [was: [Haskell-cafe] Data.Binary poor read performance]

2009-02-23 Thread Duncan Coutts
On Mon, 2009-02-23 at 17:03 -0800, Don Stewart wrote:

> Here's a quick demo using Data.Binary directly.

[...]

> $ time ./A dict +RTS -K20M
> 52848
> "done"
> ./A dict +RTS -K20M  1.51s user 0.06s system 99% cpu 1.582 total
> 
> 
> Ok. So 1.5s to decode a 1.3M Map. There may be better ways to build the Map 
> since we know the input will be sorted, but
> the Data.Binary instance can't do that.

[...]

> $ time ./A dict.gz
> 52848
> "done"
> ./A dict.gz  0.28s user 0.03s system 98% cpu 0.310 total
> 
> Interesting. So extracting the Map from a compressed bytestring in memory is 
> a fair bit faster than loading it 
> directly, uncompressed from disk.

That's actually rather surprising. The system time is negligible and the
difference between total and user time does not leave much for time
wasted doing i/o. So that's a real difference in user time. So what is
going on? We're doing the same amount of binary decoding in each right?
We're also allocating the same number of buffers, in fact slightly more
in the case that uses compression. The time taken to cat a meg through
an Handle using lazy bytestring is nothing. So where is all that time
going?

Duncan

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Pickling a finite map (Binary + zlib) [was: [Haskell-cafe] Data.Binary poor read performance]

2009-02-23 Thread Don Stewart
wren:
> Neil Mitchell wrote:
>> 2) The storage for String seems to be raw strings, which is nice.
>> Would I get a substantial speedup by moving to bytestrings instead of
>> strings? If I hashed the strings and stored common ones in a hash
>> table is it likely to be a big win?
>
> Bytestrings should help. The big wins in this application are likely to  
> be cache issues, though the improved memory/GC overhead is nice too.
>

Here's a quick demo using Data.Binary directly.

First, let's read in the dictionary file, and build a big, worst-case finite
map of words to their occurence (always 1).

import Data.Binary
import Data.List
import qualified Data.ByteString.Char8 as B
import System.Environment
import qualified Data.Map as M

main = do
[f] <- getArgs
s   <- B.readFile f
let m = M.fromList [ (head n, length n) | n <- (group . B.lines $ s) ]
encodeFile "dict" m
print "done"

So that writes a "dict" file with a binary encoded Map ByteString Int.
Using ghc -O2 --make for everying.

$ time ./A /usr/share/dict/cracklib-small
"done"
./A /usr/share/dict/cracklib-small  0.28s user 0.03s system 94% cpu 0.331 
total

Yields a dictionary file Map:

$ du -hs dict
1.3Mdict

Now, let's read back in and decode it back to a Map 

main = do
[f] <- getArgs
m   <- decodeFile f
print (M.size (m :: M.Map B.ByteString Int))
print "done"

Easy enough:

$ time ./A dict +RTS -K20M
52848
"done"
./A dict +RTS -K20M  1.51s user 0.06s system 99% cpu 1.582 total


Ok. So 1.5s to decode a 1.3M Map. There may be better ways to build the Map 
since we know the input will be sorted, but
the Data.Binary instance can't do that.

Since decode/encode are a nice pure function on lazy bytestrings, we can try a 
trick of 
compressing/decompressing the dictionary in memory.

Compressing the dictionary:

import Data.Binary
import Data.List
import qualified Data.ByteString.Char8 as B
import qualified Data.ByteString.Lazy  as L
import System.Environment
import qualified Data.Map as M
import Codec.Compression.GZip

main = do
[f] <- getArgs
s   <- B.readFile f
let m = M.fromList [ (head n, length n) | n <- (group . B.lines $ s) ]
L.writeFile "dict.gz" . compress . encode $ m
print "done"

Pretty cool, imo, is "compress . encode":

$ time ./A /usr/share/dict/cracklib-small 
"done"
./A /usr/share/dict/cracklib-small  0.38s user 0.02s system 85% cpu 0.470 
total

Ok. So building a compressed dictionary takes only a bit longer than 
uncompressed one (zlib is fast),

$ du -hs dict.gz 
216Kdict.gz

Compressed dictionary is much smaller. Let's load it back in and unpickle it:

main = do
[f] <- getArgs
m <- (decode . decompress) `fmap` L.readFile f
print (M.size (m :: M.Map B.ByteString Int))
print "done"

Also cute. But how does it run:

$ time ./A dict.gz
52848
"done"
./A dict.gz  0.28s user 0.03s system 98% cpu 0.310 total

Interesting. So extracting the Map from a compressed bytestring in memory is a 
fair bit faster than loading it 
directly, uncompressed from disk.

Neil, does that give you some ballpark figures to work with?

-- Don
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe