Re: [Haskell-cafe] [RFC] benchmarks of bytestrings, teaser

2007-12-17 Thread Peter Lund
On Sun, 2007-12-16 at 15:21 -0800, Don Stewart wrote:

 An updated bytestring library is at :
 
 
 http://hackage.haskell.org/cgi-bin/hackage-scripts/package/bytestring-0.9.0.2
 
 Enjoy! :)

Thanks!

-Peter




Re: [Haskell-cafe] [RFC] benchmarks of bytestrings, teaser

2007-12-16 Thread Roman Leshchinskiy

Don Stewart wrote:


 cnt :: B.ByteString -> Int64
 cnt bs = B.length (B.filter (== ' ') bs)


 [...]
 
Now, this memory result is suspicious, I wonder if the now obsolete 'array
fusion' is messing things up. In Data.ByteString.Lazy, we have:


Are you sure you have a fusible length? I think I only added it to NDP 
after stream fusion went in.


Roman



Re: [Haskell-cafe] [RFC] benchmarks of bytestrings, teaser

2007-12-16 Thread Don Stewart
rl:
 Don Stewart wrote:
 
  cnt :: B.ByteString -> Int64
  cnt bs = B.length (B.filter (== ' ') bs)
 
  [...]
  
 Now, this memory result is suspicious, I wonder if the now obsolete 'array
 fusion' is messing things up. In Data.ByteString.Lazy, we have:
 
 Are you sure you have a fusible length? I think I only added it to NDP 
 after stream fusion went in.

It was the array fusion from prior to the stream stuff (foldl . filter),
which was in fact messing up the simplifier. I've fixed this (turned
off array fusion for now), and things are back to normal (well, much
faster, actually).

-- Don


Re: [Haskell-cafe] [RFC] benchmarks of bytestrings, teaser

2007-12-16 Thread Don Stewart
I've had a look at how some of the code was being compiled for 
strict and lazy bytestrings, and also which rules weren't firing.
With some small tweaks the code seems back in good shape.

An updated bytestring library is at :


http://hackage.haskell.org/cgi-bin/hackage-scripts/package/bytestring-0.9.0.2

Enjoy! :)



Summary: the suspicious lazy bytestring program works now. (constant
space, and fastest overall, as expected originally)

Program 1, lazy bytestring length . filter

Yesterday:
./A +RTS -sstderr < 150M  1.01s user 0.10s system 98% cpu 1.123 total
40M allocated

Today (fixed!):
./A +RTS -sstderr < 150M  0.26s user 0.06s system 96% cpu 0.332 total
2M allocated

Reason: deprecated array fusion mucking up the optimiser.

I think we can close this regression.



Also, I had a look at Program 3: lazy bytestring, custom loop

Unchanged. 2.4s, constant space. This was a bit slow.

Further investigation shows lots of unnecessary bounds checks, as we
take apart the Chunk lazy bytestring type, then test and continue.

This representation was chosen to make it possible to process chunks
efficiently, so that we can avoid these bounds checks.
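For reference, the Empty and Chunk constructors used below come from the
lazy ByteString representation, which is (roughly, modulo the exact
strictness/unpack pragmas) a cons-list of strict chunks:

    data ByteString = Empty
                    | Chunk {-# UNPACK #-} !S.ByteString ByteString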

Something like this instead:

-- (assuming B is the lazy and S the strict ByteString module, with
-- w2c/unsafeHead/unsafeTail from the internal/unsafe interfaces)

cnt :: Int -> B.ByteString -> Int
cnt n B.Empty        = n
cnt n (B.Chunk x xs) = cnt (n + cnt_strict 0 x) xs      -- process lazy spine

    -- now we can process a chunk without checking for Empty
    where
        cnt_strict !i !s                                -- then strict chunk
            | S.null s  = i
            | c == ' '  = cnt_strict (i+1) t
            | otherwise = cnt_strict i t
          where
            (c,t) = (S.w2c (S.unsafeHead s), S.unsafeTail s)  -- no bounds check

main = do s <- B.getContents; print (cnt 0 s)

This lets us avoid redundant checks for Empty, while allowing the strict
inner loop to avoid unnecessary checks for the empty strict bytestring.
This is some 4x faster.

This alternating between lazy spines and strict chunk processing is
the best way to get reliable performance from lazy bytestring custom loops.
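The same pattern, abstracted (a sketch built on the internal Chunk/Empty
constructors as above; 'foldChunks' is a name used here for illustration,
not an existing export):

    {-# LANGUAGE BangPatterns #-}

    -- fold a strict accumulator over the strict chunks of a lazy ByteString
    foldChunks :: (a -> S.ByteString -> a) -> a -> B.ByteString -> a
    foldChunks f = go
      where
        go !a B.Empty        = a
        go !a (B.Chunk c cs) = go (f a c) cs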

-- Don


[Haskell-cafe] [RFC] benchmarks of bytestrings, teaser

2007-12-15 Thread Peter Lund
What do you think the relative speeds are of the six small Haskell
programs at the end of this email?

All they do is read from stdin and count the number of spaces they see.
There are two that use strict bytestrings, two that use lazy
bytestrings, and two that use the standard Haskell strings.  Three use a
recursive function with an accumulator parameter and three use a foldl
with a lambda function.

Say the fastest one takes the time 1.  How much time will the others
take?

And how about memory?  How much memory do you think they require?  Let's
say we feed a 150MB(*) file into each of them, how many megabytes do you
think they end up using (as seen from the OS, not in terms of how big
the live heap is)?

I'm going to post full benchmarks + analysis on Wednesday.

-Peter

*) harddisk megabytes.  The file is 150,000,000 bytes ≈ 143 mebibytes.


PS: For extra credit, what do you think is the peak memory use for this
program when given an input file of 150MB?


{-# LANGUAGE BangPatterns #-}

import qualified Data.ByteString.Lazy.Char8 as B
import GHC.Int (Int64)

-- note that D.BS.Lazy.Char8.length is ByteString -> Int64
--           D.BS.C8.length         is ByteString -> Int
cnt :: B.ByteString -> Int64
cnt bs  = B.length (B.filter (== ' ') bs)

main = do s <- B.getContents
          print (cnt s)




==
hs/space-bs-c8-acc-1.hs:
{-# LANGUAGE BangPatterns #-}

import qualified Data.ByteString.Char8 as B

cnt :: Int -> B.ByteString -> Int
cnt !acc bs = if B.null bs
                then acc
                else cnt (if B.head bs == ' ' then acc+1 else acc) (B.tail bs)

main = do s <- B.getContents
          print (cnt 0 s)
==
hs/space-bslc8-acc-1.hs:
{-# LANGUAGE BangPatterns #-}

import qualified Data.ByteString.Lazy.Char8 as B

cnt :: Int -> B.ByteString -> Int
cnt !acc bs = if B.null bs
                then acc
                else cnt (if B.head bs == ' ' then acc+1 else acc) (B.tail bs)

main = do s <- B.getContents
          print (cnt 0 s)
==
hs/space-x-acc-1.hs:
{-# LANGUAGE BangPatterns #-}

cnt :: Int -> String -> Int
cnt !acc bs = if null bs
                then acc
                else cnt (if head bs == ' ' then acc+1 else acc) (tail bs)

main = do s <- getContents
          print (cnt 0 s)
==
hs/space-bs-c8-foldlx-1.hs:
{-# LANGUAGE BangPatterns #-}

import qualified Data.ByteString.Char8 as B

cnt :: B.ByteString -> Int
cnt bs  = B.foldl' (\sum c -> if c == ' ' then sum+1 else sum) 0 bs

main = do s <- B.getContents
          print (cnt s)
==
hs/space-bslc8-foldlx-1.hs:
{-# LANGUAGE BangPatterns #-}

import qualified Data.ByteString.Lazy.Char8 as B

cnt :: B.ByteString -> Int
cnt bs  = B.foldl' (\sum c -> if c == ' ' then sum+1 else sum) 0 bs

main = do s <- B.getContents
          print (cnt s)
==
hs/space-x-foldl.hs:
{-# LANGUAGE BangPatterns #-}

cnt :: String -> Int
cnt bs  = foldl (\sum c -> if c == ' ' then sum+1 else sum) 0 bs

main = do s <- getContents
          print (cnt s)
==




Re: [Haskell-cafe] [RFC] benchmarks of bytestrings, teaser

2007-12-15 Thread Duncan Coutts

On Sat, 2007-12-15 at 09:25 +0100, Peter Lund wrote:
 What do you think the relative speeds are of the six small haskell
 programs at the end of this email?

Ok, I presume this is a guessing game and we're supposed to just look at
the code without running and timing them.

 All they do is read from stdin and count the number of spaces they see.
 There are two that use strict bytestrings, two that use lazy
 bytestrings, and two that use the standard Haskell strings.  Three use a
 recursive function with an accumulator parameter and three use a foldl
 with a lambda function.
 
 Say the fastest one takes the time 1.  How much time will the others
 take?
 
 And how about memory?  How much memory do you think they require?  Let's
 say we feed a 150MB(*) file into each of them, how many megabytes do you
 think they end up using (as seen from the OS, not in terms of how big
 the live heap is)?
 
 I'm going to post full benchmarks + analysis on Wednesday.

Right'o. I'll have a go. Lets see if I can't embarrass myself with being
completely inaccurate.


 PS: For extra credit, what do you think is the peak memory use for this
 program when given an input file of 150MB?

Ok.

 {-# LANGUAGE BangPatterns #-}
 
 import qualified Data.ByteString.Lazy.Char8 as B
 import GHC.Int (Int64)
 
 -- note that D.BS.Lazy.Char8.length is ByteString -> Int64
 --           D.BS.C8.length         is ByteString -> Int

Yes, because a strict bytestring cannot be bigger than the size of virtual
memory, and with ghc at least, Int tracks the size of the machine
pointer.

 cnt :: B.ByteString -> Int64
 cnt bs = B.length (B.filter (== ' ') bs)

 main = do s <- B.getContents
           print (cnt s)

Hmm. So that should work in constant memory, a few 64k chunks at once.
I'd expect this to be pretty fast.

 
 
 
 ==
 hs/space-bs-c8-acc-1.hs:
 {-# LANGUAGE BangPatterns #-}
 
 import qualified Data.ByteString.Char8 as B
 
 cnt :: Int -> B.ByteString -> Int
 cnt !acc bs = if B.null bs
                 then acc
                 else cnt (if B.head bs == ' ' then acc+1 else acc) (B.tail bs)

 main = do s <- B.getContents
           print (cnt 0 s)

This uses strict bytestrings so will use at least 150Mb and that'll make
it a good deal slower. In fact it'll be worse than that since
getContents does not know in advance how big the input will be so it has
to play the doubling and copying game. So it'll end up copying all the
data roughly twice. cnt is strict and tail recursive so that shouldn't
be any problem, though it's probably not as fast as the first length .
filter since head, tail, null all have to do bounds checks.
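As an aside, the same loop can get by with one check per step instead of
three (null, head, tail), using uncons from the public Char8 API; a
sketch, not one of the benchmarked programs:

    {-# LANGUAGE BangPatterns #-}
    import qualified Data.ByteString.Char8 as B

    cnt :: Int -> B.ByteString -> Int
    cnt !acc bs = case B.uncons bs of
        Nothing        -> acc
        Just (c, rest) -> cnt (if c == ' ' then acc + 1 else acc) rest

    main = do s <- B.getContents
              print (cnt 0 s)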

 ==
 hs/space-bslc8-acc-1.hs:
 {-# LANGUAGE BangPatterns #-}
 
 import qualified Data.ByteString.Lazy.Char8 as B
 
 cnt :: Int -> B.ByteString -> Int
 cnt !acc bs = if B.null bs
                 then acc
                 else cnt (if B.head bs == ' ' then acc+1 else acc) (B.tail bs)

 main = do s <- B.getContents
           print (cnt 0 s)

For the same reason as above, I'd expect this cnt to be slower than
B.length . B.filter (== ' ')

 ==
 hs/space-x-acc-1.hs:
 {-# LANGUAGE BangPatterns #-}
 
 cnt :: Int -> String -> Int
 cnt !acc bs = if null bs
                 then acc
                 else cnt (if head bs == ' ' then acc+1 else acc) (tail bs)

 main = do s <- getContents
           print (cnt 0 s)

Lazy, so constant memory use, but much higher constant factors due to
using String.

 ==
 hs/space-bs-c8-foldlx-1.hs:
 {-# LANGUAGE BangPatterns #-}
 
 import qualified Data.ByteString.Char8 as B
 
 cnt :: B.ByteString -> Int
 cnt bs = B.foldl' (\sum c -> if c == ' ' then sum+1 else sum) 0 bs

 main = do s <- B.getContents
           print (cnt s)

This is of course still strict so that's going to make the reading slow.

This is a manually fused B.length . B.filter (== ' ') which hopefully is
the same speed as the automatically fused one if the fusion is working
ok. If not, then the B.length . B.filter (== ' ') will be doing an extra
copy, and memory writes are expensive.

 ==
 hs/space-bslc8-foldlx-1.hs:
 {-# LANGUAGE BangPatterns #-}
 
 import qualified Data.ByteString.Lazy.Char8 as B
 
 cnt :: B.ByteString -> Int
 cnt bs = B.foldl' (\sum c -> if c == ' ' then sum+1 else sum) 0 bs

 main = do s <- B.getContents
           print (cnt s)

As above but now in constant memory space.

 ==
 hs/space-x-foldl.hs:
 {-# LANGUAGE BangPatterns #-}
 
 cnt :: String -> Int
 cnt bs = foldl (\sum c -> if c == ' ' then sum+1 else sum) 0 bs

 main = do s <- getContents
           print (cnt s)

Oh, no! Not foldl, that's a killer.
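To spell out the worry (a sketch, not code from the thread; as later
messages in this thread show, GHC's strictness analyser can rescue foldl
here when compiling with -O2):

    import Data.List (foldl')

    -- foldl defers every addition: the accumulator becomes a nest of
    -- thunks as long as the input, forced only at the very end by print.
    cntLeaky :: String -> Int
    cntLeaky = foldl (\n c -> if c == ' ' then n + 1 else n) 0

    -- foldl' forces the accumulator at each step, so it stays a
    -- machine-sized Int throughout.
    cntOk :: String -> Int
    cntOk = foldl' (\n c -> if c == ' ' then n + 1 else n) 0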


Ok, so best way to summarise I think is to organise by data type since I
think that'll dominate.

So I think the lazy bytestring versions will be fastest due to having
the best memory access patterns and doing the least copying. I think the
foldl's will be faster than the explicit accumulators due to having
fewer bounds checks.

Re: [Haskell-cafe] [RFC] benchmarks of bytestrings, teaser

2007-12-15 Thread Peter Lund
On Sat, 2007-12-15 at 14:34 +, Duncan Coutts wrote:

 Ok, I presume this is a guessing game and we're supposed to just look at
 the code without running and timing them.

Precisely :)

  All they do is read from stdin and count the number of spaces they see.
  There are two that use strict bytestrings, two that use lazy
  bytestrings, and two that use the standard Haskell strings.  Three use a
  recursive function with an accumulator parameter and three use a foldl
  with a lambda function.
  
  Say the fastest one takes the time 1.  How much time will the others
  take?
  
  And how about memory?  How much memory do you think they require?  Let's
  say we feed a 150MB(*) file into each of them, how many megabytes do you
  think they end up using (as seen from the OS, not in terms of how big
  the live heap is)?
  
  I'm going to post full benchmarks + analysis on Wednesday.
 
 Right'o. I'll have a go. Lets see if I can't embarrass myself with being
 completely inaccurate.

Thanks for biting!

You were, thankfully, only almost completely inaccurate ;)

  PS: For extra credit, what do you think is the peak memory use for this
  program when given an input file of 150MB?

 Hmm. So that should work in constant memory, a few 64k chunks at once.
 I'd expect this to be pretty fast.

You are right about the speed.
Can you guess a number in kilobytes?

  
  
  
  ==
  hs/space-bs-c8-acc-1.hs:
  {-# LANGUAGE BangPatterns #-}
  
  import qualified Data.ByteString.Char8 as B
  
  cnt :: Int -> B.ByteString -> Int
  cnt !acc bs = if B.null bs
                  then acc
                  else cnt (if B.head bs == ' ' then acc+1 else acc) (B.tail bs)

  main = do s <- B.getContents
            print (cnt 0 s)
 
 This uses strict bytestrings so will use at least 150Mb and that'll make
 it a good deal slower. In fact it'll be worse than that since
 getContents does not know in advance how big the input will be so it has
 to play the doubling and copying game. So it'll end up copying all the
 data roughly twice. cnt is strict and tail recursive so that shouldn't
 be any problem, though it's probably not as fast as the first length .
 filter since head, tail, null all have to do bounds checks.

You are right about the memory.  It is actually slightly faster than the
extra credit (length/filter combination) above.

  ==
  hs/space-bslc8-acc-1.hs:
  {-# LANGUAGE BangPatterns #-}
  
  import qualified Data.ByteString.Lazy.Char8 as B
  
  cnt :: Int -> B.ByteString -> Int
  cnt !acc bs = if B.null bs
                  then acc
                  else cnt (if B.head bs == ' ' then acc+1 else acc) (B.tail bs)

  main = do s <- B.getContents
            print (cnt 0 s)
 
 For the same reason as above, I'd expect this cnt to be slower than
 B.length . B.filter (== ' ')

It is slower but not for the same reason as above.

  ==
  hs/space-x-acc-1.hs:
  {-# LANGUAGE BangPatterns #-}
  
  cnt :: Int -> String -> Int
  cnt !acc bs = if null bs
                  then acc
                  else cnt (if head bs == ' ' then acc+1 else acc) (tail bs)

  main = do s <- getContents
            print (cnt 0 s)
 
 Lazy, so constant memory use, but much higher constant factors due to
 using String.

Spot on.

  ==
  hs/space-bs-c8-foldlx-1.hs:
  {-# LANGUAGE BangPatterns #-}
  
  import qualified Data.ByteString.Char8 as B
  
  cnt :: B.ByteString -> Int
  cnt bs  = B.foldl' (\sum c -> if c == ' ' then sum+1 else sum) 0 bs

  main = do s <- B.getContents
            print (cnt s)
 
 This is of course still strict so that's going to make the reading slow.

Nope.

 This is a manually fused B.length . B.filter (== ' ') which hopefully is
 the same speed as the automatically fused one if the fusion is working
 ok. If not, then the B.length . B.filter (== ' ') will be doing an extra
 copy, and memory writes are expensive.
 
  ==
  hs/space-bslc8-foldlx-1.hs:
  {-# LANGUAGE BangPatterns #-}
  
  import qualified Data.ByteString.Lazy.Char8 as B
  
  cnt :: B.ByteString -> Int
  cnt bs  = B.foldl' (\sum c -> if c == ' ' then sum+1 else sum) 0 bs

  main = do s <- B.getContents
            print (cnt s)
 
 As above but now in constant memory space.

Nope.

  ==
  hs/space-x-foldl.hs:
  {-# LANGUAGE BangPatterns #-}
  
  cnt :: String -> Int
  cnt bs  = foldl (\sum c -> if c == ' ' then sum+1 else sum) 0 bs

  main = do s <- getContents
            print (cnt s)
 
 Oh, no! Not foldl, that's a killer.

You think it's worse than the program just above?

 
 Ok, so best way to summarise I think is to organise by data type since I
 think that'll dominate.
 
 So I think the lazy bytestring versions will be fastest due to having
 the best memory access patterns and doing the least copying. I think the
 foldl's will be faster than the explicit accumulators due to having
 fewer bounds checks.
 
 space-bslc8-foldlx-1
 

Re: [Haskell-cafe] [RFC] benchmarks of bytestrings, teaser

2007-12-15 Thread Don Stewart
firefly:
 What do you think the relative speeds are of the six small haskell
 programs at the end of this email?
 
 All they do is read from stdin and count the number of spaces they see.
 There are two that use strict bytestrings, two that use lazy
 bytestrings, and two that use the standard Haskell strings.  Three use a
 recursive function with an accumulator parameter and three use a foldl
 with a lambda function.
 
 Say the fastest one takes the time 1.  How much time will the others
 take?
 
 And how about memory?  How much memory do you think they require?  Let's
 say we feed a 150MB(*) file into each of them, how many megabytes do you
 think they end up using (as seen from the OS, not in terms of how big
 the live heap is)?
 
 I'm going to post full benchmarks + analysis on Wednesday.

How are you compiling these programs, by the way?  ghc-6.8.2 -O2  ?
(-O2 is required for bytestrings :)

-- Don


Re: [Haskell-cafe] [RFC] benchmarks of bytestrings, teaser

2007-12-15 Thread Peter Lund
On Sat, 2007-12-15 at 11:59 -0800, Don Stewart wrote:
 firefly:
  What do you think the relative speeds are of the six small haskell
  programs at the end of this email?
  
  All they do is read from stdin and count the number of spaces they see.
  There are two that use strict bytestrings, two that use lazy
  bytestrings, and two that use the standard Haskell strings.  Three use a
  recursive function with an accumulator parameter and three use a foldl
  with a lambda function.
  
  Say the fastest one takes the time 1.  How much time will the others
  take?
  
  And how about memory?  How much memory do you think they require?  Let's
  say we feed a 150MB(*) file into each of them, how many megabytes do you
  think they end up using (as seen from the OS, not in terms of how big
  the live heap is)?
  
  I'm going to post full benchmarks + analysis on Wednesday.
 
 How are you compiling these programs, by the way?  ghc-6.8.2 -O2  ?
 (-O2 is required for bytestrings :)

With -O2.  I have measured with 6.8.1, 6.9.20071119, 6.9.20071208
(approx), and 6.9.200712xx (as of yesterday or today).  The picture
changes very little with the compiler, if it changes at all.

I have run them on three very different microarchitectures (2GHz
Athlon64 3000+, 1667MHz Core Duo, 600MHz Pentium III).

All the measurements are scripted.  'make phase1' compiles the
benchmarks, creates input files of various sizes, and runs each
benchmark once to gather information about memory use (peak RSS + ghc
RTS' own information about allocations and gcs), page faults, and an
strace.  This phase is not timing sensitive so I can browse the web and
listen to the music etc. while running it.

'make phase2' runs each benchmark a number of times, calculates the
average time for each + the relative size of the standard deviation +
how much user+sys differs from real (as reported by bash's built-in
time command).  A report with barcharts indicating relative time and
relative peak RSS is generated in either pure ASCII or in UTF-8 (with
fractional-width block chars so the charts look nice and have high
resolution).  If the measurements are deemed to be bad (too high
standard deviation or user+sys doesn't add up to real) then the barchart
is done with '%' characters.  The quality indicators for each timing
test are always printed out next to each bar, so we know how close we
are to being perfect or, conversely, how bad the measurements are.

There is a script that analyzes the I/O pattern and sums it up (as '4375
x read(3, 4096, ...) = 4096' followed by '1 x read(3, 4096, ...) = 1242'
followed by '1 x read(3, 4096, ...) = 0', and similar).

There is a set of simple I/O programs in C so we can compare ghc's
performance with the speed of light, and so different I/O strategies can
be compared in a cleaner, purer form.

There are also 'make cache' (runs cachegrind), 'make
hs/<name>.doc' (creates a file with source code + core + stg + c-- +
assembly + times for a given benchmark), etc.

'make sysinfo' creates a file with information about the Linux
distribution used (/etc/lsb-release), kernel version (uname -a), CPU
(/proc/cpuinfo), and compilers used (ghc --version, gcc --version).

'make zipdata' creates a zip file of about 20K with all the raw time
measurements + the sysinfo.

I also have a set of scripts that installs each ghc version so 1) it is
easy for me to repeat the tests on various machines and 2) you can see
exactly which ghc versions I use and repeat the experiments yourself.

-Peter



Re: [Haskell-cafe] [RFC] benchmarks of bytestrings, teaser

2007-12-15 Thread Don Stewart
firefly:
 What do you think the relative speeds are of the six small haskell
 programs at the end of this email?
 
 All they do is read from stdin and count the number of spaces they see.
 There are two that use strict bytestrings, two that use lazy
 bytestrings, and two that use the standard Haskell strings.  Three use a
 recursive function with an accumulator parameter and three use a foldl
 with a lambda function.
 
 Say the fastest one takes the time 1.  How much time will the others
 take?
 
 And how about memory?  How much memory do you think they require?  Let's
 say we feed a 150MB(*) file into each of them, how many megabytes do you
 think they end up using (as seen from the OS, not in terms of how big
 the live heap is)?
 
 I'm going to post full benchmarks + analysis on Wednesday.
 
 -Peter
 
 *) harddisk megabytes.  The file is 150,000,000 bytes ≈ 143 mebibytes.
 
 
 PS: For extra credit, what do you think is the peak memory use for this
 program when given an input file of 150MB?

Well, I'm not going to wait till Wednesday for the numbers!

$ ghc --version
The Glorious Glasgow Haskell Compilation System, version 6.8.2

$ du -hs 150M 
150M



Program 1:

 {-# LANGUAGE BangPatterns #-}
 
 import qualified Data.ByteString.Lazy.Char8 as B
 import GHC.Int (Int64)
 
 -- note that D.BS.Lazy.Char8.length is ByteString -> Int64
 --           D.BS.C8.length         is ByteString -> Int
 cnt :: B.ByteString -> Int64
 cnt bs = B.length (B.filter (== ' ') bs)

 main = do s <- B.getContents
           print (cnt s)

Ok, so lazy bytestrings. Should be constant space use, but two traversals
over each lazy chunk (since there's no lazy bytestring fusion). Not perfect,
but should be ok: length will consume chunks as they're produced by filter.

** Prediction:
Constant ~3M space use (runtime, plus small overhead)
Fast, due to chunk-wise processing.

** Result:
$ ghc -O2 A.hs -o A --make
$ time ./A < 150M +RTS -sstderr
./A +RTS -sstderr < 150M  1.01s user 0.10s system 98% cpu 1.123 total

And top says 40M allocated.

** Summary: Ok, pretty fast, but an unexpected(!) amount of memory allocated.

Now, this memory result is suspicious, I wonder if the now obsolete 'array
fusion' is messing things up. In Data.ByteString.Lazy, we have:

filter p = F.loopArr . F.loopL (F.filterEFL p) F.NoAcc

We keep meaning to replace this stuff with the stream fusion mechanisms, which 
compile a lot better. Perhaps the time has come to look at that :)
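For readers who haven't seen it, a minimal self-contained sketch of the
stream fusion idea being referred to (illustrative names, not the
bytestring-internal API): each producer is a non-recursive step function,
so length . filter composes into a single loop with no intermediate array.

    {-# LANGUAGE ExistentialQuantification, BangPatterns #-}

    data Step s a = Done | Skip !s | Yield a !s
    data Stream a = forall s. Stream (s -> Step s a) !s

    streamList :: [a] -> Stream a
    streamList xs0 = Stream next xs0
      where next []       = Done
            next (x : xs) = Yield x xs

    filterS :: (a -> Bool) -> Stream a -> Stream a
    filterS p (Stream next s0) = Stream next' s0
      where next' s = case next s of
              Done       -> Done
              Skip s'    -> Skip s'
              Yield x s' -> if p x then Yield x s' else Skip s'

    lengthS :: Stream a -> Int
    lengthS (Stream next s0) = go 0 s0
      where go !n s = case next s of
              Done       -> n
              Skip s'    -> go n s'
              Yield _ s' -> go (n + 1) s'

    -- e.g. lengthS (filterS (== ' ') (streamList "a b c")) == 2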

I'll put this memory allocation down as a bug that needs to be looked at.




Program 2:

 hs/space-bs-c8-acc-1.hs:
 {-# LANGUAGE BangPatterns #-}
 
 import qualified Data.ByteString.Char8 as B
 
 cnt :: Int -> B.ByteString -> Int
 cnt !acc bs = if B.null bs
                 then acc
                 else cnt (if B.head bs == ' ' then acc+1 else acc) (B.tail bs)

 main = do s <- B.getContents
           print (cnt 0 s)

Strict bytestrings, and you manually fuse the length/filter calculation.
Allocating all that memory will cost you.

** Prediction:
O(N) memory, around 150M allocated
Slower, due to cache effects (more data to traverse) and more indirections.

** Result:
top says 154M
./B +RTS -sstderr < 150M  1.10s user 0.52s system 111% cpu 1.454 total

** Summary: Seems reasonable, when it's doing all that allocation.



Program 2a:

But we could easily improve this program. Since:

    length (filter (== ' '))
      ==
    length (filterByte ' ')
      ==
    count ' '

And we have:

import qualified Data.ByteString.Char8 as B

cnt :: B.ByteString -> Int
cnt x = B.count ' ' x

main = do s <- B.getContents
          print (cnt s)

** Prediction: this should be a bit faster.

** Result:
$ time ./B < 150M
24569024
./B < 150M  0.66s user 0.55s system 113% cpu 1.070 total

** Summary: So that's the fastest program so far.

The rewrite rules to do these transformations used to be enabled, but
need looking at again. There should also be no real benefit to manually
fusing a length . filter loop like this; however, the old fusion system
used in bytestring might have some small overhead here. This also needs
looking at.
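For illustration, such a rewrite rule could look something like this (a
sketch; the rule name and exact form are mine, not the ones that shipped
in bytestring):

    import qualified Data.ByteString.Char8 as B

    {-# RULES
    "ByteString length/filter -> count"  forall c s.
        B.length (B.filter ((==) c) s) = B.count c s
      #-}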


Program 2b:

We can do even better if we read the file in:

import qualified Data.ByteString.Char8 as B
import System.IO.Posix.MMap

main = print . B.count ' ' =<< mmapFile "150M"

** Prediction: super fast

** Result:

$ time ./B2
24569024
./B2  0.31s user 0.01s system 101% cpu 0.314 total

(similar results if you use vanilla B.readFile too, fwiw).

Summary: This suggests to me we could look again at how strings of unknown size
 are read in.
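For instance (a sketch of mine, not current library code): stat the file
and make one exact-sized allocation, instead of the doubling game that
getContents has to play on a pipe:

    import System.IO
    import qualified Data.ByteString.Char8 as B

    -- read a whole file with a single allocation; this is more or less
    -- what B.readFile already does
    readFileExact :: FilePath -> IO B.ByteString
    readFileExact f = do
        h <- openBinaryFile f ReadMode
        n <- hFileSize h
        s <- B.hGet h (fromIntegral n)
        hClose h
        return s

    main = print . B.count ' ' =<< readFileExact "150M"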




Re: [Haskell-cafe] [RFC] benchmarks of bytestrings, teaser

2007-12-15 Thread Daniel Fischer
On Sunday, 16 December 2007 04:07, Don Stewart wrote:

 
 Program 7:

  ==
  hs/space-x-foldl.hs:
  {-# LANGUAGE BangPatterns #-}

  cnt :: String -> Int
  cnt bs = foldl (\sum c -> if c == ' ' then sum+1 else sum) 0 bs

  main = do s <- getContents
            print (cnt s)

 Hmm. Lazy accumulator eh, on String?  Should exhibit a space leak.

It doesn't (with -O2, at least); it seems ghc's strictness analyser did a
good job. It is indeed about 10x slower than ByteStrings, but very memory
friendly - and actually, on my machine, it's faster (not by much) than
Data.List.foldl'. And, again on my machine, Programme 3 is almost as slow
when compiled with 6.8.1 and twice as slow when compiled with 6.6.1.

 Nice little benchmark.

 -- Don

Cheers,
Daniel



Re: [Haskell-cafe] [RFC] benchmarks of bytestrings, teaser

2007-12-15 Thread Peter Lund
On Sun, 2007-12-16 at 04:53 +0100, Daniel Fischer wrote:

  Hmm. Lazy accumulator eh, on String?  Should exhibit a space leak.
 
 Doesn't (with -O2, at least), seems ghc's strictness analyser did a good job.
 It is indeed about 10* slower than ByteStrings, but very memory friendly - 

Daniel is right, there's no space leak.

Try it.  You'll get a nice surprise :)

-Peter



Re: [Haskell-cafe] [RFC] benchmarks of bytestrings, teaser

2007-12-15 Thread Don Stewart
firefly:
 On Sun, 2007-12-16 at 04:53 +0100, Daniel Fischer wrote:
 
   Hmm. Lazy accumulator eh, on String?  Should exhibit a space leak.
  
  Doesn't (with -O2, at least), seems ghc's strictness analyser did a good 
  job.
  It is indeed about 10* slower than ByteStrings, but very memory friendly - 
 
 Daniel is right, there's no space leak.
 
 Try it.  You'll get a nice surprise :)

Very nice. If we disable the strictness analyser,

$ ghc -O2 -fno-strictness A.hs -no-recomp -o A

We get a core loop that looks like:

lgo :: Int -> [Char] -> Int
lgo n xs = case xs of
    []     -> n
    a : as -> lgo (case a of                  -- sad dons
                     C# c1_amH -> case c1_amH of
                       ' ' -> case n of I# i -> I# (i +# 1)
                       _   -> n
                  ) as


Look at that big lazy expression for 'n' not being forced!
And when run:

1015M 1017M onproc/1 - 0:05 22.36% A

Scary stuff.  Lots of Int thunks :)

 But enabling the strictness analyser:

lgo :: Int# -> [Char] -> Int#
lgo (n :: Int#) (xs :: [Char]) =
    case xs of
      []     -> n
      a : as -> case a of
                  C# c -> case c of
                    ' ' -> lgo (n +# 1) as   -- makes me happy
                    _   -> lgo n as

And life is good again :)

What is quite amazing is how efficient this program is.
I had to rerun this a dozen or so times, since I didn't quite believe it:

$ time ./A < /usr/obj/data/150M
24569024
./A < /usr/obj/data/150M  2.42s user 0.47s system 100% cpu 2.883 total

Pretty stunning, I think.
Swapping in a slightly more eager structure, the lazy ByteString,

$ time ./A < /usr/obj/data/150M
24569024
./A < /usr/obj/data/150M  0.86s user 0.07s system 98% cpu 0.942 total

improves things by a good amount, but I think we can revisit the low level
performance of lazy bytestrings again, in light of all the changes to the
optimiser in the past 2 years.

-- Don


Re: [Haskell-cafe] [RFC] benchmarks of bytestrings, teaser

2007-12-15 Thread Peter Lund
On Sat, 2007-12-15 at 21:18 -0800, Don Stewart wrote:

 What is quite amazing is how efficient this program is.

Yep.  I was very surprised too.

Turns out there *was* a reason to run those string benchmarks, eh? ;)

 improves things by a good amount, but I think we can revisit the low level
 performance of lazy bytestrings again, in light of all the changes to the
 optimiser in the past 2 years.

Yep.

-Peter



Re: [Haskell-cafe] [RFC] benchmarks of bytestrings, teaser

2007-12-15 Thread Peter Lund
On Sat, 2007-12-15 at 19:07 -0800, Don Stewart wrote:

 Well, I'm not going to wait till Wednesday for the numbers!

But I got you to write down predictions before you ran the tests, which
is precisely what I wanted :)

 Summary,
 
   * Program 1 is fast, as expected, but exhibits a bug in the bytestring
 library's lazy bytestring fusion system.  Something in length or filter
 isn't doing the right job. This code will be replaced by the stream
 fusion system soon.

Good.

   * Program 2: as expected. strict IO uses O(N) space, and that has
 performance effects.
 
   * Program 3: lazy bytestrings use constant space, but you better avoid
 redundant bounds checks in the inner loops.

Maybe it's extra bounds-checking that makes it slow, as you say.  It
probably is.  I must admit that I couldn't follow the
core/stg/C--/assembly code at all.

   * Program 4: strings are silly

No they are not.

They are the default data structure for text and give the baseline that
bytestrings should beat.

I find it interesting to see *if* bytestrings beat it and if so, by how
much.  The vanilla string versions of my tests all use less memory than
any of the other versions, but they are a bit slower.  And perhaps not
as much slower as they should be...

   * Program 5: as expected. similar to program 2.
 
   * Program 6: strict foldl's over lazy bytestrings are good :)
fast, and constant space.
 
   * Program 7: see program 4.
 
 Pretty much as expected then, but with a bug identified in lazy bytestring 
 fusion (I think).
 
 Nice little benchmark.

Thanks :)

There are more...

-Peter
