Re: [Haskell-cafe] Haskell performance

2007-12-24 Thread Peter Lund
On Sun, 2007-12-23 at 11:52 +, Ian Lynagh wrote:
 On Thu, Dec 20, 2007 at 10:58:17AM +, Malcolm Wallace wrote:
  
  Nobench does already collect code size, but does not yet display it in
  the results table.  I specifically want to collect compile time as well.
  Not sure what the best way to measure allocation and peak memory use
  are?
 
 This:
 http://lists.osuosl.org/pipermail/darcs-devel/2006-January/004016.html
 should be Haskell-implementation-independent, but is probably
 Linux-specific. Adapting it to other Unix-like OSes is probably easy,
 but I have no idea about Windows.

Very nice.

A short-term improvement would perhaps be to use ptrace() to also sample
the program counter register?

On a longer-term scale, I wonder how hard it would be to implement a
valgrind skin to get much more precise heap-use information...

-Peter

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


[Haskell-cafe] [RFC] Preliminary benchmark graphs

2007-12-20 Thread Peter Lund
I added Don's three benchmarks and redid all my benchmarks with:
  ghc 6.6.1
  ghc 6.8.2
  ghc 6.8.2 + bytestring 0.9.0.2
  ghc 6.9.20071119
  ghc 6.9.20071119 + bytestring 0.9.0.2
  ghc head-as-of-yesterday-around-noon
  ghc head-as-of-yesterday-around-noon + bytestring 0.9.0.2

I tried to get the draft emails with intro, methodology, discussion, and
conclusion completed yesterday but my brain simply wasn't up to it.

Unfortunately, it still isn't quite up to it :(

Since the perfect is the enemy of the good, I'll post the graphs for all
the above 7 runs now.

The rest will be forthcoming when I can think straight again, hopefully
some time in the evening.

I have scripts to help me install precisely the ghc version I want (and
to make it easy to duplicate my results).  I also have scripts to run
the benchmarks, get correct memory measurements (-sstderr doesn't seem
trustworthy), check the validity of each timing measurement, generate
I/O traces, generate reports, and finally merging reports from different
runs with or without rescaling.

This attached report uses rescaling since I'm compiling different
compiler/library combinations on the same machine.

One thing that shows up very clearly in the graphs is that the memory
situation is bad and that Don's recent fix only really solves the
problem on 6.8.2, not on head.

Another thing is that the backend really lets us down.  I have
hand-tweaked a simple byte counting benchmark and a simple space
counting benchmark through increasing degrees of refinement from the
banal to the heroic and timed those, too.  It seems like improved
register use alone would halve the run time.  Add a sprinkling of MMX
heroics and things get even better (but that's not realistic or even a
good idea at the moment).  The reason why lazy bytestrings perform so
badly, speed-wise, is most likely a backend that doesn't do hoisting of
bounds checks out of loops.

-Peter

charybdis
ghc 6.6.1
AMD Athlon(tm) 64 Processor 3000+
2009.160 MHz
TESTKIND=THOROUGH
SUFFIX=


charybdis
ghc 6.8.2
AMD Athlon(tm) 64 Processor 3000+
2009.160 MHz
TESTKIND=THOROUGH
SUFFIX=


charybdis
ghc 6.8.2
AMD Athlon(tm) 64 Processor 3000+
2009.160 MHz
TESTKIND=THOROUGH
SUFFIX=-bs0.9.0.2


charybdis
ghc 6.9.20071119
AMD Athlon(tm) 64 Processor 3000+
2009.160 MHz
TESTKIND=THOROUGH
SUFFIX=


charybdis
ghc 6.9.20071119
AMD Athlon(tm) 64 Processor 3000+
2009.160 MHz
TESTKIND=THOROUGH
SUFFIX=-bs0.9.0.2


charybdis
ghc 6.9.20071217
AMD Athlon(tm) 64 Processor 3000+
2009.160 MHz
TESTKIND=THOROUGH
SUFFIX=


charybdis
ghc 6.9.20071217
AMD Athlon(tm) 64 Processor 3000+
2009.160 MHz
TESTKIND=THOROUGH
SUFFIX=-bs0.9.0.2


Time (byte counting)   std
   avg dev slack
hs/byte-bsacc:   1.020 40‰ 0.0  █▏|
 --  0.703  5‰ 0.4  ███▌  |
 --  0.702  7‰ 0.3  ███▌  |
 --  0.705  7‰ 0.1  ███▋  |
 --  0.712  3‰ 0.1  ███▋  |
 --  0.706  7‰ 0.1  ███▋  |
 --  0.707  7‰ 0.9  ███▋  |
hs/byte-bsfoldlx:0.789  3‰ 0.3    |
 --  0.993  2‰ 0.2  █ |
 --  1.102  1‰ 0.2  █▋|
 --  1.002  1‰ 0.5  █▏|
 --  1.112  1‰ 0.3  █▋|
 --  1.024  2‰ 0.2  █▏|
 --  1.111  1‰ 0.1  █▋|
hs/byte-bsfoldrx:0.813  3‰ 0.1  ▏ |
 --  1.102  2‰ 0.3  █▋|
 --  1.100  1‰ 0.1  █▋|
 --  1.112  2‰ 0.1  █▋|
 --  1.114  1‰ 0.4  █▋|
 --  1.113  2‰ 0.5  █▋|
 --  1.112  1‰ 0.2  █▋|
hs/byte-bsl---acc:   3.599 13‰ 0.0  ██▎   |
 --  2.609 17‰ 0.0  █▎|
 --  2.560 15‰ 0.1  █ |
 --  2.595 14‰ 0.1  █▏|
 --  2.574 16‰ 0.1  █ |
 --  2.613 12‰ 0.2  █▎|
 --  2.656 44‰ 0.1  █▌|
hs/byte-x-acc-1: 4.606  5‰ 

[Haskell-cafe] Re: Haskell performance

2007-12-20 Thread Peter Lund
On Thu, 2007-12-20 at 10:37 +, Simon Peyton-Jones wrote:
 Don, and others,
 
 This thread triggered something I've had at the back of my mind for some time.
 
 The traffic on Haskell Cafe suggests that there is a lot of interest
 in the performance of Haskell programs.  However, at the moment we
 don't have any good *performance* regression tests for GHC. We have
 zillions of behavioural regression tests (this program should compile,
 this one should fail), but nothing much on performance. We have the
 nofib suite, but it's pretty static these days.  Peter's set of
 benchmarks are great (if very specific to strings etc, but that's
 fine), and it'd be a pity of they now sink beneath the waves.

They won't!  I have set up a mercurial repository on
http://vax64.dyndns.org/repo/hg/ together with the ghc install scripts
I've used.

Once the basic string performance is under control, I intend to expand
it with more advanced parsing, with I/O, and with backend stuff.

I like Parsec.  But it seems to hang on to a bit more memory than it
should and I think it should be faster than it is.

Fast I/O is not simple, and to do it really well, one probably needs to
use threading and mmap() in combination.  mmap() alone is usually not
very performant unless the file has already been cached by the operating
system.

And the backend.  Ouch.  The frontend is absolutely fantastic and does
heroic stuff -- but the backend... apart from having many phases, it
doesn't do much ;)

 What would be v helpful would be a regression suite aimed at
 performance, that benchmarked GHC (and perhaps other Haskell
 compilers) against a set of programs, regularly, and published the
 results on a web page, highlighting regressions.  Kind of like the
 Shootout, only just for Haskell, and with many more programs.

I don't see why a lot of that couldn't be added to the framework I have.
It's GPLv2 :)

 Like Hackage, it should be easy to add a new program.  It'd be good to
 measure run-time, but allocation count, peak memory use, code size,

My framework captures the allocation count but it doesn't use it for
anything.  It gets its peak memory info from /proc/self/status (which it
captures, together with /proc/self/maps, through a LD_PRELOAD trick).
'-sstderr' seemed a bit unreliable in my experience, so I fell back to
asking the operating system.

Making sure one gets stable times + a good estimate of the quality of
the measurements is also important (which my code already does).

  compilation time are also good (and rather more stable) numbers to
 capture.
 
 Does anyone feel like doing this?  It'd be a great service.  No need
 to know anything much about GHC.

I think I've made a start but this is clearly not something I'm willing
to take on by myself.

-Peter

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] New to Haskell

2007-12-18 Thread Peter Lund
On Tue, 2007-12-18 at 12:53 +0200, Cristian Baboi wrote:

  The semantics of IO, and the guarantees of the runtime.
 
  IO specifies that () means compose two actions to make a larger  
  action which does the first actions, then the second action.
 
  [do {a; a;} is notation for a  a]
 
  The RTS specifies that the main action is performed exactly once.
 
 Is this dependent on the implementation (if I use GHC or Hugs) or is  
 something that the language say ?

Part of the language.  You do get your guarantee written in blood.

-Peter

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] [RFC] benchmarks of bytestrings, teaser

2007-12-17 Thread Peter Lund
On Sun, 2007-12-16 at 15:21 -0800, Don Stewart wrote:

 An updated bytestring library is at :
 
 
 http://hackage.haskell.org/cgi-bin/hackage-scripts/package/bytestring-0.9.0.2
 
 Enjoy! :)

Thanks!

-Peter


___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


[Haskell-cafe] [RFC] benchmarks of bytestrings, teaser

2007-12-15 Thread Peter Lund
What do you think the relative speeds are of the six small haskell
programs at the end of this email?

All they do is read from stdin and count the number of spaces they see.
There are two that use strict bytestrings, two that use lazy
bytestrings, and two that use the standard Haskell strings.  Three use a
recursive function with an accumulator parameter and three use a foldl
with a lambda function.

Say the fastest one takes the time 1.  How much time will the others
take?

And how about memory?  How much memory do you think they require?  Let's
say we feed a 150MB(*) file into each of them, how many megabytes do you
think they end up using (as seen from the OS, not in terms of how big
the live heap is)?

I'm going to post full benchmarks + analysis on Wednesday.

-Peter

*) hardddisk megabytes.  The file is 15034 bytes ≈ 143 mebibytes.


PS: For extra credit, what do you think is the peak memory use for this
program when given an input file of 150MB?


{-# LANGUAGE BangPatterns #-}

import qualified Data.ByteString.Lazy.Char8 as B
import GHC.Int (Int64)

-- note that D.BS.Lazy.Char8.length is ByteString - Int64
--   D.BS.C8.length is ByteString - Int
cnt :: B.ByteString - Int64
cnt bs  = B.length (B.filter (== ' ') bs)

main = do s - B.getContents
  print (cnt s)




==
hs/space-bs-c8-acc-1.hs:
{-# LANGUAGE BangPatterns #-}

import qualified Data.ByteString.Char8 as B

cnt :: Int - B.ByteString - Int
cnt !acc bs = if B.null bs
then acc
else cnt (if B.head bs == ' ' then acc+1 else acc) (B.tail bs)

main = do s - B.getContents
  print (cnt 0 s)
==
hs/space-bslc8-acc-1.hs:
{-# LANGUAGE BangPatterns #-}

import qualified Data.ByteString.Lazy.Char8 as B

cnt :: Int - B.ByteString - Int
cnt !acc bs = if B.null bs
then acc
else cnt (if B.head bs == ' ' then acc+1 else acc) (B.tail bs)

main = do s - B.getContents
  print (cnt 0 s)
==
hs/space-x-acc-1.hs:
{-# LANGUAGE BangPatterns #-}

cnt :: Int - String - Int
cnt !acc bs = if null bs
then acc
else cnt (if head bs == ' ' then acc+1 else acc) (tail bs)

main = do s - getContents
  print (cnt 0 s)
==
hs/space-bs-c8-foldlx-1.hs:
{-# LANGUAGE BangPatterns #-}

import qualified Data.ByteString.Char8 as B

cnt :: B.ByteString - Int
cnt bs  = B.foldl' (\sum c - if c == ' ' then sum+1 else sum) 0 bs

main = do s - B.getContents
  print (cnt s)
==
hs/space-bslc8-foldlx-1.hs:
{-# LANGUAGE BangPatterns #-}

import qualified Data.ByteString.Lazy.Char8 as B

cnt :: B.ByteString - Int
cnt bs  = B.foldl' (\sum c - if c == ' ' then sum+1 else sum) 0 bs

main = do s - B.getContents
  print (cnt s)
==
hs/space-x-foldl.hs:
{-# LANGUAGE BangPatterns #-}

cnt :: String - Int
cnt bs  = foldl (\sum c - if c == ' ' then sum+1 else sum) 0 bs

main = do s - getContents
  print (cnt s)
==


___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] [RFC] benchmarks of bytestrings, teaser

2007-12-15 Thread Peter Lund
On Sat, 2007-12-15 at 14:34 +, Duncan Coutts wrote:

 Ok, I presume this is a guessing game and we're supposed to just look at
 the code without running and timing them.

Precisely :)

  All they do is read from stdin and count the number of spaces they see.
  There are two that use strict bytestrings, two that use lazy
  bytestrings, and two that use the standard Haskell strings.  Three use a
  recursive function with an accumulator parameter and three use a foldl
  with a lambda function.
  
  Say the fastest one takes the time 1.  How much time will the others
  take?
  
  And how about memory?  How much memory do you think they require?  Let's
  say we feed a 150MB(*) file into each of them, how many megabytes do you
  think they end up using (as seen from the OS, not in terms of how big
  the live heap is)?
  
  I'm going to post full benchmarks + analysis on Wednesday.
 
 Right'o. I'll have a go. Lets see if I can't embarrass myself with being
 completely inaccurate.

Thanks for biting!

You were, thankfully, only almost completely inaccurate ;)

  PS: For extra credit, what do you think is the peak memory use for this
  program when given an input file of 150MB?

 Hmm. So that should work in constant memory, a few 64 chunks at once.
 I'd expect this to be pretty fast.

You are right about the speed.
Can you guess a number in kilobytes?

  
  
  
  ==
  hs/space-bs-c8-acc-1.hs:
  {-# LANGUAGE BangPatterns #-}
  
  import qualified Data.ByteString.Char8 as B
  
  cnt :: Int - B.ByteString - Int
  cnt !acc bs = if B.null bs
  then acc
  else cnt (if B.head bs == ' ' then acc+1 else acc) (B.tail bs)
  
  main = do s - B.getContents
print (cnt 0 s)
 
 This uses strict bytestrings so will use at least 150Mb and that'll make
 it a good deal slower. In fact it'll be worse than that since
 getContents does not know in advance how big the input will be so it has
 to play the doubling and copying game. So it'll end up copying all the
 data roughly twice. cnt is strict and tail recursive so that shouldn't
 be any problem, though it's probably not as fast as the first length .
 filter since head, tail, null all have to do bounds checks.

You are right about the memory.  It is actually slightly faster than the
extra credit (length/filter combination) above.

  ==
  hs/space-bslc8-acc-1.hs:
  {-# LANGUAGE BangPatterns #-}
  
  import qualified Data.ByteString.Lazy.Char8 as B
  
  cnt :: Int - B.ByteString - Int
  cnt !acc bs = if B.null bs
  then acc
  else cnt (if B.head bs == ' ' then acc+1 else acc) (B.tail bs)
  
  main = do s - B.getContents
print (cnt 0 s)
 
 For the same reason as above, I'd expect this cnt to be slower than
 B.length . B.filter (== ' ')

It is slower but not for the same reason as above.

  ==
  hs/space-x-acc-1.hs:
  {-# LANGUAGE BangPatterns #-}
  
  cnt :: Int - String - Int
  cnt !acc bs = if null bs
  then acc
  else cnt (if head bs == ' ' then acc+1 else acc) (tail bs)
  
  main = do s - getContents
print (cnt 0 s)
 
 Lazy, so constant memory use, but much higher constant factors due to
 using String.

Spot on.

  ==
  hs/space-bs-c8-foldlx-1.hs:
  {-# LANGUAGE BangPatterns #-}
  
  import qualified Data.ByteString.Char8 as B
  
  cnt :: B.ByteString - Int
  cnt bs  = B.foldl' (\sum c - if c == ' ' then sum+1 else sum) 0 bs
  
  main = do s - B.getContents
print (cnt s)
 
 This is of course still strict so that's going to make the reading slow.

Nope.

 This is a manually fused B.length . B.filter (== ' ') which hopefully is
 the same speed as the automatically fused one if the fusion is working
 ok. If not, then the B.length . B.filter (== ' ') will be doing a extra
 copy, and memory writes are expensive.
 
  ==
  hs/space-bslc8-foldlx-1.hs:
  {-# LANGUAGE BangPatterns #-}
  
  import qualified Data.ByteString.Lazy.Char8 as B
  
  cnt :: B.ByteString - Int
  cnt bs  = B.foldl' (\sum c - if c == ' ' then sum+1 else sum) 0 bs
  
  main = do s - B.getContents
print (cnt s)
 
 As above but now in constant memory space.

Nope.

  ==
  hs/space-x-foldl.hs:
  {-# LANGUAGE BangPatterns #-}
  
  cnt :: String - Int
  cnt bs  = foldl (\sum c - if c == ' ' then sum+1 else sum) 0 bs
  
  main = do s - getContents
print (cnt s)
 
 Oh, no! not foldl that's a killer.

You think it's worse than the program just above?

 
 Ok, so best way to summarise I think is to organise by data type since I
 think that'll dominate.
 
 So I think the lazy bytestring versions will be fastest due to having
 the best memory access patterns and doing the least copying. I think the
 foldl's will be faster than the explicit accumulators due to having
 fewer bounds checks.
 
 space-bslc8-foldlx-1
 

Re: [Haskell-cafe] [RFC] benchmarks of bytestrings, teaser

2007-12-15 Thread Peter Lund
On Sat, 2007-12-15 at 11:59 -0800, Don Stewart wrote:
 firefly:
  What do you think the relative speeds are of the six small haskell
  programs at the end of this email?
  
  All they do is read from stdin and count the number of spaces they see.
  There are two that use strict bytestrings, two that use lazy
  bytestrings, and two that use the standard Haskell strings.  Three use a
  recursive function with an accumulator parameter and three use a foldl
  with a lambda function.
  
  Say the fastest one takes the time 1.  How much time will the others
  take?
  
  And how about memory?  How much memory do you think they require?  Let's
  say we feed a 150MB(*) file into each of them, how many megabytes do you
  think they end up using (as seen from the OS, not in terms of how big
  the live heap is)?
  
  I'm going to post full benchmarks + analysis on Wednesday.
 
 How are you compiling these programs, by the way?  ghc-6.8.2 -O2  ?
 (-O2 is required for bytestrings :)

With -O2.  I have measured with 6.8.1, 6.9.20071119, 6.9.20071208
(approx), and 6.9.200712xx (as of yesterday or today).  The picture
changes very little with the compiler, if it changes at all.

I have run them on three very different microarchitectures (2GHz
Athlon64 3000+, 1667MHz Core Duo, 600MHz Pentium III).

All the measurements are scripted.  'make phase1' compiles the
benchmarks, creates input files of various sizes, and runs each
benchmark once to gather information about memory use (peak RSS + ghc
RTS' own information about allocations and gcs), page faults, and an
strace.  This phase is not timing sensitive so I can browse the web and
listen to the music etc. while running it.

'make phase2' runs each benchmark a number of times, calculates the
average time for each + the relative size of the standard deviation +
how much user+sys is different from real (as reported by bash' built-in
time command).  A report with barcharts indicating relative time and
relative peak RSS is generated in either pure ASCII or in UTF-8 (with
fractional-width block chars so the charts look nice and have high
resolution).  If the measurements are deemed to be bad (too high
standard deviation or user+sys doesn't add up to real) then the barchart
is done with '%' characters.  The quality indicators for each timing
test are always printed out next to each bar, so we know how close we
are to being perfect or, conversely, how bad the measurements are.

There is a script that analyzes the I/O pattern and sums it up (as 4375
x read(3, 4096, ...) = 4096  followed by 1 x read(3, 4096, ...) = 1242
followed by 1 x read(3, 4096, ...) = 0 an similar).

There are a set of simple I/O programs in C so we can compare ghc's
performance with speed of light, and so different I/O strategies can
be compared in a cleaner, purer form.

There are also 'make cache' (runs cachegrind), 'make
hs/.doc' (creates a file with source code + core + stg + c-- +
assembly + times for a given benchmark), etc.

'make sysinfo' creates a file with information about the Linux
distribution used (/etc/lsb-release), kernel version (uname -a), CPU
(/proc/cpuinfo), and compilers used (ghc --version, gcc --version).

'make zipdata' creates a zip file of about 20K with all the raw time
measurements + the sysinfo.

I also have a set of scripts that installs each ghc version so 1) it is
easy for me to repeat the tests on various machines and 2) you can see
exactly which ghc versions I use and repeat the experiments yourself.

-Peter

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] [RFC] benchmarks of bytestrings, teaser

2007-12-15 Thread Peter Lund
On Sun, 2007-12-16 at 04:53 +0100, Daniel Fischer wrote:

  Hmm. Lazy accumulator eh, on String?  Should exhibit a space leak.
 
 Doesn't (with -O2, at least), seems ghc's strictness analyser did a good job.
 It is indeed about 10* slower than ByteStrings, but very memory friendly - 

Daniel is right, there's no space leak.

Try it.  You'll get a nice surprise :)

-Peter

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] [RFC] benchmarks of bytestrings, teaser

2007-12-15 Thread Peter Lund
On Sat, 2007-12-15 at 21:18 -0800, Don Stewart wrote:

 What is quite amazing is how efficient this program is.

Yep.  I was very surprised too.

Turns out there *was* a reason to run those string benchmarks, eh? ;)

 improves things by a good amount, but I think we can revisit the low level
 performance of lazy bytestrings again, in light of all the changes to the
 optimiser in the past 2 years.

Yep.

-Peter

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] [RFC] benchmarks of bytestrings, teaser

2007-12-15 Thread Peter Lund
On Sat, 2007-12-15 at 19:07 -0800, Don Stewart wrote:

 Well, I'm not going to wait till Wednesday for the numbers!

But I got you to write down predictions before you ran the tests, which
is precisely what I wanted :)

 Summary,
 
   * Program 1 is fast, as expected, but exhbits a bug in the bytestring
 library's lazy bytestring fusion system.  Something in length or 
 filter
 isn't doing the right job. This code will be replaced by the stream 
 fusion
 system soon.

Good.

   * Program 2: as expected. strict IO uses O(N) space, and that has 
 performance 
 effects.
 
   * Program 3: lazy bytestrings use constant space, but you better avoid
 redundant bounds checks in the inner loops. 

Maybe its extra bounds-checking that makes it slow, as you say.  It
probably is.  I must admit that I couldn't follow the
core/stg/C--/assembly code at all.

   * Program 4: strings are silly

No they are not.

They are the default data structure for text and give the baseline that
bytestrings should beat.

I find it interesting to see *if* bytestrings beat it and if so, by how
much.  The vanilla string versions of my tests all use less memory than
any of the other versions, but they are a bit slower.  And perhaps not
as much slower as they should be...

   * Program 5: as expected. similar to program 2.
 
   * Program 6: strict foldl's over lazy bytestrings are good :)
fast, and constant space.
 
   * Program 7: see program 4.
 
 Pretty much as expected then, but with a bug identified in lazy bytestring 
 fusion (I think).
 
 Nice little benchmark.

Thanks :)

There are more...

-Peter

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe