Re: [Haskell-cafe] Haskell performance
On Sun, 2007-12-23 at 11:52 +, Ian Lynagh wrote: On Thu, Dec 20, 2007 at 10:58:17AM +, Malcolm Wallace wrote: Nobench does already collect code size, but does not yet display it in the results table. I specifically want to collect compile time as well. Not sure what the best way to measure allocation and peak memory use are? This: http://lists.osuosl.org/pipermail/darcs-devel/2006-January/004016.html should be Haskell-implementation-independent, but is probably Linux-specific. Adapting it to other Unix-like OSes is probably easy, but I have no idea about Windows. Very nice. A short-term improvement would perhaps be to use ptrace() to also sample the program counter register? On a longer-term scale, I wonder how hard it would be to implement a valgrind skin to get much more precise heap-use information... -Peter ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
[Haskell-cafe] [RFC] Preliminary benchmark graphs
I added Don's three benchmarks and redid all my benchmarks with: ghc 6.6.1 ghc 6.8.2 ghc 6.8.2 + bytestring 0.9.0.2 ghc 6.9.20071119 ghc 6.9.20071119 + bytestring 0.9.0.2 ghc head-as-of-yesterday-around-noon ghc head-as-of-yesterday-around-noon + bytestring 0.9.0.2 I tried to get the draft emails with intro, methodology, discussion, and conclusion completed yesterday but my brain simply wasn't up to it. Unfortunately, it still isn't quite up to it :( Since the perfect is the enemy of the good, I'll post the graphs for all the above 7 runs now. The rest will be forthcoming when I can think straight again, hopefully some time in the evening. I have scripts to help me install precisely the ghc version I want (and to make it easy to duplicate my results). I also have scripts to run the benchmarks, get correct memory measurements (-sstderr doesn't seem trustworthy), check the validity of each timing measurement, generate I/O traces, generate reports, and finally merging reports from different runs with or without rescaling. This attached report uses rescaling since I'm compiling different compiler/library combinations on the same machine. One thing that shows up very clearly in the graphs is that the memory situation is bad and that Don's recent fix only really solves the problem on 6.8.2, not on head. Another thing is that the backend really lets us down. I have hand-tweaked a simple byte counting benchmark and a simple space counting benchmark through increasing degrees of refinement from the banal to the heroic and timed those, too. It seems like improved register use alone would halve the run time. Add a sprinkling of MMX heroics and things get even better (but that's not realistic or even a good idea at the moment). The reason why lazy bytestrings perform so badly, speed-wise, is most likely a backend that doesn't do hoisting of bounds checks out of loops. -Peter charybdis ghc 6.6.1 AMD Athlon(tm) 64 Processor 3000+ 2009.160 MHz TESTKIND=THOROUGH SUFFIX= charybdis ghc 6.8.2 AMD Athlon(tm) 64 Processor 3000+ 2009.160 MHz TESTKIND=THOROUGH SUFFIX= charybdis ghc 6.8.2 AMD Athlon(tm) 64 Processor 3000+ 2009.160 MHz TESTKIND=THOROUGH SUFFIX=-bs0.9.0.2 charybdis ghc 6.9.20071119 AMD Athlon(tm) 64 Processor 3000+ 2009.160 MHz TESTKIND=THOROUGH SUFFIX= charybdis ghc 6.9.20071119 AMD Athlon(tm) 64 Processor 3000+ 2009.160 MHz TESTKIND=THOROUGH SUFFIX=-bs0.9.0.2 charybdis ghc 6.9.20071217 AMD Athlon(tm) 64 Processor 3000+ 2009.160 MHz TESTKIND=THOROUGH SUFFIX= charybdis ghc 6.9.20071217 AMD Athlon(tm) 64 Processor 3000+ 2009.160 MHz TESTKIND=THOROUGH SUFFIX=-bs0.9.0.2 Time (byte counting) std avg dev slack hs/byte-bsacc: 1.020 40‰ 0.0 █▏| -- 0.703 5‰ 0.4 ███▌ | -- 0.702 7‰ 0.3 ███▌ | -- 0.705 7‰ 0.1 ███▋ | -- 0.712 3‰ 0.1 ███▋ | -- 0.706 7‰ 0.1 ███▋ | -- 0.707 7‰ 0.9 ███▋ | hs/byte-bsfoldlx:0.789 3‰ 0.3 | -- 0.993 2‰ 0.2 █ | -- 1.102 1‰ 0.2 █▋| -- 1.002 1‰ 0.5 █▏| -- 1.112 1‰ 0.3 █▋| -- 1.024 2‰ 0.2 █▏| -- 1.111 1‰ 0.1 █▋| hs/byte-bsfoldrx:0.813 3‰ 0.1 ▏ | -- 1.102 2‰ 0.3 █▋| -- 1.100 1‰ 0.1 █▋| -- 1.112 2‰ 0.1 █▋| -- 1.114 1‰ 0.4 █▋| -- 1.113 2‰ 0.5 █▋| -- 1.112 1‰ 0.2 █▋| hs/byte-bsl---acc: 3.599 13‰ 0.0 ██▎ | -- 2.609 17‰ 0.0 █▎| -- 2.560 15‰ 0.1 █ | -- 2.595 14‰ 0.1 █▏| -- 2.574 16‰ 0.1 █ | -- 2.613 12‰ 0.2 █▎| -- 2.656 44‰ 0.1 █▌| hs/byte-x-acc-1: 4.606 5‰
[Haskell-cafe] Re: Haskell performance
On Thu, 2007-12-20 at 10:37 +, Simon Peyton-Jones wrote: Don, and others, This thread triggered something I've had at the back of my mind for some time. The traffic on Haskell Cafe suggests that there is a lot of interest in the performance of Haskell programs. However, at the moment we don't have any good *performance* regression tests for GHC. We have zillions of behavioural regression tests (this program should compile, this one should fail), but nothing much on performance. We have the nofib suite, but it's pretty static these days. Peter's set of benchmarks are great (if very specific to strings etc, but that's fine), and it'd be a pity of they now sink beneath the waves. They won't! I have set up a mercurial repository on http://vax64.dyndns.org/repo/hg/ together with the ghc install scripts I've used. Once the basic string performance is under control, I intend to expand it with more advanced parsing, with I/O, and with backend stuff. I like Parsec. But it seems to hang on to a bit more memory than it should and I think it should be faster than it is. Fast I/O is not simple, and to do it really well, one probably needs to use threading and mmap() in combination. mmap() alone is usually not very performant unless the file has already been cached by the operating system. And the backend. Ouch. The frontend is absolutely fantastic and does heroic stuff -- but the backend... apart from having many phases, it doesn't do much ;) What would be v helpful would be a regression suite aimed at performance, that benchmarked GHC (and perhaps other Haskell compilers) against a set of programs, regularly, and published the results on a web page, highlighting regressions. Kind of like the Shootout, only just for Haskell, and with many more programs. I don't see why a lot of that couldn't be added to the framework I have. It's GPLv2 :) Like Hackage, it should be easy to add a new program. It'd be good to measure run-time, but allocation count, peak memory use, code size, My framework captures the allocation count but it doesn't use it for anything. It gets its peak memory info from /proc/self/status (which it captures, together with /proc/self/maps, through a LD_PRELOAD trick). '-sstderr' seemed a bit unreliable in my experience, so I fell back to asking the operating system. Making sure one gets stable times + a good estimate of the quality of the measurements is also important (which my code already does). compilation time are also good (and rather more stable) numbers to capture. Does anyone feel like doing this? It'd be a great service. No need to know anything much about GHC. I think I've made a start but this is clearly not something I'm willing to take on by myself. -Peter ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] New to Haskell
On Tue, 2007-12-18 at 12:53 +0200, Cristian Baboi wrote: The semantics of IO, and the guarantees of the runtime. IO specifies that () means compose two actions to make a larger action which does the first actions, then the second action. [do {a; a;} is notation for a a] The RTS specifies that the main action is performed exactly once. Is this dependent on the implementation (if I use GHC or Hugs) or is something that the language say ? Part of the language. You do get your guarantee written in blood. -Peter ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] [RFC] benchmarks of bytestrings, teaser
On Sun, 2007-12-16 at 15:21 -0800, Don Stewart wrote: An updated bytestring library is at : http://hackage.haskell.org/cgi-bin/hackage-scripts/package/bytestring-0.9.0.2 Enjoy! :) Thanks! -Peter ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
[Haskell-cafe] [RFC] benchmarks of bytestrings, teaser
What do you think the relative speeds are of the six small haskell programs at the end of this email? All they do is read from stdin and count the number of spaces they see. There are two that use strict bytestrings, two that use lazy bytestrings, and two that use the standard Haskell strings. Three use a recursive function with an accumulator parameter and three use a foldl with a lambda function. Say the fastest one takes the time 1. How much time will the others take? And how about memory? How much memory do you think they require? Let's say we feed a 150MB(*) file into each of them, how many megabytes do you think they end up using (as seen from the OS, not in terms of how big the live heap is)? I'm going to post full benchmarks + analysis on Wednesday. -Peter *) hardddisk megabytes. The file is 15034 bytes ≈ 143 mebibytes. PS: For extra credit, what do you think is the peak memory use for this program when given an input file of 150MB? {-# LANGUAGE BangPatterns #-} import qualified Data.ByteString.Lazy.Char8 as B import GHC.Int (Int64) -- note that D.BS.Lazy.Char8.length is ByteString - Int64 -- D.BS.C8.length is ByteString - Int cnt :: B.ByteString - Int64 cnt bs = B.length (B.filter (== ' ') bs) main = do s - B.getContents print (cnt s) == hs/space-bs-c8-acc-1.hs: {-# LANGUAGE BangPatterns #-} import qualified Data.ByteString.Char8 as B cnt :: Int - B.ByteString - Int cnt !acc bs = if B.null bs then acc else cnt (if B.head bs == ' ' then acc+1 else acc) (B.tail bs) main = do s - B.getContents print (cnt 0 s) == hs/space-bslc8-acc-1.hs: {-# LANGUAGE BangPatterns #-} import qualified Data.ByteString.Lazy.Char8 as B cnt :: Int - B.ByteString - Int cnt !acc bs = if B.null bs then acc else cnt (if B.head bs == ' ' then acc+1 else acc) (B.tail bs) main = do s - B.getContents print (cnt 0 s) == hs/space-x-acc-1.hs: {-# LANGUAGE BangPatterns #-} cnt :: Int - String - Int cnt !acc bs = if null bs then acc else cnt (if head bs == ' ' then acc+1 else acc) (tail bs) main = do s - getContents print (cnt 0 s) == hs/space-bs-c8-foldlx-1.hs: {-# LANGUAGE BangPatterns #-} import qualified Data.ByteString.Char8 as B cnt :: B.ByteString - Int cnt bs = B.foldl' (\sum c - if c == ' ' then sum+1 else sum) 0 bs main = do s - B.getContents print (cnt s) == hs/space-bslc8-foldlx-1.hs: {-# LANGUAGE BangPatterns #-} import qualified Data.ByteString.Lazy.Char8 as B cnt :: B.ByteString - Int cnt bs = B.foldl' (\sum c - if c == ' ' then sum+1 else sum) 0 bs main = do s - B.getContents print (cnt s) == hs/space-x-foldl.hs: {-# LANGUAGE BangPatterns #-} cnt :: String - Int cnt bs = foldl (\sum c - if c == ' ' then sum+1 else sum) 0 bs main = do s - getContents print (cnt s) == ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] [RFC] benchmarks of bytestrings, teaser
On Sat, 2007-12-15 at 14:34 +, Duncan Coutts wrote: Ok, I presume this is a guessing game and we're supposed to just look at the code without running and timing them. Precisely :) All they do is read from stdin and count the number of spaces they see. There are two that use strict bytestrings, two that use lazy bytestrings, and two that use the standard Haskell strings. Three use a recursive function with an accumulator parameter and three use a foldl with a lambda function. Say the fastest one takes the time 1. How much time will the others take? And how about memory? How much memory do you think they require? Let's say we feed a 150MB(*) file into each of them, how many megabytes do you think they end up using (as seen from the OS, not in terms of how big the live heap is)? I'm going to post full benchmarks + analysis on Wednesday. Right'o. I'll have a go. Lets see if I can't embarrass myself with being completely inaccurate. Thanks for biting! You were, thankfully, only almost completely inaccurate ;) PS: For extra credit, what do you think is the peak memory use for this program when given an input file of 150MB? Hmm. So that should work in constant memory, a few 64 chunks at once. I'd expect this to be pretty fast. You are right about the speed. Can you guess a number in kilobytes? == hs/space-bs-c8-acc-1.hs: {-# LANGUAGE BangPatterns #-} import qualified Data.ByteString.Char8 as B cnt :: Int - B.ByteString - Int cnt !acc bs = if B.null bs then acc else cnt (if B.head bs == ' ' then acc+1 else acc) (B.tail bs) main = do s - B.getContents print (cnt 0 s) This uses strict bytestrings so will use at least 150Mb and that'll make it a good deal slower. In fact it'll be worse than that since getContents does not know in advance how big the input will be so it has to play the doubling and copying game. So it'll end up copying all the data roughly twice. cnt is strict and tail recursive so that shouldn't be any problem, though it's probably not as fast as the first length . filter since head, tail, null all have to do bounds checks. You are right about the memory. It is actually slightly faster than the extra credit (length/filter combination) above. == hs/space-bslc8-acc-1.hs: {-# LANGUAGE BangPatterns #-} import qualified Data.ByteString.Lazy.Char8 as B cnt :: Int - B.ByteString - Int cnt !acc bs = if B.null bs then acc else cnt (if B.head bs == ' ' then acc+1 else acc) (B.tail bs) main = do s - B.getContents print (cnt 0 s) For the same reason as above, I'd expect this cnt to be slower than B.length . B.filter (== ' ') It is slower but not for the same reason as above. == hs/space-x-acc-1.hs: {-# LANGUAGE BangPatterns #-} cnt :: Int - String - Int cnt !acc bs = if null bs then acc else cnt (if head bs == ' ' then acc+1 else acc) (tail bs) main = do s - getContents print (cnt 0 s) Lazy, so constant memory use, but much higher constant factors due to using String. Spot on. == hs/space-bs-c8-foldlx-1.hs: {-# LANGUAGE BangPatterns #-} import qualified Data.ByteString.Char8 as B cnt :: B.ByteString - Int cnt bs = B.foldl' (\sum c - if c == ' ' then sum+1 else sum) 0 bs main = do s - B.getContents print (cnt s) This is of course still strict so that's going to make the reading slow. Nope. This is a manually fused B.length . B.filter (== ' ') which hopefully is the same speed as the automatically fused one if the fusion is working ok. If not, then the B.length . B.filter (== ' ') will be doing a extra copy, and memory writes are expensive. == hs/space-bslc8-foldlx-1.hs: {-# LANGUAGE BangPatterns #-} import qualified Data.ByteString.Lazy.Char8 as B cnt :: B.ByteString - Int cnt bs = B.foldl' (\sum c - if c == ' ' then sum+1 else sum) 0 bs main = do s - B.getContents print (cnt s) As above but now in constant memory space. Nope. == hs/space-x-foldl.hs: {-# LANGUAGE BangPatterns #-} cnt :: String - Int cnt bs = foldl (\sum c - if c == ' ' then sum+1 else sum) 0 bs main = do s - getContents print (cnt s) Oh, no! not foldl that's a killer. You think it's worse than the program just above? Ok, so best way to summarise I think is to organise by data type since I think that'll dominate. So I think the lazy bytestring versions will be fastest due to having the best memory access patterns and doing the least copying. I think the foldl's will be faster than the explicit accumulators due to having fewer bounds checks. space-bslc8-foldlx-1
Re: [Haskell-cafe] [RFC] benchmarks of bytestrings, teaser
On Sat, 2007-12-15 at 11:59 -0800, Don Stewart wrote: firefly: What do you think the relative speeds are of the six small haskell programs at the end of this email? All they do is read from stdin and count the number of spaces they see. There are two that use strict bytestrings, two that use lazy bytestrings, and two that use the standard Haskell strings. Three use a recursive function with an accumulator parameter and three use a foldl with a lambda function. Say the fastest one takes the time 1. How much time will the others take? And how about memory? How much memory do you think they require? Let's say we feed a 150MB(*) file into each of them, how many megabytes do you think they end up using (as seen from the OS, not in terms of how big the live heap is)? I'm going to post full benchmarks + analysis on Wednesday. How are you compiling these programs, by the way? ghc-6.8.2 -O2 ? (-O2 is required for bytestrings :) With -O2. I have measured with 6.8.1, 6.9.20071119, 6.9.20071208 (approx), and 6.9.200712xx (as of yesterday or today). The picture changes very little with the compiler, if it changes at all. I have run them on three very different microarchitectures (2GHz Athlon64 3000+, 1667MHz Core Duo, 600MHz Pentium III). All the measurements are scripted. 'make phase1' compiles the benchmarks, creates input files of various sizes, and runs each benchmark once to gather information about memory use (peak RSS + ghc RTS' own information about allocations and gcs), page faults, and an strace. This phase is not timing sensitive so I can browse the web and listen to the music etc. while running it. 'make phase2' runs each benchmark a number of times, calculates the average time for each + the relative size of the standard deviation + how much user+sys is different from real (as reported by bash' built-in time command). A report with barcharts indicating relative time and relative peak RSS is generated in either pure ASCII or in UTF-8 (with fractional-width block chars so the charts look nice and have high resolution). If the measurements are deemed to be bad (too high standard deviation or user+sys doesn't add up to real) then the barchart is done with '%' characters. The quality indicators for each timing test are always printed out next to each bar, so we know how close we are to being perfect or, conversely, how bad the measurements are. There is a script that analyzes the I/O pattern and sums it up (as 4375 x read(3, 4096, ...) = 4096 followed by 1 x read(3, 4096, ...) = 1242 followed by 1 x read(3, 4096, ...) = 0 an similar). There are a set of simple I/O programs in C so we can compare ghc's performance with speed of light, and so different I/O strategies can be compared in a cleaner, purer form. There are also 'make cache' (runs cachegrind), 'make hs/.doc' (creates a file with source code + core + stg + c-- + assembly + times for a given benchmark), etc. 'make sysinfo' creates a file with information about the Linux distribution used (/etc/lsb-release), kernel version (uname -a), CPU (/proc/cpuinfo), and compilers used (ghc --version, gcc --version). 'make zipdata' creates a zip file of about 20K with all the raw time measurements + the sysinfo. I also have a set of scripts that installs each ghc version so 1) it is easy for me to repeat the tests on various machines and 2) you can see exactly which ghc versions I use and repeat the experiments yourself. -Peter ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] [RFC] benchmarks of bytestrings, teaser
On Sun, 2007-12-16 at 04:53 +0100, Daniel Fischer wrote: Hmm. Lazy accumulator eh, on String? Should exhibit a space leak. Doesn't (with -O2, at least), seems ghc's strictness analyser did a good job. It is indeed about 10* slower than ByteStrings, but very memory friendly - Daniel is right, there's no space leak. Try it. You'll get a nice surprise :) -Peter ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] [RFC] benchmarks of bytestrings, teaser
On Sat, 2007-12-15 at 21:18 -0800, Don Stewart wrote: What is quite amazing is how efficient this program is. Yep. I was very surprised too. Turns out there *was* a reason to run those string benchmarks, eh? ;) improves things by a good amount, but I think we can revisit the low level performance of lazy bytestrings again, in light of all the changes to the optimiser in the past 2 years. Yep. -Peter ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] [RFC] benchmarks of bytestrings, teaser
On Sat, 2007-12-15 at 19:07 -0800, Don Stewart wrote: Well, I'm not going to wait till Wednesday for the numbers! But I got you to write down predictions before you ran the tests, which is precisely what I wanted :) Summary, * Program 1 is fast, as expected, but exhbits a bug in the bytestring library's lazy bytestring fusion system. Something in length or filter isn't doing the right job. This code will be replaced by the stream fusion system soon. Good. * Program 2: as expected. strict IO uses O(N) space, and that has performance effects. * Program 3: lazy bytestrings use constant space, but you better avoid redundant bounds checks in the inner loops. Maybe its extra bounds-checking that makes it slow, as you say. It probably is. I must admit that I couldn't follow the core/stg/C--/assembly code at all. * Program 4: strings are silly No they are not. They are the default data structure for text and give the baseline that bytestrings should beat. I find it interesting to see *if* bytestrings beat it and if so, by how much. The vanilla string versions of my tests all use less memory than any of the other versions, but they are a bit slower. And perhaps not as much slower as they should be... * Program 5: as expected. similar to program 2. * Program 6: strict foldl's over lazy bytestrings are good :) fast, and constant space. * Program 7: see program 4. Pretty much as expected then, but with a bug identified in lazy bytestring fusion (I think). Nice little benchmark. Thanks :) There are more... -Peter ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe