Re: [Haskell-cafe] Known Unknowns

2006-02-02 Thread Isaac Gouy


--- Ketil Malde [EMAIL PROTECTED] wrote:

 Isaac Gouy [EMAIL PROTECTED] writes:
 
  Programmer skill and effort really does matter ;-)
 
 Yes, more so, than any inherent language
 disadvantage, perhaps, which
 happens to be the general lesson from the ICFP
 contests as well.  Any
 idea if other languages have seen similar efforts?

FreePascal and Smart Effiel, somewhat - and there have
been excellent individual efforts with Lua and Tcl and
...

imo the Haskell Cafe discussions and wiki have been a
more open and shared learning experience than we
usually see, and maybe some of the success stems from
that collaboration and competition.


__
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around 
http://mail.yahoo.com 
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Known Unknowns

2006-02-02 Thread Chris Kuklewicz
Joel Koerwer wrote:
 Don, that's a great little mini tutorial, exactly what I was hoping for.
 I'm looking forward to learning more tricks.
 
 On an unrelated note, I have an STUArray nbody. I haven't really looked
 closely at the chris+dons version, but I suspect they amount to doing
 the same thing. I get commensurate runtimes at least. But I'll post it
 on the wiki in a while in case there is some optimization I missed.
 
 On an even more unrelated note, I get slower runtimes with -optc-O3 and
 -optc-ffast-math than without. Individually or in tandem. Odd. This is
 ghc 6.4.1 and gcc 4.0.3 on a (Banias) Pentium M.

More architecture benchmarking:

On a powerbook G4, Joel Koerwer's entry on http://haskell.org/hawiki/NbodyEntry
runs faster than dons+chris.

And I edited it to make it smaller, but by hoisting 'size' and 'dt' I also made
it run quite a bit faster.  It now takes 1.7x less time than dons+chris.

Don, could you check the speed on your architecture?

-- 
Chris
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Known Unknowns

2006-02-02 Thread Donald Bruce Stewart
haskell:
 Joel Koerwer wrote:
  Don, that's a great little mini tutorial, exactly what I was hoping for.
  I'm looking forward to learning more tricks.
  
  On an unrelated note, I have an STUArray nbody. I haven't really looked
  closely at the chris+dons version, but I suspect they amount to doing
  the same thing. I get commensurate runtimes at least. But I'll post it
  on the wiki in a while in case there is some optimization I missed.
  
  On an even more unrelated note, I get slower runtimes with -optc-O3 and
  -optc-ffast-math than without. Individually or in tandem. Odd. This is
  ghc 6.4.1 and gcc 4.0.3 on a (Banias) Pentium M.
 
 More architecture benchmarking:
 
 On a powerbook G4, Joel Koerwer's entry on 
 http://haskell.org/hawiki/NbodyEntry
 runs faster than dons+chris.
 
 And I edited it to make it smaller, but by hoisting 'size' and 'dt' I also 
 made
 it run quite a bit faster.  It now takes 1.7x less time than dons+chris.
 
 Don, could you check the speed on your architecture?

Yes! I was missing the -funbox-strict-fields, it makes a huge
difference.  Check the wiki page, these stuarrays are the fastest yet.

Cheers,
  Don
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Known Unknowns

2006-02-02 Thread Joel Koerwer
Hey this is great. Chris your improvements are awesome. I mean the
speed is nice, but you really cleaned up the code.

There's an extraneous call to energy in the second runST block, but it
should be insignificant. Also, -fglasgow-exts is necessary for the
left-hand-side type declarations of size and dt.

One question, in:

calcMomentum (i+1) $! (px+vx*m,py+vy*m,pz+vz*m)

that $! doesn't actually do much for a tuple, does it? Of course,
there's not much point in further optimizing the initialization
routine, as we'd never be able to detect the difference in runtime.

The shootout has been a great learning tool for me :-)

Thanks to Chris, Don, and the rest of the Haskell community.

Joel
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Known Unknowns

2006-02-01 Thread Chris Kuklewicz
Donald Bruce Stewart wrote:

 This entry in fact runs faster than the original (though not the new 
 vectorised
 entry) optimised C entry (and faster than all other languages):
 
 http://shootout.alioth.debian.org/gp4/benchmark.php?test=partialsumslang=all
 
 So, by carefully tweaking things, we first squished a space leak, and then 
 gained another
 30%.
 
 In summary:
   * Check the Core that is generated
   * Watch out for optimisations that are missed
   * Read the generated C for the tight loops.
   * Make sure tight loops are unboxed
   * Use -fexcess-precision and -optc-ffast-math for doubles
 
 This is roughly the process I used for the other shootout entries.
 
 Cheers,
   Don
 

I just looked hard at the new vectorised entry and the original entry for C.
In both, the last two functions, which use the alt-ernating sign, are *not* done
in the required naive fashion:

   sum = 0.0;
   for (k = 1; k = n-1; k += 2) sum += 1.0/kd;
   for (k = 2; k = n; k += 2) sum -= 1.0/kd;
   printf(%.9f\tAlternating Harmonic\n, sum);
 
   sum = 0.0;
   for (k = 1; k = 2*n-1; k += 4) sum += 1.0/kd;
   for (k = 3; k = 2*n; k += 4) sum -= 1.0/kd;
   printf(%.9f\tGregory\n, sum);

As you can see, all the positive terms are added to sum, then all the negative
terms.  The double precision math comes to a different result, but this is
hidden by printing only 9 digits.

I just modified the g++ entry and the Haskell entry which do it right and the c
entry to print more digits (e.g. show sum in Haskell).

The Haskell entry and g++ entry agree, as expected.  The c entry does *not* 
agree:

AltHarm 0.6931469805600938 Haskell
0.6931469805600938283163259256980381906032562255859375... g++
0.69314698056038037687898167860112152993679046630859375000... gcc
Gregory 0.7853980633974356 Haskell
0.785398063397435564070292457472532987594604492187500... g++
0.7853980633973864922126040255534462630748748779296875000... gcc

The gcc entry is computing a different answer since it uses the wrong order for
making the partial sum.

-- 
Chris

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Known Unknowns

2006-02-01 Thread Chris Kuklewicz
Bulat Ziganshin wrote:
 Hello Donald,
 
 Wednesday, February 01, 2006, 8:00:04 AM, you wrote:
 DBS Here's a brief introduction. I intend to write up (on the performance 
 page on
 DBS the wiki) a list of things we've done to improve the shootout entries. 
 N.B
 DBS we're now the 3rd *fastest* language, behind C and only a little behind 
 D (a C
 DBS varient) !!
 
 3rd fastest or 3rd overall, counting program lines and so on?
 
 

That's the unconceivable thing.  It is 3rd fastest.

Looking at just Full CPU Time:

C gcc   35.90   3
D Digital Mars  32.72   3
Haskell GHC 30.25   0
SML MLton   28.72   3
OCaml   27.92   1
Eiffel Smart26.17   6
C++ g++ 25.73   3
Nice24.43   4
Ada 95 GNAT 23.45   4
Clean   23.32   7
Java JDK 1.4 -server22.69   5
Java JDK -server22.39   5
Java JDK -client19.19   5
C# Mono 16.99   2

Only C gcc and D Digital Mars are ahead.

Looking at Just Memory Use, Haskell is 8th

languagescore   missing
C gcc   39.00   3
D Digital Mars  29.21   3
Forth GForth28.63   2
Ada 95 GNAT 27.12   4
Pascal Free 26.53   7
Eiffel Smart24.53   6
C++ g++ 24.46   3
Haskell GHC 24.28   0
OCaml   21.55   1
Fortran G95 20.21   6
Lua 19.63   2
SML MLton   17.80   3

Looking at Just Lines Of Code, Haskell is 1st by a mile:

Haskell GHC 41.84   0
SML MLton   34.47   3
Forth GForth32.50   2
OCaml   30.86   1
Tcl 30.83   3
Python Psyco30.49   0
Python  30.33   1
Lua 29.17   2
Ruby27.69   4
Perl25.50   5
Nice25.09   4
C# Mono 24.59   2
D Digital Mars  22.79   3
C++ g++ 22.60   3
Java JDK -client21.12   5
Java JDK -Xint  21.12   5
Java JDK 1.4 -server21.12   5
Java JDK -server21.12   5
C gcc   20.98   3


Where I had to include lots of languages to get down to C gcc.

Lookat at the 1:1:1 even balance of the above three, Haskell is 1st:

Haskell GHC 96.37   0
C gcc   95.87   3
D Digital Mars  84.72   3
SML MLton   81.00   3
OCaml   80.33   1
Forth GForth75.17   2
C++ g++ 72.79   3

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Known Unknowns

2006-02-01 Thread Isaac Gouy
--- Chris Kuklewicz [EMAIL PROTECTED]
wrote:
-snip,snip-
 It is 3rd fastest.
 Looking at Just Memory Use, Haskell is 8th
 Looking at Just Lines Of Code, Haskell is 1st 
 Lookat at the 1:1:1 even balance Haskell is 1st

Programmer skill and effort really does matter ;-)

Congratulations.

__
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around 
http://mail.yahoo.com 
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Known Unknowns

2006-01-31 Thread Donald Bruce Stewart
joelkoerwer:
 
Thanks Chris. I was actually asking about analyzing Core
output in general. I'm well aware of the problems we're
having with the nbody entry.
I'm convinced my list based version can go faster than it is
now. That's why I was asking if Don could put together a few
notes on how to optimize inner loops using -ddump-simpl and
the resulting Core code.

Here's a brief introduction. I intend to write up (on the performance page on
the wiki) a list of things we've done to improve the shootout entries. N.B
we're now the 3rd *fastest* language, behind C and only a little behind D (a C
varient) !!

Consider the partial sums problem: 
wiki: http://www.haskell.org/hawiki/PartialSumsEntry
site: 
http://shootout.alioth.debian.org/gp4/benchmark.php?test=partialsumslang=ghcid=2

What follows is a discussion of the steps I took to improve the
performance of this code.

Here's the naive translation of the Clean entry (which was fairly quick):
Lots of math in a tight loop.
  
 import System; import Numeric
 
 main = do n - getArgs = readIO . head
   let sums = loop 1 n 1 0 0 0 0 0 0 0 0 0
   fn (s,t) = putStrLn $ (showFFloat (Just 9) s []) ++ \t ++ t
   mapM_ (fn :: (Double, String) - IO ()) (zip sums names)
 
 names = [(2/3)^k, k^-0.5, 1/k(k+1), Flint Hills, Cookson Hills
 , Harmonic, Riemann Zeta, Alternating Harmonic, Gregory]
 
 loop k n alt a1 a2 a3 a4 a5 a6 a7 a8 a9
 | k  n = [ a1, a2, a3, a4, a5, a6, a7, a8, a9 ]
 | otherwise = loop (k+1) n (-alt)
(a1 + (2/3) ** (k-1))
(a2 + k ** (-0.5))
(a3 + 1 / (k * (k + 1)))
(a4 + 1 / (k*k*k * sin k * sin k))
(a5 + 1 / (k*k*k * cos k * cos k))
(a6 + 1 / k)
(a7 + 1 / (k*k))
(a8 + alt / k)
(a9 + alt / (2 * k - 1))

Compiled with -O2. However, the performance is _really_ bad :/ Somewhere
greater than 128M heap, in fact eventually running out of memory on my laptop.
A classic space leak.

(2) So look at the generated core. ghc -o naive Naive.hs -O2  -ddump-simpl | 
less
And we find that our loop has the following type:

 $sloop_r2U6 :: GHC.Prim.Double#
- GHC.Float.Double
- GHC.Float.Double
- GHC.Float.Double
- GHC.Float.Double
- GHC.Float.Double
- GHC.Float.Double
- GHC.Float.Double
- GHC.Float.Double
- GHC.Float.Double
- GHC.Float.Double
- GHC.Prim.Double#
- [GHC.Float.Double]

Hmm. Ok, I certainly don't want boxed doubles in such a tight loop.

(3) My next step is to encourage GHC to unbox this loop, by providing some
strictness annotations. Now the loop looks like this:

 loop k n alt a1 a2 a3 a4 a5 a6 a7 a8 a9
 | () !k !n !False = undefined
 | k  n = [ a1, a2, a3, a4, a5, a6, a7, a8, a9 ]
 | otherwise = loop (k+1) n (-alt)
(a1 + (2/3) ** (k-1))
(a2 + k ** (-0.5))
(a3 + 1 / (k * (k + 1)))
(a4 + 1 / (k*k*k * sin k * sin k))
(a5 + 1 / (k*k*k * cos k * cos k))
(a6 + 1 / k)
(a7 + 1 / (k*k))
(a8 + alt / k)
(a9 + alt / (2 * k - 1)) where x ! y = x `seq` y

I've played a little game here, using ! for `seq`, reminiscent of the new
!-pattern proposal for strictness. Let's see how this compiles. Here's the Core:

 $sloop_r2Vh :: GHC.Prim.Double#
- GHC.Float.Double
- GHC.Float.Double
- GHC.Float.Double
- GHC.Float.Double
- GHC.Float.Double
- GHC.Float.Double
- GHC.Float.Double
- GHC.Float.Double
- GHC.Float.Double
- GHC.Float.Double
- GHC.Prim.Double#
- [GHC.Float.Double]

Ok, so it unboxed one extra argument. Let's see if we can get them all unboxed.
Strictify all args, and GHC produces an inner loop of:

 $sloop_r2WS :: GHC.Prim.Double#
- GHC.Prim.Double#
- GHC.Prim.Double#
- GHC.Prim.Double#
- GHC.Prim.Double#
- GHC.Prim.Double#
- GHC.Prim.Double#
- GHC.Prim.Double#
- GHC.Prim.Double#
- GHC.Prim.Double#
- GHC.Prim.Double#
- GHC.Prim.Double#
- [GHC.Float.Double]

Ah! perfect. Let's see how that runs:
$ ghc Naive.hs -O2 -no-recomp   
$ time ./a.out 250
3.0 (2/3)^k
3160.817621887  

Re: [Haskell-cafe] Known Unknowns

2006-01-28 Thread Joel Koerwer
Thanks Chris. I was actually asking about analyzing Core output in general. I'm well aware of the problems we're having with the nbody entry.I'm convinced my list based version can go faster than it is now. That's why I was asking if Don could put together a few notes on how to optimize inner loops using -ddump-simpl and the resulting Core code.
So, I guess my request is along the same lines as your earlier one: Also, could you explain how to check the Core (un)boxing in a note on the (new?) wiki? I would be interested in learning that trick.
I would love to have a tutorial about common situations one comes across when looking at the Core output.
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Known Unknowns

2006-01-27 Thread Chris Kuklewicz
Joel Koerwer wrote:
 On 1/26/06, *Donald Bruce Stewart* [EMAIL PROTECTED]
 mailto:[EMAIL PROTECTED] wrote: 
 
 Ah, i just do: ghc A.hs -O2 -ddump-simpl | less
 and then read the Core, keeping an eye on the functions I'm interested
 in, and checking they're compiling to the kind of loops I'd write by
 hand. This is particularly useful for the kinds of tight numeric loops
 used in some of the shootout entries.
 
 Cheers,
   Don
 
 
 
 In that case could you describe the kind of loops you'd write by hand?

See below for the pseudo-code loop and the Haskell version.

 Seriously. And perhaps typical problems/fixes when the compiler doesn't
 produce what you want.

We don't have any fixes.

 
 Thanks,
 Joel

More discussion and code is at http://haskell.org/hawiki/NbodyEntry

The compiler produces code that runs 4 times slower than OCaml in our current
best attempt at programming against a 40 element (IOUArray Int Double).  The
final programs speed is very architecture dependent, but more frustrating is
that small referentially transparent changes to the source code produce up to
factor-of-two fluctuations in run time.

The small numeric functions in the shootout, where there is a recursive function
with 1 or 2 parameters (Double's), perform quite well.  But manipulating this
medium number of Double's to model the solar system has been too slow.

The main loop for the 5 planets looks quite simple in pseudo-c:

deltaTime = 0.01
for (i=0 ; i5; ++i) {
  get mass m, position (x,y,z), velocity (vx,vy,vz) of particle number i

  for (j=(i+1); j5; ++j) {
get mass, position, velocity of particle j

dxyx = position of i - position of j
mag = deltaTime /(length of dxyz)^3

velocity of j += mass of i * mag * dxyz
velocity of i -= mass of j * mag * dxyz
  }

  position of i += deltaTime * velocity of i
}

Note that the inner loop for j starts a j=(i+1).

The best performing Haskell code, for this loop, so far is:

-- Offsets for each field
x = 0; y = 1; z = 2; vx= 3; vy= 4; vz= 5; m = 6
-- This is the main code. Essentially all the time is spent here
advance n = when (n  0) $ updateVel 0  advance (pred n)

  where updateVel i = when (i = nbodies) $ do
let i' = (.|. shift i 3)
im  - unsafeRead b (i' m)
ix  - unsafeRead b (i' x)
iy  - unsafeRead b (i' y)
iz  - unsafeRead b (i' z)
ivx - unsafeRead b (i' vx)
ivy - unsafeRead b (i' vy)
ivz - unsafeRead b (i' vz)

let updateVel' ivx ivy ivz j =  ivx `seq` ivy `seq` ivz `seq`
  if j  nbodies then do
unsafeWrite b (i' vx) ivx
unsafeWrite b (i' vy) ivy
unsafeWrite b (i' vz) ivz
  else do
let j' = (.|. shiftL j 3)
jm - unsafeRead b (j' m)
dx - liftM (ix-) (unsafeRead b (j' x))
dy - liftM (iy-) (unsafeRead b (j' y))
dz - liftM (iz-) (unsafeRead b (j' z))
let distance = sqrt (dx*dx+dy*dy+dz*dz)
mag = 0.01 / (distance * distance * distance)
addScaled3 (3 .|. (shiftL j 3)) ( im*mag) dx dy dz
let a = -jm*mag
ivx' = ivx+a*dx
ivy' = ivy+a*dy
ivz' = ivz+a*dz
updateVel' ivx' ivy' ivz' $! (j+1)

updateVel' ivx ivy ivz $! (i+1)
addScaled (shiftL i 3) 0.01 (3 .|. (shiftL i 3))
updateVel (i+1)

-- Helper functions

addScaled i a j | i `seq` a `seq` j `seq` False = undefined -- stricitfy
addScaled i a j = do set i1 = liftM2 scale (unsafeRead b i1) (unsafeRead b j1)
 set i2 = liftM2 scale (unsafeRead b i2) (unsafeRead b j2)
 set i3 = liftM2 scale (unsafeRead b i3) (unsafeRead b j3)
where scale old new = old + a * new
  i1 = i; i2 = succ i1; i3 = succ i2;
  j1 = j; j2 = succ j1; j3 = succ j2;

addScaled3 i a jx jy jz | i `seq` a `seq` jx `seq` jy `seq` jz `seq` False =
undefined
addScaled3 i a jx jy jz = do set i1 = liftM (scale jx) (unsafeRead b i1)
 set i2 = liftM (scale jy) (unsafeRead b i2)
 set i3 = liftM (scale jz) (unsafeRead b i3)
where scale new old = a * new + old
  i1 = i; i2 = succ i1; i3 = succ i2;

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Known Unknowns

2006-01-26 Thread Chris Kuklewicz
Donald Bruce Stewart wrote:
 haskell:
 There is a new combined benchmark, partial sums that subsumes several 
 earlier
 benchmarks and runs 9 different numerical calculations:

 http://haskell.org/hawiki/PartialSumsEntry
 
 Ah! I had an entry too. I've posted it on the wiki.  I was careful to
 watch that all loops are compiled into nice unboxed ones in the Core. It
 seems to run a little bit faster than your more abstracted code.
 
 Timings on the page.
 
 Also, -fasm seems to only be a benefit on the Mac, as you've pointed out
 previously. Maybe you could check the times on the Mac too?
 
 -- Don
 

Yeah. I had not tried all the compiler options. Using -fasm is slower on this
for me as well.  I suspect that since your code will beat the entries that have
been posted so far, so I thin you should submit it.

Also, could you explain how to check the Core (un)boxing in a note on the (new?)
wiki?  I would be interested in learning that trick.

-- 
Chris
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Known Unknowns

2006-01-26 Thread Donald Bruce Stewart
haskell:
 Donald Bruce Stewart wrote:
  haskell:
  There is a new combined benchmark, partial sums that subsumes several 
  earlier
  benchmarks and runs 9 different numerical calculations:
 
  http://haskell.org/hawiki/PartialSumsEntry
  
  Ah! I had an entry too. I've posted it on the wiki.  I was careful to
  watch that all loops are compiled into nice unboxed ones in the Core. It
  seems to run a little bit faster than your more abstracted code.
  
  Timings on the page.
  
  Also, -fasm seems to only be a benefit on the Mac, as you've pointed out
  previously. Maybe you could check the times on the Mac too?
  
  -- Don
  
 
 Yeah. I had not tried all the compiler options. Using -fasm is slower on this
 for me as well.  I suspect that since your code will beat the entries that 
 have
 been posted so far, so I thin you should submit it.

ok, I'll submit it.
 
 Also, could you explain how to check the Core (un)boxing in a note on the 
 (new?)
 wiki?  I would be interested in learning that trick.

Ah, i just do: ghc A.hs -O2 -ddump-simpl | less
and then read the Core, keeping an eye on the functions I'm interested
in, and checking they're compiling to the kind of loops I'd write by
hand. This is particularly useful for the kinds of tight numeric loops
used in some of the shootout entries.

Cheers,
  Don
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Known Unknowns

2006-01-25 Thread Donald Bruce Stewart
haskell:
 There is a new combined benchmark, partial sums that subsumes several 
 earlier
 benchmarks and runs 9 different numerical calculations:
 
 http://haskell.org/hawiki/PartialSumsEntry

Ah! I had an entry too. I've posted it on the wiki.  I was careful to
watch that all loops are compiled into nice unboxed ones in the Core. It
seems to run a little bit faster than your more abstracted code.

Timings on the page.

Also, -fasm seems to only be a benefit on the Mac, as you've pointed out
previously. Maybe you could check the times on the Mac too?

-- Don
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe