Re: Low-level array performance

2008-06-18 Thread Daniel Fischer
Am Dienstag, 17. Juni 2008 22:37 schrieb Dan Doel:
 I'll attach new, hopefully bug-free versions of the benchmark to this
 message.

With -O2 -fvia-C -optc-O3, the difference is small (less than 1%), but today, 
ByteArr is faster more often.


 Of course, without the list overhead, the ByteArr appears to allocate much
 more than Ptr for large arrays, because the n*w byte array shows up in the
 heap allocation, whereas the malloced memory does not. None of this should
 be a factor in the actual fannkuch benchmark, of course, which only
 allocates 3 arrays of size 11.

 Cheers,
 -- Dan

Cheers,
Daniel



Re: Low-level array performance

2008-06-18 Thread Dan Doel
On Wednesday 18 June 2008, Daniel Fischer wrote:
 Am Dienstag, 17. Juni 2008 22:37 schrieb Dan Doel:
  I'll attach new, hopefully bug-free versions of the benchmark to this
  message.

 With -O2 -fvia-C -optc-O3, the difference is small (less than 1%), but
 today, ByteArr is faster more often.

Hmm, well, I'm a bit flummoxed. I still get Addr# outperforming MBA# by 
perhaps 10% - 15%, even with -fvia-C -optc-O3 (and before the slight speedup 
below). Perhaps gcc's optimizer isn't doing as good a job for me for some 
reason.

In any case, I've entered a bug for this on the GHC trac:

  http://hackage.haskell.org/trac/ghc/ticket/2374

It contains a Ptr benchmark that performs slightly faster on very small arrays 
(under, say, 40 elements). I had noticed that such runs were taking more time 
than runs with larger arrays and correspondingly fewer iterations, so I 
eliminated the replicateM_ in favor of an explicit loop, sketched below. That 
gains a little time on the small arrays, but not enough to match the 
performance on the larger arrays, so I guess there are yet more factors at 
work. :) In any case, it makes the Ptr version closer to being the same code 
as ByteArr.
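As a rough illustration of that change (times is a hypothetical helper name; 
this is not the code attached to the ticket), it amounts to replacing 
replicateM_ k act with a hand-written counting loop:

times :: Int -> IO () -> IO ()
times k act = go k
 where
 go 0 = return ()
 go i = act >> go (i - 1)

so that the iteration count is consumed by a tight local loop rather than 
being threaded through a monadic combinator.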

The bug is filed against the native code generator, since it shows up more 
clearly there. I haven't gotten to looking at C-- or assembly yet, but 
hopefully I will in the near future. I'll try to do further followup on the 
bug report, since that's probably easier for the developers to keep track of.

Cheers,
-- Dan


Re: Low-level array performance

2008-06-17 Thread Simon Marlow

Dan Doel wrote:


Issue 2: Reading from/writing to a MutableByteArray# is slower than an Addr#

This is, I think, the crux of the issue. The main content of the benchmark is 
reversing/shifting items in an array. To get a somewhat easier look at the 
core, I boiled things down to a benchmark that just reverses a small array 
many times. In the interest of further reducing things, I wrote a version of 
the benchmark that uses raw Addr#s, and a version that uses raw 
MutableByteArray#s. I've attached both versions.


So I tried your examples and the Addr# version looks slower than the MBA# 
version:


$ ./Ptr 100 1000000 +RTS -sstderr
Done.
 48,196,560 bytes allocated in the heap
 27,381,764 bytes copied during GC (scavenged)
 18,260,784 bytes copied during GC (not scavenged)
 14,389,248 bytes maximum residency (5 sample(s))

 92 collections in generation 0 (  0.09s)
  5 collections in generation 1 (  0.13s)

 28 Mb total memory in use

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    0.68s  (  0.69s elapsed)
  GC    time    0.22s  (  0.28s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    0.90s  (  0.97s elapsed)


$ ./ByteArr 100 1000000 +RTS -sstderr
Done.
  4,042,700 bytes allocated in the heap
  1,272 bytes copied during GC (scavenged)
  0 bytes copied during GC (not scavenged)
 16,384 bytes maximum residency (1 sample(s))

  2 collections in generation 0 (  0.00s)
  1 collections in generation 1 (  0.00s)

  5 Mb total memory in use

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    0.53s  (  0.54s elapsed)
  GC    time    0.00s  (  0.00s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    0.53s  (  0.54s elapsed)

I tried with 6.8.2 and 6.8.3, using -O2 in both cases.  I tried the Ptr 
version with and without -fvia-C -optc-O2, no difference.


Are these exactly the same programs you measured?  What parameters did you use?

Cheers,
Simon


Re: Low-level array performance

2008-06-17 Thread Dan Doel
On Tuesday 17 June 2008, Simon Marlow wrote:
 So I tried your examples and the Addr# version looks slower than the MBA#
 version:

Hmm...

 I tried with 6.8.2 and 6.8.3, using -O2 in both cases.  I tried the Ptr
 version with and without -fvia-C -optc-O2, no difference.

I had forgotten about the via-c in the pragma when I sent it, but I've tested 
it both via-c and with the new backend (and triple checked since your 
message), and I always come away with the Ptr version being faster. -fvia-c 
doesn't seem to affect the speed of the Addr# version much, while it improves 
the speed of the MBA# version. However, even with the improved speed, Addr# 
seems to edge it out here.

With the new backend, I get the results I sent in my initial mail. The 
ByteArray version takes 11 - 12 seconds to reverse a size 10 array 250 
million times, whereas the Addr# version takes around 7 seconds.

(I also noticed a bug I'd missed before sending the ByteArray version. It 
should allocate based on w, but I left it hard coded to 4# when I was 
experimenting. This was causing segmentation faults on large arrays on my 
machine, since I'm running in 64-bit mode, and 8# is the correct value here. 
Are you running in 32-bit, and if so, could that be the source of our 
discrepancy?)
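
For reference, a minimal sketch of that fix (assuming the same imports and 
extensions as the attached ByteArr source; newIntArray is just an illustrative 
name):

newIntArray :: Int# -> State# s -> (# State# s, MutableByteArray# s #)
newIntArray n s = case sizeOf (0 :: Int) of
                    I# w -> newByteArray# (n *# w) s

This allocates n machine words whatever the platform word size is, instead of 
hard-coding 4 bytes per element.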

 Are these exactly the same programs you measured?  What parameters did you
 use?

Aside from the couple of oversights above, yes. The actual fannkuch benchmark 
doesn't use very large arrays. The current test input is n = 11, and all the 
arrays it uses are length n. It gets its work from copying, reversing and 
shifting (portions of) those arrays n! or more times. So, I thought it'd be 
truer to the benchmark to reverse a small array many times. I've been running 
with command lines like './ByteArr 250000000 10', which says to reverse a 
size-10 array 250 million times.

I tested with other sizes, and things seem to stay about the same increasing 
the array size and decreasing the iterations by the same factor, until I got 
to an array size of around 100,000, at which point there's a drop off for 
both (Addr# still being faster). I assume that's due to cache effects.

Here are some example runs, using '--make -O2' for both (OPTIONS pragma changed 
to only have -fglasgow-exts for both, and the w bug fixed).

./ByteArr 250000000 10 +RTS -sstderr
Done.
 56,824 bytes allocated in the heap
552 bytes copied during GC (scavenged)
  0 bytes copied during GC (not scavenged)
 45,056 bytes maximum residency (1 sample(s))

  1 collections in generation 0 (  0.00s)
  1 collections in generation 1 (  0.00s)

  1 Mb total memory in use

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time   10.35s  ( 11.15s elapsed)
  GC    time    0.00s  (  0.00s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time   10.36s  ( 11.15s elapsed)

  %GC time   0.0%  (0.0% elapsed)

  Alloc rate    5,486 bytes per MUT second

  Productivity 100.0% of total user, 92.9% of total elapsed

./Ptr 250000000 10 +RTS -sstderr
Done.
 57,840 bytes allocated in the heap
552 bytes copied during GC (scavenged)
  0 bytes copied during GC (not scavenged)
 45,056 bytes maximum residency (1 sample(s))

  1 collections in generation 0 (  0.00s)
  1 collections in generation 1 (  0.00s)

  1 Mb total memory in use

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    6.53s  (  7.05s elapsed)
  GC    time    0.00s  (  0.00s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    6.53s  (  7.05s elapsed)

  %GC time   0.0%  (0.0% elapsed)

  Alloc rate    8,854 bytes per MUT second

  Productivity 100.0% of total user, 92.7% of total elapsed

As I mentioned before, using -fvia-c -optc-O2 leaves Ptr unchanged, and speeds 
up ByteArr, but not enough to catch up with Ptr (here, at least).

Anyhow, my apologies for the mistakes above, and thanks for your time and 
assistance. I'll try puzzling over the C-- some and probably open a trac 
ticket a bit later as the other Simon suggested (if that's still 
appropriate).

Thanks again,
-- Dan


Re: Low-level array performance

2008-06-17 Thread Daniel Fischer
Am Dienstag, 17. Juni 2008 18:32 schrieb Dan Doel:
 On Tuesday 17 June 2008, Simon Marlow wrote:
  So I tried your examples and the Addr# version looks slower than the MBA#
  version:

 Hmm...

  I tried with 6.8.2 and 6.8.3, using -O2 in both cases.  I tried the Ptr
  version with and without -fvia-C -optc-O2, no difference.

 I had forgotten about the via-c in the pragma when I sent it, but I've
 tested it both via-c and with the new backend (and triple checked since
 your message), and I always come away with the Ptr version being faster.
 -fvia-c doesn't seem to affect the speed of the Addr# version much, while
 it improves the speed of the MBA# version. However, even with the improved
 speed, Addr# seems to edge it out here.

 With the new backend, I get the results I sent in my initial mail. The
 ByteArray version takes 11 - 12 seconds to reverse a size 10 array 250
 million times, whereas the Addr# version takes around 7 seconds.


I've experimented a bit and found that Ptr is faster for small arrays (only 
very slightly so if compiled with -fvia-C -optc-O3), but ByteArr performs 
much better for larger arrays:

[EMAIL PROTECTED]:~/Documents/haskell/move ./PtrC +RTS -sstderr -RTS 20 10000000
./PtrC 20 10000000 +RTS -sstderr
Done.
481,596,836 bytes allocated in the heap
257,665,360 bytes copied during GC (scavenged)
171,919,440 bytes copied during GC (not scavenged)
117,149,696 bytes maximum residency (8 sample(s))

919 collections in generation 0 (  3.44s)
  8 collections in generation 1 ( 24.99s)

226 Mb total memory in use

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    8.16s  (  9.06s elapsed)
  GC    time   28.43s  ( 30.11s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time   36.59s  ( 39.16s elapsed)

  %GC time  77.7%  (76.9% elapsed)

  Alloc rate    59,019,220 bytes per MUT second

  Productivity  22.3% of total user, 20.8% of total elapsed

[EMAIL PROTECTED]:~/Documents/haskell/move ./ByteArrC +RTS -sstderr -RTS 20 10000000
./ByteArrC 20 10000000 +RTS -sstderr
Done.
 40,041,976 bytes allocated in the heap
  1,272 bytes copied during GC (scavenged)
  0 bytes copied during GC (not scavenged)
 16,384 bytes maximum residency (1 sample(s))

  2 collections in generation 0 (  0.00s)
  1 collections in generation 1 (  0.00s)

 40 Mb total memory in use

  INIT  time    0.00s  (  0.02s elapsed)
  MUT   time    5.03s  (  5.32s elapsed)
  GC    time    0.00s  (  0.01s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    5.03s  (  5.35s elapsed)

  %GC time   0.0%  (0.3% elapsed)

  Alloc rate    7,960,631 bytes per MUT second

  Productivity 100.0% of total user, 94.0% of total elapsed

Using GHC 6.8.2

The GC time for the Addr# version is frightening.


Re: Low-level array performance

2008-06-17 Thread Dan Doel
On Tuesday 17 June 2008, Daniel Fischer wrote:
 I've experimented a bit and found that Ptr is faster for small arrays (only
 very slightly so if compiled with -fvia-C -optc-O3), but ByteArr performs
 much better for larger arrays
 ...
 The GC time for the Addr# version is frightening

I had an entire e-mail written about what a bizarre and interesting result 
you'd just found, but unfortunately, I then remembered exactly how the array 
gets filled in the Ptr version. Namely:

(Ptr a) <- newArray [0..n-1]

Which, I assume does something terrible, like calling length to get the size 
needed for the array, while also needing the values after array creation to 
feed into it. For small arrays like the ones I'd been testing with, this 
doesn't matter, because the work done on the list is negligible. However, 
when you get to very large lists (100,000 elements and above, apparently) 
this starts causing massive space leaks, which explains the terrible GC 
behavior we were seeing.
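
For illustration, newArray from Foreign.Marshal.Array amounts to roughly the 
following (a from-memory sketch, not the exact library source; newArraySketch 
is my name for it):

newArraySketch :: Storable a => [a] -> IO (Ptr a)
newArraySketch vals = do
    ptr <- mallocArray (length vals)   -- first pass over the list: its length
    pokeArray ptr vals                 -- second pass: write the elements
    return ptr

so the whole list has to be built and kept alive across two traversals.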

If you change the benchmark like so:

bench :: Int -> Int -> IO ()
bench k n = do (Ptr a) <- mallocArray n :: IO (Ptr Int)
               fill a n
               replicateM_ k (reverse a 0 (n-1))

fill :: Addr# -> Int -> IO ()
fill a (I# n) = IO (go 0#)
 where
 go i s
   | i <# n    = case writeIntOffAddr# a i i s of s -> go (i +# 1#) s
   | otherwise = (# s, () #)

The space leak goes away, and the runtimes stay consistent. Up to around 
10,000 elements, Ptr hovers around 6s, and ByteArray (-fasm) stays around 11. 
At 100,000, Ptr jumps to 12-13s, and ByteArray goes to 14-16 and stays there 
(again, I imagine due to running into bad cache effects at that level). This 
is all for size * iterations = 2.5 billion on my machine.

A good catch anyhow, though. That could explain why Simon Marlow was seeing 
the Addr# version as slower, since he was using a large array, and thus work 
done on the list could have contributed significantly (although MUT time was 
higher with Ptr, so it would have had to contribute work there, not just GC 
thrashing).

Thanks for the input,
-- Dan


Re: Low-level array performance

2008-06-17 Thread haskell

I see that Dan Doel's post favoring Ptr/Addr# has the same allocation amounts 
(from +RTS -sstderr) for Ptr/Addr# and the MutableByteArray# versions.


Everyone else sees more allocation for Ptr/Addr# than for MBA#, and sees MBA# 
as faster in these cases.


I myself (on a G4) see more allocation [just like Simon Marlow] for Ptr/Addr# and 
find it slower.  If I boost the initial memory with -A100m then Ptr still 
allocates more, but the timing difference becomes quite small:


The Ptr/Addr# code now runs in:

pamac-cek10:tmp chrisk$ time ./addr 100 1000000 +RTS -sstderr -A100m
./a 100 1000000 +RTS -sstderr -A100m
Done.
 48,182,068 bytes allocated in the heap
276 bytes copied during GC (scavenged)
  0 bytes copied during GC (not scavenged)
 20,480 bytes maximum residency (1 sample(s))

  1 collections in generation 0 (  0.00s)
  1 collections in generation 1 (  0.00s)

 97 Mb total memory in use

  INIT  time    0.00s  (  0.01s elapsed)
  MUT   time    1.54s  (  2.43s elapsed)
  GC    time    0.00s  (  0.00s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    1.55s  (  2.44s elapsed)

  %GC time   0.2%  (0.1% elapsed)

  Alloc rate    31,205,254 bytes per MUT second

  Productivity  99.6% of total user, 63.1% of total elapsed


real    0m2.728s
user    0m1.548s
sys     0m0.207s

And the MutableByteArray# code now runs in:

pamac-cek10:tmp chrisk$ time ./mba 100 1000000 +RTS -sstderr -A100m
./m 100 1000000 +RTS -sstderr -A100m
Done.
  4,023,784 bytes allocated in the heap
276 bytes copied during GC (scavenged)
  0 bytes copied during GC (not scavenged)
 20,480 bytes maximum residency (1 sample(s))

  1 collections in generation 0 (  0.00s)
  1 collections in generation 1 (  0.00s)

101 Mb total memory in use

  INIT  time    0.00s  (  0.01s elapsed)
  MUT   time    1.50s  (  2.30s elapsed)
  GC    time    0.00s  (  0.00s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    1.51s  (  2.32s elapsed)

  %GC time   0.3%  (0.2% elapsed)

  Alloc rate    2,668,201 bytes per MUT second

  Productivity  99.6% of total user, 65.0% of total elapsed


real    0m2.335s
user    0m1.513s
sys     0m0.049s




Re: Low-level array performance

2008-06-17 Thread Daniel Fischer
Am Dienstag, 17. Juni 2008 20:35 schrieb Dan Doel:
 On Tuesday 17 June 2008, Daniel Fischer wrote:
  I've experimented a bit and found that Ptr is faster for small arrays
  (only very slightly so if compiled with -fvia-C -optc-O3), but ByteArr
  performs much better for larger arrays
  ...
  The GC time for the Addr# version is frightening

 I had an entire e-mail written about what a bizarre and interesting result
 you'd just found, but unfortunately, I then remembered exactly how the
 array gets filled in the Ptr version. Namely:

 (Ptr a) <- newArray [0..n-1]

Ouch. I should've looked at both sources, that would have been obvious then :)


 Which, I assume does something terrible, like calling length to get the
 size needed for the array, while also needing the values after array
 creation to feed into it. For small arrays like the ones I'd been testing
 with, this doesn't matter, because the work done on the list is negligible.
 However, when you get to very large lists (100,000 elements and above,
 apparently) this starts causing massive space leaks, which explains the
 terrible GC behavior we were seeing.

 If you change the benchmark like so:

 bench :: Int -> Int -> IO ()
 bench k n = do (Ptr a) <- mallocArray n :: IO (Ptr Int)
                fill a n
                replicateM_ k (reverse a 0 (n-1))

 fill :: Addr# -> Int -> IO ()
 fill a (I# n) = IO (go 0#)
  where
  go i s
    | i <# n    = case writeIntOffAddr# a i i s of s -> go (i +# 1#) s
    | otherwise = (# s, () #)

 The space leak goes away, and the runtimes stay consistent. Up to around
 10,000 elements, Ptr hovers around 6s, and ByteArray (-fasm) stays around
 11. At 100,000, Ptr jumps to 12-13s, and ByteArray goes to 14-16 and stays
 there (again, I imagine due to running into bad cache effects at that
 level). This is all for size * iterations = 2.5 billion on my machine.

Since my computer is slower, I confined my tests to size*iterations ~= 10^9.
Mostly, I find no noticeable difference (between 0.2 and 0.5%), sometimes one 
faster, sometimes the other. I have the impression that the Ptr version is 
faster more often than the ByteArr version, but the tendency isn't very 
strong.
That applies to both compiled with -O2 -fvia-C -optc-O3.
Compiling with -O2 -fasm doesn't make a noticeable difference for Ptr, but 
ByteArr is about 13% slower when the arrays are large (too large for the 
cache, I suppose) and about 50% slower for small arrays.

So what I read off from that is that the native code generator still has to 
catch up for such code, not that either way of implementing arrays is 
inherently faster.


 A good catch anyhow, though. That could explain why Simon Marlow was seeing
 the Addr# version as slower, since he was using a large array, and thus
 work done on the list could have contributed significantly (although MUT
 time was higher with Ptr, so it would have had to contribute work there,
 not just GC thrashing).

 Thanks for the input,
 -- Dan

Cheers,
Daniel



Re: Low-level array performance

2008-06-17 Thread Dan Doel
On Tuesday 17 June 2008, [EMAIL PROTECTED] wrote:
 I see that Dan Doel's post favoring Ptr/Addr#
 has the same allocation amounts (from +RTS -sstderr) for Ptr/Addr# and the
 MutableByteArray#

 Everyone else sees more allocation for Ptr/Addr# than MBA# and see MBA# as
 faster in these cases.

 I myself (on G4) see more allocation [just like Simon Marlow] for Ptr/Addr#
 and find it slower.  If I boost the initial memory with -A 100m then Ptr
 still allocated more, but the timing difference becomes quite small:

Pardon my noise, but is this still with the version of Ptr.hs that would (in 
your case) allocate a 1 million element list and traverse it twice, or the 
revision that fills the array in a loop with an Int#?

If it's the former, and Addr# is tying MutableByteArray# even with operations 
on a 40-some megabyte list (if the allocation is any indication), then the 
actual Addr# operations are probably faster for you, too. :)

I'll attach new, hopefully bug-free versions of the benchmark to this message.

Of course, without the list overhead, the ByteArr appears to allocate much 
more than Ptr for large arrays, because the n*w byte array shows up in the 
heap allocation, whereas the malloced memory does not. None of this should be 
a factor in the actual fannkuch benchmark, of course, which only allocates 3 
arrays of size 11.
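
To make that concrete, here is a minimal sketch (the helper names are mine) of 
the two allocation paths the attached benchmarks use; only the first one is 
counted in the RTS heap statistics:

-- ByteArr: the n*w bytes live on the GHC heap and show up under
-- "bytes allocated in the heap" in +RTS -sstderr.
newHeapWords :: Int# -> Int# -> State# s -> (# State# s, MutableByteArray# s #)
newHeapWords n w = newByteArray# (n *# w)

-- Ptr: mallocArray goes through C malloc, which the RTS does not track.
newMallocWords :: Int -> IO (Ptr Int)
newMallocWords n = mallocArray n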

Cheers,
-- Dan
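-- Attachment 1: the MutableByteArray# version (the file referred to as ByteArr in this thread).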
{-# OPTIONS_GHC -fglasgow-exts #-}

module Main (main) where

import Prelude hiding (reverse)

import Control.Monad.ST

import GHC.ST
import GHC.Base
import GHC.Prim

import Foreign (sizeOf)

import System.Environment

reverse :: MutableByteArray# s -> Int# -> Int# -> State# s -> State# s
reverse arr = go
 where
 go i j s
   | i <# j    = case readIntArray#  arr i s of { (# s, ei #) ->
                 case readIntArray#  arr j s of { (# s, ej #) ->
                 case writeIntArray# arr j ei s of { s ->
                 case writeIntArray# arr i ej s of { s ->
                 go (i +# 1#) (j -# 1#) s } } } }
   | otherwise = s
{-# INLINE reverse #-}

fill :: MutableByteArray# s -> Int# -> State# s -> State# s
fill arr n = go 0#
 where
 go i s
   | i <# n    = case writeIntArray# arr i i s of { s ->
                 go (i +# 1#) s }
   | otherwise = s
{-# INLINE fill #-}

bench :: Int -> Int -> ST s ()
bench (I# k) (I# n) = ST go
 where
 go s = case sizeOf (0 :: Int) of { I# w ->
        case newByteArray# (n *# w) s of { (# s, arr #) ->
        case fill arr n s of { s ->
        go' arr k s } } }
 go' arr = go''
  where
  go'' 0# s = (# s, () #)
  go'' k  s = case reverse arr 0# (n -# 1#) s of { s ->
              go'' (k -# 1#) s }
{-# INLINE bench #-}

main = do (k:n:_) <- map read `fmap` getArgs
          stToIO (bench k n)
          putStrLn "Done."
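
-- Attachment 2: the Addr#/Ptr version (the file referred to as Ptr.hs in this thread).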
{-# OPTIONS_GHC -fglasgow-exts #-}

module Main (main) where

import Prelude hiding (reverse)

import Control.Monad

import GHC.Base
import GHC.IOBase
import GHC.Ptr

import Foreign

import System.Environment

reverse :: Addr# -> Int -> Int -> IO ()
reverse a (I# i) (I# j) = IO $ \s -> reverse' a i j s


reverse' a i j s
  | i <# j    = case readIntOffAddr# a i s of { (# s, x #) ->
                case readIntOffAddr# a j s of { (# s, y #) ->
                case writeIntOffAddr# a j x s of { s ->
                case writeIntOffAddr# a i y s of { s ->
                reverse' a (i +# 1#) (j -# 1#) s } } } }
  | otherwise = (# s, () #)

bench :: Int -> Int -> IO ()
bench k n = do p@(Ptr a) <- mallocArray n :: IO (Ptr Int)
               fill a n
               replicateM_ k (reverse a 0 (n-1))
               free p

fill :: Addr# -> Int -> IO ()
fill a (I# n) = IO (go 0#)
 where
 go i s
   | i <# n    = case writeIntOffAddr# a i i s of s -> go (i +# 1#) s
   | otherwise = (# s, () #)

main = do (k:n:_) <- map read `fmap` getArgs
          bench k n
          putStrLn "Done."


RE: Low-level array performance

2008-06-16 Thread Simon Peyton-Jones
Dan

John Dias is indeed spending 6 months at Microsoft to work on GHC's back end.  
He's doing a pretty wholesale re-architecting job, so it will be a couple of 
months before we have the new setup glued together, but once we do I hope that 
we'll have a much more friendly framework in place for doing good optimizations.

Meanwhile, concrete examples like yours are very useful.  It would be still 
more useful if you and Don could pinpoint more accurately just what the 
difference (say) between Addr# and MutableByteArray# is. Is it array bounds 
checking, for example?  Looking first at the Core that is generated, and then 
(if it looks the same in both cases) at the C--, is often illuminating.  If you 
can say "for this little loop, GHC generates this stupid code sequence", that's 
particularly helpful.   (Don often does this.)

Even if you don't, we'll still take a look in due course, but the more you can 
pin it down the more motivating it is for us to fix it!

Do start a Trac ticket so your thoughts and code examples are not lost in the 
welter of email.

Thanks

Simon

| -Original Message-
| From: [EMAIL PROTECTED] [mailto:glasgow-haskell-
| [EMAIL PROTECTED] On Behalf Of Dan Doel
| Sent: 16 June 2008 20:52
| To: glasgow-haskell-users@haskell.org
| Subject: Low-level array performance
|
| Greetings,
|
| Recently, due to scattered complaints I'd seen on the internet, I set about
| to
| rewrite the fannkuch [1] benchmark on the Great Computer Language Shootout.
| The current entry uses Ptr/Addr#, malloc, etc. so it's not particularly
| representative of code one would actually write in Haskell these days. Over
| the past few days, I've written several versions of the benchmark, and
| collaborated a bit with local speed guru Don Stewart, but an entry that bests
| the current entry does not seem to be in the cards currently. So, I thought
| I'd write about some of the issues, and hopefully get some encouraging news
| about the situation.
|
| Issue 1: STUArrays aren't optimized as fully as one might ideally expect.
|
| This isn't so much of an issue, I suppose, except possibly for an environment
| like the shootout. I wrote versions of the benchmark (and particular pieces
| of the overall benchmark, as well) for both STUArrays, and MUArrays from
| dons' new uvector [2] library. The STUArray versions are consistently slower,
| and a look at the core reveals that STUArray code contains significantly more
| boxing than uvector. This isn't a big deal, as there's no reason not to use
| uvector, aside from it not being blessed by being in the GHC distribution.
| However, I thought I'd bring it up in case anyone hasn't heard of the library
| yet.
|
| Issue 2: Reading from/writing to a MutableByteArray# is slower than an Addr#
|
| This is, I think, the crux of the issue. The main content of the benchmark is
| reversing/shifting items in an array. To get a somewhat easier look at the
| core, I boiled things down to a benchmark that just reverses a small array
| many times. In the interest of further reducing things, I wrote a version of
| the benchmark that uses raw Addr#s, and a version that uses raw
| MutableByteArray#s. I've attached both versions.
|
| The fannkuch benchmark (input 11) with byte arrays runs in around 12 seconds
| on my machine. To get the reversal benchmark to around that time, I can tell
| it to, for instance, reverse a 10-element array 250 million times. Those same
| inputs to the Addr# version of the reverse benchmark result in a runtime of 7
| to 8 seconds, so there's a significant discrepancy at the level the benchmark
| actually runs at (the current fannkuch entry that uses Addr#/Ptr takes around
| 9 seconds, so there's other stuff going on in the benchmark evening them out,
| but this is likely the source of the difference between the two).
|
| uvector performance on the reversal benchmark is comparable with
| MutableByteArray#, so the practical slowdown using a library for actual
| public consumption seems to be in the fact that it's based on byte arrays
| (STUArray again falls further behind by retaining too many boxes).
|
| So, to wrap things up: dons and I were rather curious about the performance
| differences between MutableByteArray# and Addr#, as one might expect them to
| be comparable, both being low-level, raw pointer type structures; the code
| for
| using the two is nearly identical.
|
| (If one were to raise an Issue 3, it'd likely be that, even with Addr# and
| malloc, which is currently the fastest Haskell solution, we're not that fast.
| A glance at that shootout list reveals that we're beat by a myriad of
| Java-based languages, Python (Psyco), Ada, Ocaml, Eiffel, Lisp (SBCL), Clean,
| and of course C++, C, Fortran, etc. Is there any hope of getting performance
| improvements that will push us up in that list? I've heard that people are
| working on improved code generation, but I don't know if that can help with
| this sort of thing (or if it's more of a runtime