Re: No last core parallel slowdown on OS X

2009-04-22 Thread Dave Bayer
My first post was comparing almost identical machines: Different Q6600  
steppings (the earlier chip makes a good space heater!) on different  
motherboards, same memory, both stock speeds.


In a few weeks when the semester ends, I'll be able to try Linux -vs-  
BSD -vs- OS X on identical hardware, and try Simon's settings.


(I do love overclocking, but five minutes improving Haskell code is  
generally more effective than a day tweaking motherboard voltages.  
We're too green to use A/C in the hot California summer, and this  
computer exhausts through a dryer hose out my office window as it is.  
I don't want it any hotter, I just want more cores!)


I do have some experience comparing this code on four different Linux  
boxes, and three different Macs, and Linux does consistently worse. I  
waited to post until I could compare 4 cores against 4 cores on nearly  
identical hardware.


Also, I tried many approaches to this code, and what I've been testing  
is my best version, which also happens to be one of the simplest  
approaches to parallelism. (It so often works that way with Haskell.)  
In fairness, I should also run the standard test suite used in the  
paper.


On Apr 21, 2009, at 10:14 AM, Tyson Whitehead wrote:

 Why not try booting a CD or thumb-drive Linux distro (e.g., Ubuntu live) on
 your 2.4 GHz Q6600 OS X box and see how things stack up?  It would certainly
 eliminate any questions of hardware differences.

 Cheers!  -Tyson


I can do even better: This $65 bay device takes four 2.5-inch SATA or SAS
drives:


http://addonics.com/products/raid_system/ae4rcs25nsa.asp

Its build quality is surprisingly good, and it makes it trivial to juggle
2.5-inch SATA drives. Removing the high-low jumper disables the loud fan,
which is probably only needed for SAS drives.


My primary drive is an OCZ Vertex SSD, for which this is perfect.  I  
also have an assortment of spare laptop drives I can use, so an OS  
survey will be easy.


___
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users


Re: No last core parallel slowdown on OS X

2009-04-21 Thread Simon Marlow
2009/4/20 Dave Bayer ba...@cpw.math.columbia.edu:
 I ran some longer trials, and noticed a further pattern I wish I could
 explain:

 I'm comparing the enumeration of the roughly 69 billion atomic lattices on
 six atoms, on my four core, 2.4 GHz Q6600 box running OS X, against an eight
 core, 2 x 3.16 GHz Xeon X5460 box at my department running Linux. Note that
 my processor now costs $200 (it's the venerable Dodge Dart of quad core
 chips), while the pair of Xeon processors cost $2400. The Haskell code is
 straightforward; it uses bit fields and reverse search, but it doesn't take
 advantage of symmetry, so it must touch every lattice to complete the
 enumeration. Its memory footprint is insignificant.

 Never mind 7 cores, Linux performs worse before it runs out of cores.
 Comparing 1, 2, 3, 4 cores on each machine, look at real and user time
 in minutes, and the ratio:

 Linux
 2 x 3.16 GHz Xeon X5460
 1       2       3       4
 466.7   250.8   183.7   149.3
 466.4   479.0   505.2   528.1
 1.00    1.91    2.75    3.54

 OS X
 2.4 GHz Q6600
 1       2       3       4
 676.9   359.4   246.7   191.4
 673.4   673.7   675.9   674.8
 0.99    1.87    2.74    3.53

 These ratios match up like physical constants, or at least invariants of my
 Haskell implementation. However, the user time is constant on OS X, so these
 ratios reflect the actual parallel speedup on OS X. The user time climbs
 steadily on Linux, significantly diluting the parallel speedup on Linux.
 Somehow, whatever is going wrong in the interaction between Haskell and
 Linux is being captured in this increase in user time.

We can't necessarily blame this on Linux: the two machines have
different hardware.  There could be cache-effects at play, for
example.

Maybe you could try the new affinity options (+RTS -qa) and see if
that makes any difference?  That would reduce scheduling effects due
to the OS, although when the number of cores you use is less than the
real number of cores in the machine, the OS is still free to move
threads around.  To get reliable numbers you should really disable
some of the cores at boot-time.
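
For anyone wanting to repeat the measurement, here is a minimal sketch in
Python (an editorial illustration, not code from this thread). It builds the
command line with the -N and -qa RTS flags mentioned above and records real
and user time for one run. The helper names and the binary name ./lattices
are hypothetical, the program is assumed to have been linked with -threaded,
and the resource module is Unix-only.

```python
import resource
import subprocess
import time


def rts_command(binary, cores, pin_threads=True):
    """Build argv for a GHC-compiled program (assumed linked with -threaded)."""
    flags = ["+RTS", f"-N{cores}"]
    if pin_threads:
        flags.append("-qa")  # the affinity option mentioned in the message
    flags.append("-RTS")
    return [binary] + flags


def time_run(argv):
    """Run argv once and return (real_seconds, user_seconds) for the child."""
    user_before = resource.getrusage(resource.RUSAGE_CHILDREN).ru_utime
    start = time.perf_counter()
    subprocess.run(argv, check=True)
    real = time.perf_counter() - start
    user_after = resource.getrusage(resource.RUSAGE_CHILDREN).ru_utime
    return real, user_after - user_before
```

Calling time_run(rts_command("./lattices", n)) for n = 1, 2, 3, 4 would give
one row of real and user times per core count. This does not disable any
cores, so the OS remains free to migrate threads when n is smaller than the
number of cores in the machine.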

Cheers,
  Simon


Re: No last core parallel slowdown on OS X

2009-04-21 Thread Simon Marlow
2009/4/21 Don Stewart d...@galois.com:

 Little advice and tidbits are creeping out of Simon's head.

 Is it time for a parallel performance wiki, where every question that
 becomes an FAQ gets documented live?

    http://haskell.org/haskellwiki/Performance/Parallel

 Maybe put details on the wiki so we can grow a large FAQ to capture this
 oral tradition.

Absolutely.  One reservation I have is that advice is likely to go out
of date quite quickly; for example I'm planning to change the RTS
options again before we release 6.12.1 to improve the default
behaviour.

Another reservation I have is that it's very difficult to pin down
techniques that work consistently over different OSs and hardware.
The best we can do is to document the techniques we know about, and
advise people to try a variety of things to see which works best.
Even that would be better than nothing, of course.

Does anyone feel able to make a start setting up a wiki tree for
parallel performance?  I'd be more than happy to contribute and review
content.

Cheers,
   Simon


Re: No last core parallel slowdown on OS X

2009-04-21 Thread Don Stewart
marlowsd:
 2009/4/20 Dave Bayer ba...@cpw.math.columbia.edu:
  I ran some longer trials, and noticed a further pattern I wish I could
  explain:
 
  I'm comparing the enumeration of the roughly 69 billion atomic lattices on
  six atoms, on my four core, 2.4 GHz Q6600 box running OS X, against an eight
  core, 2 x 3.16 GHz Xeon X5460 box at my department running Linux. Note that
  my processor now costs $200 (it's the venerable Dodge Dart of quad core
  chips), while the pair of Xeon processors cost $2400. The Haskell code is
  straightforward; it uses bit fields and reverse search, but it doesn't take
  advantage of symmetry, so it must touch every lattice to complete the
  enumeration. Its memory footprint is insignificant.
 
  Never mind 7 cores, Linux performs worse before it runs out of cores.
  Comparing 1, 2, 3, 4 cores on each machine, look at real and user time
  in minutes, and the ratio:
 
  Linux
  2 x 3.16 GHz Xeon X5460
  1       2       3       4
  466.7   250.8   183.7   149.3
  466.4   479.0   505.2   528.1
  1.00    1.91    2.75    3.54
 
  OS X
  2.4 GHz Q6600
  1       2       3       4
  676.9   359.4   246.7   191.4
  673.4   673.7   675.9   674.8
  0.99    1.87    2.74    3.53
 
  These ratios match up like physical constants, or at least invariants of my
  Haskell implementation. However, the user time is constant on OS X, so these
  ratios reflect the actual parallel speedup on OS X. The user time climbs
  steadily on Linux, significantly diluting the parallel speedup on Linux.
  Somehow, whatever is going wrong in the interaction between Haskell and
  Linux is being captured in this increase in user time.
 
 We can't necessarily blame this on Linux: the two machines have
 different hardware.  There could be cache-effects at play, for
 example.
 
 Maybe you could try the new affinity options (+RTS -qa) and see if
 that makes any difference?  That would reduce scheduling effects due
 to the OS, although when the number of cores you use is less than the
 real number of cores in the machine, the OS is still free to move
 threads around.  To get reliable numbers you should really disable
 some of the cores at boot-time.
 

Little advice and tidbits are creeping out of Simon's head.

Is it time for a parallel performance wiki, where every question that
becomes an FAQ gets documented live?

http://haskell.org/haskellwiki/Performance/Parallel

Maybe put details on the wiki so we can grow a large FAQ to capture this
oral tradition.

-- Don


Re: No last core parallel slowdown on OS X

2009-04-21 Thread Tyson Whitehead
On April 21, 2009 04:39:40 Simon Marlow wrote:
  These ratios match up like physical constants, or at least invariants of
  my Haskell implementation. However, the user time is constant on OS X, so
  these ratios reflect the actual parallel speedup on OS X. The user time
  climbs steadily on Linux, significantly diluting the parallel speedup on
  Linux. Somehow, whatever is going wrong in the interaction between
  Haskell and Linux is being captured in this increase in user time.

 We can't necessarily blame this on Linux: the two machines have
 different hardware.  There could be cache-effects at play, for
 example.

Why not try booting a CD or thumb-drive Linux distro (e.g., Ubuntu live) on
your 2.4 GHz Q6600 OS X box and see how things stack up?  It would certainly
eliminate any questions of hardware differences.

Cheers!  -Tyson




Re: No last core parallel slowdown on OS X

2009-04-20 Thread Simon Marlow

Manuel M T Chakravarty wrote:
 Dave Bayer:
  In that paper, they routinely benchmark N-1 cores on an N core Linux
  box, because of a noticeable falloff using the last core, which can do
  more harm than good. I had confirmed this on my four core Linux box,
  but was puzzled that my two core MacBook showed no such falloff. Hey,
  two cores isn't representative of many cores, cache issues yada yada,
  so I waited.

  [..]
  Compared to 2 cores, using 3, 4 cores on an equivalent four core box
  running OS X gives speedups of

         1.45x, 1.9x

 As another data point, in our work on Data Parallel Haskell, we ran
 benchmarks on an 8-core Xserve (OS X) and an 8-core Sun T2 (Solaris).
 On both machines, we had no problem using all 8 cores.


I suspect some scheduling weirdness in Linux, at least in the kernel we're 
using here (2.6.25).  Traces appeared to show that one of our threads was 
being descheduled for a few ms, and this can be particularly severe in GHC 
since our stop-the-world GC needs frequent synchronisations.  One advantage 
of moving to processor-independent GCs would be that we could degrade more 
gracefully if the CPUs are contended, or the OS scheduler just decides to 
use a core for something else for a while.


Cheers,
Simon


Re: No last core parallel slowdown on OS X

2009-04-20 Thread Dave Bayer

On Apr 19, 2009, at 9:59 PM, Tyson Whitehead wrote:

 This leaves me wondering how the absolute numbers compare.  Could the
 extra overhead due to the various 32bit issues be giving more room for
 better threading performance?  What do you get if you use 32bit GHC with
 Linux?



Oddly enough, these are 32 bit GHC implementations in both cases. Our  
departmental sys admin has stayed with 32 bit Linux.


cores, real, user, ratio


Linux
2 x 3.16 GHz Xeon X5460
cores   1       2       3       4
real    466.7   250.8   183.7   149.3
user    466.4   479.0   505.2   528.1
ratio   1.00    1.91    2.75    3.54

OS X
2.4 GHz Q6600
cores   1       2       3       4
real    676.9   359.4   246.7   191.4
user    673.4   673.7   675.9   674.8
ratio   0.99    1.87    2.74    3.53



Re: No last core parallel slowdown on OS X

2009-04-20 Thread Dave Bayer
[Sorry if this turns out to be a dup, it appears that my first send  
got lost, while my followup message went through.]


I ran some longer trials, and noticed a further pattern I wish I could  
explain:


I'm comparing the enumeration of the roughly 69 billion atomic  
lattices on six atoms, on my four core, 2.4 GHz Q6600 box running OS  
X, against an eight core, 2 x 3.16 GHz Xeon X5460 box at my department
running Linux. Note that my processor now costs $200 (it's the  
venerable Dodge Dart of quad core chips), while the pair of Xeon  
processors cost $2400. The Haskell code is straightforward; it uses  
bit fields and reverse search, but it doesn't take advantage of  
symmetry, so it must touch every lattice to complete the  
enumeration. Its memory footprint is insignificant.


Never mind 7 cores, Linux performs worse before it runs out of cores.  
Comparing 1, 2, 3, 4 cores on each machine, look at real and user  
time in minutes, and the ratio:


Linux
2 x 3.16 GHz Xeon X5460
cores   1       2       3       4
real    466.7   250.8   183.7   149.3
user    466.4   479.0   505.2   528.1
ratio   1.00    1.91    2.75    3.54

OS X
2.4 GHz Q6600
cores   1       2       3       4
real    676.9   359.4   246.7   191.4
user    673.4   673.7   675.9   674.8
ratio   0.99    1.87    2.74    3.53

These ratios match up like physical constants, or at least invariants  
of my Haskell implementation. However, the user time is constant on OS  
X, so these ratios reflect the actual parallel speedup on OS X. The  
user time climbs steadily on Linux, significantly diluting the  
parallel speedup on Linux. Somehow, whatever is going wrong in the  
interaction between Haskell and Linux is being captured in this  
increase in user time.
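
The arithmetic behind the table can be checked with a short sketch in Python
(an editorial illustration; the function names are made up, and the only
data used are the times quoted above). The ratio row appears to be user time
divided by real time, which the first function reproduces to two decimals.
The other two functions separate the wall-clock speedup from the growth in
user time.

```python
# Times in minutes for 1, 2, 3 and 4 cores, copied from the message above.
TIMES = {
    "Linux": {"real": [466.7, 250.8, 183.7, 149.3],
              "user": [466.4, 479.0, 505.2, 528.1]},
    "OS X": {"real": [676.9, 359.4, 246.7, 191.4],
             "user": [673.4, 673.7, 675.9, 674.8]},
}


def user_over_real(times):
    """The 'ratio' row: user time divided by real time, per core count."""
    return [round(u / r, 2) for u, r in zip(times["user"], times["real"])]


def wall_clock_speedup(times):
    """Real time on one core divided by real time on n cores."""
    base = times["real"][0]
    return [round(base / r, 2) for r in times["real"]]


def user_time_growth(times):
    """User time on n cores relative to user time on one core."""
    base = times["user"][0]
    return [round(u / base, 2) for u in times["user"]]


for name, times in TIMES.items():
    print(name, user_over_real(times), wall_clock_speedup(times),
          user_time_growth(times))
```

On these figures the wall-clock speedup at 4 cores comes out at about 3.13
on Linux against 3.54 on OS X, while user time grows by roughly 13% on Linux
and stays flat on OS X, which is the dilution described above.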


I love how my cheap little box comes close to pulling even with a  
departmental compute server I can't afford, because of this difference  
in operating systems.




Re: No last core parallel slowdown on OS X

2009-04-19 Thread Manuel M T Chakravarty

Dave Bayer:
 In that paper, they routinely benchmark N-1 cores on an N core Linux
 box, because of a noticeable falloff using the last core, which can
 do more harm than good. I had confirmed this on my four core Linux
 box, but was puzzled that my two core MacBook showed no such
 falloff. Hey, two cores isn't representative of many cores, cache
 issues yada yada, so I waited.

 [..]
 Compared to 2 cores, using 3, 4 cores on an equivalent four core box
 running OS X gives speedups of

        1.45x, 1.9x


As another data point, in our work on Data Parallel Haskell, we ran  
benchmarks on an 8-core Xserve (OS X) and an 8-core Sun T2 (Solaris).   
On both machines, we had no problem using all 8 cores.


Manuel


Re: No last core parallel slowdown on OS X

2009-04-19 Thread Tyson Whitehead
On April 18, 2009 16:46:44 Daniel Peebles wrote:
 That looks great! I wonder what about Mac OS leads to such good
 performance...

 Now if only we could get a nice x86_64-producing GHC for Mac OS too, I
 could use all my RAM and the extra registers my Mac Pro gives me :)

I was a bit surprised when I read the initial report because

1-  I thought GHC had a hard time with 32bit x86 code due to the integer
register pressure and hacking around the stack based FPU, and

2-  I thought OS X had multithreading performance issues (or at least that is
what I had read in various reports regarding using it as a server).

This leaves me wondering how the absolute numbers compare.  Could the extra
overhead due to the various 32bit issues be giving more room for better
threading performance?  What do you get if you use 32bit GHC with Linux?

Cheers!  -Tyson





Re: No last core parallel slowdown on OS X

2009-04-18 Thread Daniel Peebles
That looks great! I wonder what about Mac OS leads to such good performance...

Now if only we could get a nice x86_64-producing GHC for Mac OS too, I
could use all my RAM and the extra registers my Mac Pro gives me :)

On Sat, Apr 18, 2009 at 2:39 PM, Dave Bayer ba...@cpw.math.columbia.edu wrote:
 I'm a huge fan of the recent paper

 http://ghcmutterings.wordpress.com/2009/03/03/new-paper-runtime-support-for-multicore-haskell/

 which put me over the top to get started writing parallel code in Haskell.
 Parallel code is now integral to my and my Ph.D. students' research. For
 example, we recently checked an assertion for the roughly 69 billion atomic
 lattices on six atoms, in a day rather than a week, using perhaps 6 lines of
 parallel code in otherwise sequential code. When you're anxiously waiting
 for the answer, a day is a lot better than a week. (The enumeration itself
 is down to two hours on 7 cores, which astounds me. I see no reason to ever
 use another language.)

 In that paper, they routinely benchmark N-1 cores on an N core Linux box,
 because of a noticeable falloff using the last core, which can do more harm
 than good. I had confirmed this on my four core Linux box, but was puzzled
 that my two core MacBook showed no such falloff. Hey, two cores isn't
 representative of many cores, cache issues yada yada, so I waited.

 I just got an EFi-X boot processor (efi-x.com) working on a nearly
 identical quad core box that I built, and I tested the same computations
 with OS X. For my test case, there's a mild cost to moving to parallel at
 all, but...

 Compared to 2 cores, using 3, 4 cores on a four core Linux box gives
 speedups of

        1.37x, 1.38x

 Compared to 2 cores, using 3, 4 cores on an equivalent four core box running
 OS X gives speedups of

        1.45x, 1.9x

 Here 1.5x, 2.0x is ideal, so I'm thrilled. If we can't shame Linux into
 fixing this, I'm never looking back. How true is this for other parallel
 languages? Haskell alone is perhaps too fringe to cause a Linux scandal over
 this, even if it should...
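
To put the quoted speedups on a common scale, here is a small Python sketch
(an editorial illustration with made-up names, using only the four figures
reported above). It divides each speedup over the 2-core run by the ideal
value for that core count, 1.5x for 3 cores and 2.0x for 4.

```python
def efficiency_vs_two_cores(cores, speedup_vs_two):
    """Fraction of the ideal speedup achieved, relative to a 2-core run."""
    ideal = cores / 2  # 3 cores -> 1.5x, 4 cores -> 2.0x
    return round(speedup_vs_two / ideal, 2)


# Speedups over the 2-core run, as reported in the message.
REPORTED = {
    "Linux": {3: 1.37, 4: 1.38},
    "OS X": {3: 1.45, 4: 1.9},
}

EFFICIENCY = {
    os_name: {n: efficiency_vs_two_cores(n, s) for n, s in runs.items()}
    for os_name, runs in REPORTED.items()
}
print(EFFICIENCY)
```

By this measure the fourth core on Linux reaches about 69% of ideal, while
OS X stays at 95% or better for both 3 and 4 cores.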

 The EFi-X boot processor itself is rather expensive ($240 now), one has to
 stick to a specific hardware compatibility list, and I needed to update
 my motherboard BIOS and the EFi-X firmware, but there was no other fiddling
 for me. These boxes are just compute servers for me; I would have been ok
 returning to Linux, but not if it means giving up a core. People worry about
 compatibility ("I sensed a softness in the surround sound in game X..."), but
 for me the above numbers put all this in perspective.

 Another way to put this, especially for those who don't have a strong
 preference for building their own machines, and can't wait for Linux to get
 its act together:

 If you're serious about parallel Haskell, buy a Mac Pro.


Re: No last core parallel slowdown on OS X

2009-04-18 Thread Dave Bayer
Yikes! You're right. I never noticed, but I never had an 8 GB Mac  
before.


I looked at ./configure for the GHC 6.10.2 source, and realized there  
was already something there. I tried


./configure --build=x86_64-apple-darwin

and it didn't work. However, it did give me something to Google,  
leading me to


Ticket #2965 (new feature request)
GHC on OS X does not compile 64-bit
1/19/09
http://hackage.haskell.org/trac/ghc/ticket/2965

Apparently this isn't a one-liner. Once my semester ends, I'll see if  
I can help.


I've got to wonder if it would be less work to sneak a gift Mac Pro  
past Microsoft security, and just wait? ;-)


The GHC team has been busting their humps on parallel code lately, and  
OS X does so much better... They should stop having to apologize in  
papers for the poor parallel performance of Linux itself. A decked-out  
Mac Pro should be the flagship platform for 64 bit, parallel GHC.


On Apr 18, 2009, at 1:46 PM, Daniel Peebles wrote:

 That looks great! I wonder what about Mac OS leads to such good
 performance...

 Now if only we could get a nice x86_64-producing GHC for Mac OS too, I
 could use all my RAM and the extra registers my Mac Pro gives me :)




Re: No last core parallel slowdown on OS X

2009-04-18 Thread Austin Seipp
Excerpts from Dave Bayer's message of Sat Apr 18 19:05:34 -0500 2009:
 Yikes! You're right. I never noticed, but I never had an 8 GB Mac  
 before.
 
 I looked at ./configure for the GHC 6.10.2 source, and realized there  
 was already something there. I tried
 
 ./configure --build=x86_64-apple-darwin
 
 and it didn't work. However, it did give me something to Google,  
 leading me to
 
 Ticket #2965 (new feature request)
 GHC on OS X does not compile 64-bit
 1/19/09
 http://hackage.haskell.org/trac/ghc/ticket/2965
 
 Apparently this isn't a one-liner. Once my semester ends, I'll see if  
 I can help.
 
 I've got to wonder if it would be less work to sneak a gift Mac Pro  
 past Microsoft security, and just wait? ;-)
 
 The GHC team has been busting their humps on parallel code lately, and  
 OS X does so much better... They should stop having to apologize in  
 papers for the poor parallel performance of Linux itself. A decked-out  
 Mac Pro should be the flagship platform for 64 bit, parallel GHC.
 
 On Apr 18, 2009, at 1:46 PM, Daniel Peebles wrote:
 
  That looks great! I wonder what about Mac OS leads to such good  
  performance...
 
  Now if only we could get a nice x86_64-producing GHC for Mac OS too, I
  could use all my RAM and the extra registers my Mac Pro gives me :)
 

Please add yourself to the CC list of the bug - more people need to
show they care! I'm currently the owner of the bug, so if you bother
me enough I'll get to working on it quicker (once I have more time...)
At the very least, once the build system fixes are in place to allow
hc-bootstrapping, it should happen fairly quickly. Right now I'm just
not quite sure what the full path necessary is for getting a copy of
GHC head to be 64-bit on OS X.

Daniel, have you gotten anywhere with your version of GHC 6.6 on OS X?

Austin