[Haskell-cafe] Re: Parallel Pi

2010-03-21 Thread Daniel Fischer
-Original Message-
From: Simon Marlow marlo...@gmail.com
Sent: 19.03.2010 09:24:12
To: Daniel Fischer daniel.is.fisc...@web.de
Subject: Re: Parallel Pi

On 18/03/10 22:52, Daniel Fischer wrote:
 On Thursday 18 March 2010 22:44:55, Simon Marlow wrote:
 On 17/03/10 21:30, Daniel Fischer wrote:
 It works for me (GHC 6.12.1):

 SPARKS: 1 (1 converted, 0 pruned)

 INIT  time    0.00s  (  0.00s elapsed)
 MUT   time    9.05s  (  4.54s elapsed)
 GC    time    0.12s  (  0.09s elapsed)
 EXIT  time    0.00s  (  0.01s elapsed)
 Total time    9.12s  (  4.63s elapsed)

 wall-clock speedup of 1.93 on 2 cores.

 Is that Artyom's original code or with the pseq'ed length?

Your fixed version.

Good. So I can at least continue to believe I have a rough idea of how GHC 
behaves.


 And, with -N2, I also have a productivity of 193.5%, but the elapsed time
 is larger than the elapsed time for -N1. How long does it take with -N1 for
 you?

The 1.93 speedup was compared to the time for -N1 (8.98s in my case).
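
For reference, the speedup figures in this thread are just the -N1 wall-clock time divided by the -N2 wall-clock time. A minimal sketch of that arithmetic, using the elapsed times quoted in the thread (the function names are illustrative only, not from any library):

```haskell
-- Wall-clock speedup: single-threaded elapsed time over parallel elapsed time.
speedup :: Double -> Double -> Double
speedup tSerial tParallel = tSerial / tParallel

-- Parallel efficiency: speedup divided by the number of cores used.
efficiency :: Int -> Double -> Double -> Double
efficiency cores tSerial tParallel =
  speedup tSerial tParallel / fromIntegral cores

main :: IO ()
main = do
  -- Simon's run: 8.98s with -N1, 4.63s total elapsed with -N2 (roughly 1.94).
  print (speedup 8.98 4.63)
  -- Daniel's run: 7.18s single-threaded, 8.34s with -N2 (below 1, a slowdown).
  print (speedup 7.18 8.34)
  -- Efficiency of Simon's run on 2 cores.
  print (efficiency 2 8.98 4.63)
```

A value close to the core count means the cores did independent work; a value below 1 means the parallel run lost time overall.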

 What hardware are you using there?

 3.06GHz Pentium 4, 2 cores.
 I have mixed results with parallelism, some programmes get a speed-up of
 nearly a factor 2 (wall-clock time), others 1.4, 1.5 or so, yet others take
 about the same wall-clock time as the single threaded programme, some -
 like this - take longer despite using both cores intensively.

I suspect it's something specific to that processor, probably 
cache-related.  Perhaps we've managed to put some data frequently 
accessed by both CPUs on the same cache line.  I'd have to do some 
detailed profiling on that processor to find out though.  If you have 
the time and inclination, install oprofile and look for things like 
memory ordering stalls.


It seems that I've just been fooled by /proc/cpuinfo listing it as two and 
having something like 190% cpu usage in top/time.
Being oblivious of almost everything hardware-related, I naively took it at 
face value.
In fact it's probably just one hyperthreaded CPU, so since the two threads here 
do exactly the same type of work, it's natural then that it doesn't give a 
speed-up.

 Have you tried changing any GC settings?

 I've played around a little with -qg and -qb and -C, but that showed little
 influence. Any tips what else might be worth a try?

-A would be the other thing to try.

Cheers,
   Simon


 Cheers,
 Simon


___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Re: Parallel Pi

2010-03-19 Thread Brandon S. Allbery KF8NH

On Mar 18, 2010, at 21:25 , Xiao-Yong Jin wrote:

On Fri, 19 Mar 2010 01:22:58 +0100, Daniel Fischer wrote:

core id : 0
cpu cores   : 1


It is one of those pathetic single core pentium4 with so
called hyper-threading enabled.  You should have checked the
intel product spreadsheet before investing in such an old cpu.


I'm a little surprised it's using both; I thought Linux (and other  
OSes) had disabled HTT by default because of the cache sniffing attacks.


--
brandon s. allbery [solaris,freebsd,perl,pugs,haskell] allb...@kf8nh.com
system administrator [openafs,heimdal,too many hats] allb...@ece.cmu.edu
electrical and computer engineering, carnegie mellon university    KF8NH






Re: [Haskell-cafe] Re: Parallel Pi

2010-03-19 Thread Brandon S. Allbery KF8NH

On Mar 18, 2010, at 21:58 , Daniel Fischer wrote:

On Friday 19 March 2010 02:25:47, Xiao-Yong Jin wrote:

On Fri, 19 Mar 2010 01:22:58 +0100, Daniel Fischer wrote:

core id : 0
cpu cores   : 1


It is one of those pathetic single core pentium4 with so
called hyper-threading enabled.


'kay, but why does it say

processor   : 0
...
processor   : 1
?


Because that's how Linux presents what amounts to CPU resources,  
whether real (multiple cores) or virtual (HTT).  You need to scan down  
to the core information to see if they're real or not.
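
As a rough illustration of that check, the sketch below counts the "processor" entries and the distinct (physical id, core id) pairs in cpuinfo-style text. It assumes the x86 Linux field names shown in this thread; `field`, `countCores` and `sample` are hypothetical names, and `sample` is an abbreviated copy of the output posted earlier.

```haskell
import Data.Char (isSpace)
import Data.List (nub)

-- Split a cpuinfo line such as "core id : 0" into (key, value).
-- Lines without a colon (e.g. wrapped flag lists) yield Nothing.
field :: String -> Maybe (String, String)
field line = case break (== ':') line of
  (k, ':' : v) -> Just (trim k, trim v)
  _            -> Nothing
  where
    trim = dropWhile isSpace . reverse . dropWhile isSpace . reverse

-- Return (logical processors, distinct physical cores).
-- A physical core is identified by its (physical id, core id) pair.
countCores :: String -> (Int, Int)
countCores txt = (length procs, length (nub cores))
  where
    fs    = [kv | Just kv <- map field (lines txt)]
    procs = [v | ("processor", v) <- fs]
    pids  = [v | ("physical id", v) <- fs]
    cids  = [v | ("core id", v) <- fs]
    cores = zip pids cids

-- Abbreviated cpuinfo for the hyperthreaded Pentium 4 in this thread.
sample :: String
sample = unlines
  [ "processor   : 0", "physical id : 0", "siblings    : 2"
  , "core id     : 0", "cpu cores   : 1", ""
  , "processor   : 1", "physical id : 0", "siblings    : 2"
  , "core id     : 0", "cpu cores   : 1" ]

main :: IO ()
main = print (countCores sample)  -- prints (2,1)
```

Two logical processors but one physical core is the hyperthreading case. To inspect a real machine, pass the contents of /proc/cpuinfo to `countCores` instead of `sample`.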


--
brandon s. allbery [solaris,freebsd,perl,pugs,haskell] allb...@kf8nh.com
system administrator [openafs,heimdal,too many hats] allb...@ece.cmu.edu
electrical and computer engineering, carnegie mellon university    KF8NH






[Haskell-cafe] Re: Parallel Pi

2010-03-19 Thread Simon Marlow

On 18/03/10 22:52, Daniel Fischer wrote:

On Thursday 18 March 2010 22:44:55, Simon Marlow wrote:

On 17/03/10 21:30, Daniel Fischer wrote:

On Wednesday 17 March 2010 19:49:57, Artyom Kazak wrote:

Hello!
I tried to implement a parallel Monte Carlo computation of pi,
using two cores:


move


But it uses only one core:


snip


We see that our one spark is pruned. Why?


Well, the problem is that your tasks don't do any real work - yet.
piMonte returns a thunk pretty immediately, that thunk is then
evaluated by show, long after your chance for parallelism is gone. You
must force the work to be done _in_ r1 and r2, then you get
parallelism:

Generation 0:  2627 collections,  2626 parallel,  0.14s,  0.12s elapsed
Generation 1:     1 collections,     1 parallel,  0.00s,  0.00s elapsed

Parallel GC work balance: 1.79 (429262 / 240225, ideal 2)

                      MUT time (elapsed)       GC time  (elapsed)
Task  0 (worker) :    0.00s    (  8.22s)       0.00s    (  0.00s)
Task  1 (worker) :    8.16s    (  8.22s)       0.01s    (  0.01s)
Task  2 (worker) :    8.00s    (  8.22s)       0.13s    (  0.11s)
Task  3 (worker) :    0.00s    (  8.22s)       0.00s    (  0.00s)

SPARKS: 1 (1 converted, 0 pruned)

INIT  time    0.00s  (  0.00s elapsed)
MUT   time   16.14s  (  8.22s elapsed)
GC    time    0.14s  (  0.12s elapsed)
EXIT  time    0.00s  (  0.00s elapsed)
Total time   16.29s  (  8.34s elapsed)

%GC time       0.9%  (1.4% elapsed)

Alloc rate    163,684,377 bytes per MUT second

Productivity  99.1% of total user, 193.5% of total elapsed

But alas, it is slower than the single-threaded calculation :(

INIT  time    0.00s  (  0.00s elapsed)
MUT   time    7.08s  (  7.10s elapsed)
GC    time    0.08s  (  0.08s elapsed)
EXIT  time    0.00s  (  0.00s elapsed)
Total time    7.15s  (  7.18s elapsed)
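
The original program is elided in this archive, so the following is only a sketch of the point made above: the sparked value has to do its work when it is evaluated, not hand back a lazy thunk. Here each half returns a strict Int count, so evaluating it to weak head normal form performs the whole computation. The generator is a hand-rolled LCG chosen to avoid extra dependencies (an assumption, not the generator from the original code), and `par`/`pseq` are taken from GHC.Conc in base; Control.Parallel in the parallel package exports the same functions.

```haskell
import GHC.Conc (par, pseq)
import Data.Word (Word64)

-- A tiny linear congruential generator, used only to keep the sketch
-- dependency-free; it is not the generator from the original program.
next :: Word64 -> Word64
next s = s * 6364136223846793005 + 1442695040888963407

-- Map the top 53 bits of the state to a Double in [0, 1).
toUnit :: Word64 -> Double
toUnit s = fromIntegral (s `div` 2048) / 9007199254740992

-- Count how many of n pseudo-random points fall inside the unit
-- quarter-circle. The accumulator is forced on every step, so the
-- result is a fully computed Int rather than a chain of thunks.
hits :: Int -> Word64 -> Int
hits n seed = go n seed 0
  where
    go :: Int -> Word64 -> Int -> Int
    go k s acc
      | k <= 0    = acc
      | otherwise =
          let s1   = next s
              s2   = next s1
              x    = toUnit s1
              y    = toUnit s2
              acc' = if x * x + y * y <= 1 then acc + 1 else acc
          in acc' `seq` go (k - 1) s2 acc'

-- Estimate pi from two halves of n points each. r1 is sparked with `par`;
-- `pseq` makes the current thread evaluate r2 before combining the results.
piPar :: Int -> Double
piPar n =
  r1 `par` (r2 `pseq` (4 * fromIntegral (r1 + r2) / fromIntegral (2 * n)))
  where
    r1 = hits n 1
    r2 = hits n 2

main :: IO ()
main = print (piPar 1000000)
```

Built with -threaded and run with +RTS -N2 -s, the statistics should show whether the spark was converted or pruned. Whether that turns into a wall-clock gain depends on the hardware, as the timings above show.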


It works for me (GHC 6.12.1):

SPARKS: 1 (1 converted, 0 pruned)

INIT  time    0.00s  (  0.00s elapsed)
MUT   time    9.05s  (  4.54s elapsed)
GC    time    0.12s  (  0.09s elapsed)
EXIT  time    0.00s  (  0.01s elapsed)
Total time    9.12s  (  4.63s elapsed)

wall-clock speedup of 1.93 on 2 cores.


Is that Artyom's original code or with the pseq'ed length?


Your fixed version.


And, with -N2, I also have a productivity of 193.5%, but the elapsed time
is larger than the elapsed time for -N1. How long does it take with -N1 for
you?


The 1.93 speedup was compared to the time for -N1 (8.98s in my case).


What hardware are you using there?


3.06GHz Pentium 4, 2 cores.
I have mixed results with parallelism, some programmes get a speed-up of
nearly a factor 2 (wall-clock time), others 1.4, 1.5 or so, yet others take
about the same wall-clock time as the single threaded programme, some -
like this - take longer despite using both cores intensively.


I suspect it's something specific to that processor, probably 
cache-related.  Perhaps we've managed to put some data frequently 
accessed by both CPUs on the same cache line.  I'd have to do some 
detailed profiling on that processor to find out though.  If you have 
the time and inclination, install oprofile and look for things like 
memory ordering stalls.



Have you tried changing any GC settings?


I've played around a little with -qg and -qb and -C, but that showed little
influence. Any tips what else might be worth a try?


-A would be the other thing to try.

Cheers,
Simon



Cheers,
Simon






Re: [Haskell-cafe] Re: Parallel Pi

2010-03-19 Thread Ketil Malde
Daniel Fischer daniel.is.fisc...@web.de writes:

 3.06GHz Pentium 4, 2 cores.

[I.e. a single-core hyperthreaded CPU]

 I have mixed results with parallelism, some programmes get a speed-up of 
 nearly a factor 2 (wall-clock time), others 1.4, 1.5 or so, yet others take 
 about the same wall-clock time as the single threaded programme, some - 
 like this - take longer despite using both cores intensively.

Given the negative press around HT, I'm surprised you see such good
results on many programs.  I thought the main benefit from Intel's HT was
to reduce the impact of memory latency, that is, when one thread was
blocking on memory, it could switch immediately to another, ready-to-run,
thread. (I may be misunderstanding this, though).

I think the general consensus was a 10-15% speedup from HT.

Anyway, the thing to get these days is of course Nehalem, A.K.A. Core
i{3,5,7}, which seems to give a nice speedup over Core 2.  Among other
things, it dynamically overclocks the busy cores (using the
more market-friendly term turbo mode), making it even harder to
compare performance reliably.  Interesting times.

-k
-- 
If I haven't seen further, it is by standing in the footprints of giants


[Haskell-cafe] Re: Parallel Pi

2010-03-19 Thread Simon Marlow

On 19/03/10 09:00, Ketil Malde wrote:

Daniel Fischer daniel.is.fisc...@web.de writes:


3.06GHz Pentium 4, 2 cores.


[I.e. a single-core hyperthreaded CPU]


Ah, that would definitely explain a lack of parallelism.  I'm just 
grateful we don't have another one of those multicore cache-line 
performance bugs, because they're a nightmare to track down.


Cheers,
Simon


[Haskell-cafe] Re: Parallel Pi

2010-03-18 Thread Simon Marlow

On 17/03/10 21:30, Daniel Fischer wrote:

On Wednesday 17 March 2010 19:49:57, Artyom Kazak wrote:

Hello!
I tried to implement a parallel Monte Carlo computation of pi,
using two cores:

move


But it uses only one core:


snip


We see that our one spark is pruned. Why?



Well, the problem is that your tasks don't do any real work - yet.
piMonte returns a thunk pretty immediately, that thunk is then evaluated by
show, long after your chance for parallelism is gone. You must force the
work to be done _in_ r1 and r2, then you get parallelism:

   Generation 0:  2627 collections,  2626 parallel,  0.14s,  0.12s elapsed
   Generation 1:     1 collections,     1 parallel,  0.00s,  0.00s elapsed

   Parallel GC work balance: 1.79 (429262 / 240225, ideal 2)

                         MUT time (elapsed)       GC time  (elapsed)
   Task  0 (worker) :    0.00s    (  8.22s)       0.00s    (  0.00s)
   Task  1 (worker) :    8.16s    (  8.22s)       0.01s    (  0.01s)
   Task  2 (worker) :    8.00s    (  8.22s)       0.13s    (  0.11s)
   Task  3 (worker) :    0.00s    (  8.22s)       0.00s    (  0.00s)

   SPARKS: 1 (1 converted, 0 pruned)

   INIT  time    0.00s  (  0.00s elapsed)
   MUT   time   16.14s  (  8.22s elapsed)
   GC    time    0.14s  (  0.12s elapsed)
   EXIT  time    0.00s  (  0.00s elapsed)
   Total time   16.29s  (  8.34s elapsed)

   %GC time       0.9%  (1.4% elapsed)

   Alloc rate    163,684,377 bytes per MUT second

   Productivity  99.1% of total user, 193.5% of total elapsed

But alas, it is slower than the single-threaded calculation :(

   INIT  time    0.00s  (  0.00s elapsed)
   MUT   time    7.08s  (  7.10s elapsed)
   GC    time    0.08s  (  0.08s elapsed)
   EXIT  time    0.00s  (  0.00s elapsed)
   Total time    7.15s  (  7.18s elapsed)


It works for me (GHC 6.12.1):

  SPARKS: 1 (1 converted, 0 pruned)

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    9.05s  (  4.54s elapsed)
  GC    time    0.12s  (  0.09s elapsed)
  EXIT  time    0.00s  (  0.01s elapsed)
  Total time    9.12s  (  4.63s elapsed)

wall-clock speedup of 1.93 on 2 cores.

What hardware are you using there? Have you tried changing any GC settings?

Cheers,
Simon


Re: [Haskell-cafe] Re: Parallel Pi

2010-03-18 Thread Erik de Castro Lopo
Daniel Fischer wrote:

 3.06GHz Pentium 4, 2 cores.

Do you have more info on that? Try:

 grep 'model name' /proc/cpuinfo

The original Pentium 4 (eg Intel(R) Pentium(R) 4 CPU 3.00GHz) had
hyperthreading which was actually pretty pathetic for parallelism.

The Core 2 Duos (eg Intel(R) Core(TM)2 Duo CPU T9600  @ 2.80GHz)
are far superior.

Erik
-- 
--
Erik de Castro Lopo
http://www.mega-nerd.com/


Re: [Haskell-cafe] Re: Parallel Pi

2010-03-18 Thread Daniel Fischer
On Friday 19 March 2010 00:56:15, Erik de Castro Lopo wrote:
 Daniel Fischer wrote:
  3.06GHz Pentium 4, 2 cores.

 Do you have more info on that? Try:

  grep 'model name' /proc/cpuinfo

Well,

$ cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 15
model           : 4
model name      : Intel(R) Pentium(R) 4 CPU 3.06GHz
stepping        : 9
cpu MHz         : 3058.795
cache size      : 1024 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 1
apicid          : 0
initial apicid  : 0
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe lm
constant_tsc pebs bts pni monitor ds_cpl tm2 cid cx16 xtpr lahf_lm
bogomips        : 6117.59
clflush size    : 64
power management:

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 15
model           : 4
model name      : Intel(R) Pentium(R) 4 CPU 3.06GHz
stepping        : 9
cpu MHz         : 3058.795
cache size      : 1024 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 1
apicid          : 1
initial apicid  : 1
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe lm
constant_tsc pebs bts pni monitor ds_cpl tm2 cid cx16 xtpr lahf_lm
bogomips        : 6118.20
clflush size    : 64
power management:

Does that mean two CPUs, each with two siblings, or what is the correct 
interpretation?


 The original Pentium 4 (eg Intel(R) Pentium(R) 4 CPU 3.00GHz) had
 hyperthreading which was actually pretty pathetic for parallelism.

 The Core 2 Duos (eg Intel(R) Core(TM)2 Duo CPU T9600  @ 2.80GHz)
 are far superior.

But probably also far more expensive :)
I bought something cheap and was actually surprised when I discovered that 
it seemed to have two Cores/CPUs.


 Erik



Re: [Haskell-cafe] Re: Parallel Pi

2010-03-18 Thread Xiao-Yong Jin
On Fri, 19 Mar 2010 01:22:58 +0100, Daniel Fischer wrote:

 core id : 0
 cpu cores   : 1

It is one of those pathetic single core pentium4 with so
called hyper-threading enabled.  You should have checked the
intel product spreadsheet before investing in such an old cpu.
-- 
Jc/*__o/*
X\ * (__
Y*/\  


Re: [Haskell-cafe] Re: Parallel Pi

2010-03-18 Thread Daniel Fischer
On Friday 19 March 2010 02:25:47, Xiao-Yong Jin wrote:
 On Fri, 19 Mar 2010 01:22:58 +0100, Daniel Fischer wrote:
  core id : 0
  cpu cores   : 1

 It is one of those pathetic single core pentium4 with so
 called hyper-threading enabled.

'kay, but why does it say

processor   : 0 
...
processor   : 1
?

 You should have checked the
 intel product spreadsheet before investing in such an old cpu.

It was the cheapest box in town :)
And it was less old when I bought it.


Re: [Haskell-cafe] Re: Parallel Pi

2010-03-18 Thread Erik de Castro Lopo
Daniel Fischer wrote:

 On Friday 19 March 2010 02:25:47, Xiao-Yong Jin wrote:
 
  It is one of those pathetic single core pentium4 with so
  called hyper-threading enabled.
 
 'kay, but why does it say
 
 processor   : 0 
 ...
 processor   : 1

Hyperthreading is explained here:

http://www.pcstats.com/articleview.cfm?articleID=1302

As explained, two hyperthreads are not equivalent to two CPU cores
because the two hyperthreads share resources while 2 discrete cores
do not.

As I remember it, the performance of the Pentium 4s with HT never lived
up to the promise and that line was swiftly replaced by the Core 2 Duo
range of CPUs, which were actually quite good.

As a rough and ready test, I compiled Ben Lippmeier's DDC compiler
on the following CPUS:

   a) Intel(R) Pentium(R) 4 CPU 3.00GHz (2Meg cache)
   b) Intel(R) Core(TM)2 Duo CPU T9600  @ 2.80GHz (6Meg cache)

Using ghc-6.12.1 on both (32bit Ubuntu 10.04 chroot for the P4
and a 32bit Debian unstable chroot for the Core2Duo), compiling DDC
took (using 'make clean ; time make'):

   a) 2m54.301s on the P4 HT
   b) 0m59.277s on the Core2Duo

If nothing else, it shows that two CPUs with similar clock speeds
and the same number of processors listed in /proc/cpuinfo can have
vastly different performance characteristics.

Erik
-- 
--
Erik de Castro Lopo
http://www.mega-nerd.com/


Re: [Haskell-cafe] Re: Parallel Pi

2010-03-18 Thread Daniel Fischer
On Friday 19 March 2010 04:24:21, Erik de Castro Lopo wrote:
 Daniel Fischer wrote:
  On Friday 19 March 2010 02:25:47, Xiao-Yong Jin wrote:
   It is one of those pathetic single core pentium4 with so
   called hyper-threading enabled.
 
  'kay, but why does it say
 
  processor   : 0
  ...
  processor   : 1

 Hyperthreading is explained here:

 http://www.pcstats.com/articleview.cfm?articleID=1302

Thanks. That clears things up a little.


 Erik
