The tool hwloc can be useful in understanding the organization of the cores
and memory on a machine. For example, I run

lstopo --no-icaches --no-io --ignore PU

(along with "make streams" in the PETSc root directory) on my laptop. The
streams benchmark reports
np speedup
1 1.0
2 1.43
3 1.47
4 1.45
Estimation of possible speedup of MPI programs based on Streams benchmark.
It appears you have 1 node(s)
See graph in the file src/benchmarks/streams/scaling.png
and lstopo shows

Machine (16GB) + NUMANode L#0 (P#0 16GB) + L3 L#0 (6144KB)
L2 L#0 (256KB) + L1d L#0 (32KB) + Core L#0
L2 L#1 (256KB) + L1d L#1 (32KB) + Core L#1
L2 L#2 (256KB) + L1d L#2 (32KB) + Core L#2
L2 L#3 (256KB) + L1d L#3 (32KB) + Core L#3
This system has one “memory bank”: a single CPU with 4 cores. Note that when
two cores are running the streams benchmark they are already using essentially
all of the memory bandwidth (the speedup plateaus at about 1.45), hence you
get no further speedup beyond two cores.
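If you want to reproduce this kind of table on your own machine, something
along these lines should do it (a sketch only: NPMAX is the same make variable
that appears later in this thread, and 4 matches the core count of this laptop):

# from the PETSc root directory (assuming PETSC_DIR points there),
# run the streams benchmark with 1 through NPMAX MPI processes
cd $PETSC_DIR
make streams NPMAX=4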
Next I run on a “server” class workstation with two “memory banks”, each
associated with a CPU with 8 cores
np speedup
1 1.0
2 1.8
3 2.21
4 2.35
5 2.4
6 2.41
7 3.3
8 2.4
9 2.66
10 2.22
11 2.28
12 4.04
13 2.46
14 2.61
15 4.11
16 3.01
Estimation of possible speedup of MPI programs based on Streams benchmark.
It appears you have 1 node(s)
See graph in the file src/benchmarks/streams/scaling.png
and lstopo shows

Machine (128GB)
NUMANode L#0 (P#0 64GB) + Socket L#0 + L3 L#0 (20MB)
L2 L#0 (256KB) + L1d L#0 (32KB) + Core L#0
L2 L#1 (256KB) + L1d L#1 (32KB) + Core L#1
L2 L#2 (256KB) + L1d L#2 (32KB) + Core L#2
L2 L#3 (256KB) + L1d L#3 (32KB) + Core L#3
L2 L#4 (256KB) + L1d L#4 (32KB) + Core L#4
L2 L#5 (256KB) + L1d L#5 (32KB) + Core L#5
L2 L#6 (256KB) + L1d L#6 (32KB) + Core L#6
L2 L#7 (256KB) + L1d L#7 (32KB) + Core L#7
NUMANode L#1 (P#1 64GB) + Socket L#1 + L3 L#1 (20MB)
L2 L#8 (256KB) + L1d L#8 (32KB) + Core L#8
L2 L#9 (256KB) + L1d L#9 (32KB) + Core L#9
L2 L#10 (256KB) + L1d L#10 (32KB) + Core L#10
L2 L#11 (256KB) + L1d L#11 (32KB) + Core L#11
L2 L#12 (256KB) + L1d L#12 (32KB) + Core L#12
L2 L#13 (256KB) + L1d L#13 (32KB) + Core L#13
L2 L#14 (256KB) + L1d L#14 (32KB) + Core L#14
L2 L#15 (256KB) + L1d L#15 (32KB) + Core L#15
Note that the speedup gets as high as 4, meaning the memory is fast enough to
fully serve at least four cores. But the speedup jumps all over the place as
the number of cores goes from 1 to 16. I am guessing that is because the MPI
processes are not being mapped to cores well. So I run with the additional
MPICH mpiexec options -bind-to socket -map-by hwthread and get
np speedup
1 1.0
2 2.26
3 2.79
4 2.93
5 2.99
6 3.0
7 3.01
8 2.99
9 2.81
10 2.81
11 2.9
12 2.94
13 2.94
14 2.94
15 2.93
16 2.93
Estimation of possible speedup of MPI programs based on Streams benchmark.
Then I run with just -bind-to socket and get much better numbers:
np speedup
1 1.0
2 2.41
3 3.36
4 4.45
5 4.51
6 5.45
7 5.07
8 5.81
9 5.27
10 5.93
11 5.42
12 5.95
13 5.49
14 5.94
15 5.56
16 5.88
Using this option I get roughly a speedup of 6.
See
http://wiki.mpich.org/mpich/index.php/Using_the_Hydra_Process_Manager#Process-core_Binding
for more information on these options.
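For concreteness, the two bound runs above correspond to invocations roughly
like the following (a sketch only: "stream_mpi" is a placeholder for whatever
MPI streams executable is being timed, and 16 is just this machine's core
count):

# bind each rank to a socket and map ranks by hardware thread (first bound run)
mpiexec -n 16 -bind-to socket -map-by hwthread ./stream_mpi

# bind each rank to a socket only (the run that reached a speedup of roughly 6)
mpiexec -n 16 -bind-to socket ./stream_mpi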
Barry
On Sep 16, 2014, at 10:08 AM, Katy Ghantous <[email protected]> wrote:
> Thank you! This has been extremely useful in figuring out a plan of action.
>
>
> On Mon, Sep 15, 2014 at 9:08 PM, Barry Smith <[email protected]> wrote:
>
> Based on the streams speedups below it looks like a single core can utilize
> roughly 1/2 of the memory bandwidth, leaving all the other cores only 1/2 of
> the bandwidth to utilize, so you can only expect at best a speedup of roughly
> 2 on this machine with traditional PETSc sparse solvers.
>
> To add insult to injury it appears that the threads are not being assigned
> to physical cores very well either. Under the best circumstances on this
> system one would like to see a speedup of about 2 when running with two
> processes, but it actually delivers only 1.23, and a speedup of 2 only occurs
> with 5 processes. I attribute this to the MPI implementation or the OS not
> assigning the second MPI process to the “best” core for memory bandwidth.
> Likely it should assign the second MPI process to the second CPU, but instead
> it assigns it to the first CPU as well, and only when it gets to the 5th MPI
> process does the second CPU get utilized.
>
> You can look at the documentation for your MPI’s process affinity to see
> if you can force the 2nd MPI process onto the second CPU.
>
> Barry
>
>
> np speedup
> 1 1.0
> 2 1.23
> 3 1.3
> 4 1.75
> 5 2.18
> 6 1.22
> 7 2.3
> 8 1.22
> 9 2.01
> 10 1.19
> 11 1.93
> 12 1.93
> 13 1.73
> 14 2.17
> 15 1.99
> 16 2.08
> 17 2.16
> 18 1.47
> 19 1.95
> 20 2.09
> 21 1.9
> 22 1.96
> 23 1.92
> 24 2.02
> 25 1.96
> 26 1.89
> 27 1.93
> 28 1.97
> 29 1.96
> 30 1.93
> 31 2.16
> 32 2.12
> Estimation of possible speedup of MPI programs based on Streams benchmark.
>
> On Sep 15, 2014, at 1:42 PM, Katy Ghantous <[email protected]> wrote:
>
> > Matt, thanks! I will look into that and find other ways to make the
> > computation faster.
> >
> > Barry, the benchmark reports a speedup of up to 2, but says 1 node at the
> > end. Either way I was expecting a higher speedup. Is 2 the limit for two
> > CPUs despite the multiple cores?
> >
> > Please let me know if the attached file is what you are asking for.
> > Thank you!
> >
> >
> > On Mon, Sep 15, 2014 at 8:23 PM, Barry Smith <[email protected]> wrote:
> >
> > Please send the output from running
> >
> > make streams NPMAX=32
> >
> > in the PETSc root directory.
> >
> >
> > Barry
> >
> > My guess is that it reports “one node” just because it uses the
> > “hostname” to distinguish nodes; though your machine has two CPUs, from
> > the point of view of the OS it has only a single hostname and hence it
> > reports just one “node”.
> >
> >
> > On Sep 15, 2014, at 12:45 PM, Katy Ghantous <[email protected]> wrote:
> >
> > > Hi,
> > > I am using DMDA to run TS in parallel to solve a set of N equations. I
> > > am using DMDAGetCorners in the RHSFunction, with the stencil width set
> > > to 2, to solve a set of coupled ODEs on 30 cores.
> > > The machine has 32 cores (2 physical CPUs with 2x8 cores each, at a
> > > speed of 3.4 GHz per core).
> > > However, mpiexec with more than one core shows no speedup.
> > > Also, at the configure/test stage for PETSc on that machine, there was
> > > no speedup and it reported only one node.
> > > Is there something wrong with how I configured PETSc, or is the approach
> > > inappropriate for the machine?
> > > I am not sure what files (or sections of the code) you would need to be
> > > able to answer my question.
> > >
> > > Thank you!
> >
> >
> > <scaling.log>