The tool hwloc can be useful in understanding the organization of the cores
and memory on a machine. For example, I run

lstopo --no-icaches --no-io --ignore PU

(along with "make streams" in the PETSc root directory) on my laptop. The
streams benchmark reports
np speedup
1 1.0
2 1.43
3 1.47
4 1.45
Estimation of possible speedup of MPI programs based on Streams benchmark.
It appears you have 1 node(s)
See graph in the file src/benchmarks/streams/scaling.png
and lstopo shows

Machine (16GB) + NUMANode L#0 (P#0 16GB) + L3 L#0 (6144KB)
L2 L#0 (256KB) + L1d L#0 (32KB) + Core L#0
L2 L#1 (256KB) + L1d L#1 (32KB) + Core L#1
L2 L#2 (256KB) + L1d L#2 (32KB) + Core L#2
L2 L#3 (256KB) + L1d L#3 (32KB) + Core L#3
This system has one “memory bank”: a single CPU with 4 cores. Note that when
two cores are running the streams benchmark they are already using essentially
all of the memory bandwidth (the speedup plateaus at about 1.45), hence you
get no further speedup beyond two cores.
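If you want to reproduce this kind of table on your own machine, something
along these lines should do it (a sketch only: NPMAX is the same make variable
that appears later in this thread, and 4 matches the core count of this laptop):

# from the PETSc root directory (assuming PETSC_DIR points there),
# run the streams benchmark with 1 through NPMAX MPI processes
cd $PETSC_DIR
make streams NPMAX=4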
Next I run on a “server” class workstation with two “memory banks”, each
associated with a CPU with 8 cores
np speedup
1 1.0
2 1.8
3 2.21
4 2.35
5 2.4
6 2.41
7 3.3
8 2.4
9 2.66
10 2.22
11 2.28
12 4.04
13 2.46
14 2.61
15 4.11
16 3.01
Estimation of possible speedup of MPI programs based on Streams benchmark.
It appears you have 1 node(s)
See graph in the file src/benchmarks/streams/scaling.png
and lstopo shows

Machine (128GB)
NUMANode L#0 (P#0 64GB) + Socket L#0 + L3 L#0 (20MB)
L2 L#0 (256KB) + L1d L#0 (32KB) + Core L#0
L2 L#1 (256KB) + L1d L#1 (32KB) + Core L#1
L2 L#2 (256KB) + L1d L#2 (32KB) + Core L#2
L2 L#3 (256KB) + L1d L#3 (32KB) + Core L#3
L2 L#4 (256KB) + L1d L#4 (32KB) + Core L#4
L2 L#5 (256KB) + L1d L#5 (32KB) + Core L#5
L2 L#6 (256KB) + L1d L#6 (32KB) + Core L#6
L2 L#7 (256KB) + L1d L#7 (32KB) + Core L#7
NUMANode L#1 (P#1 64GB) + Socket L#1 + L3 L#1 (20MB)
L2 L#8 (256KB) + L1d L#8 (32KB) + Core L#8
L2 L#9 (256KB) + L1d L#9 (32KB) + Core L#9
L2 L#10 (256KB) + L1d L#10 (32KB) + Core L#10
L2 L#11 (256KB) + L1d L#11 (32KB) + Core L#11
L2 L#12 (256KB) + L1d L#12 (32KB) + Core L#12
L2 L#13 (256KB) + L1d L#13 (32KB) + Core L#13
L2 L#14 (256KB) + L1d L#14 (32KB) + Core L#14
L2 L#15 (256KB) + L1d L#15 (32KB) + Core L#15
Note that the speedup gets as high as 4, meaning the memory is fast enough to
fully serve at least four cores. But the speedup jumps all over the place as
the number of cores goes from 1 to 16. I am guessing that is because the MPI
processes are not being mapped to cores well. So I run with the additional
MPICH mpiexec options -bind-to socket -map-by hwthread and get
np speedup
1 1.0
2 2.26
3 2.79
4 2.93
5 2.99
6 3.0
7 3.01
8 2.99
9 2.81
10 2.81
11 2.9
12 2.94
13 2.94
14 2.94
15 2.93
16 2.93
Estimation of possible speedup of MPI programs based on Streams benchmark.
Then I run with just -bind-to socket and get much better numbers:
np speedup
1 1.0
2 2.41
3 3.36
4 4.45
5 4.51
6 5.45
7 5.07
8 5.81
9 5.27
10 5.93
11 5.42
12 5.95
13 5.49
14 5.94
15 5.56
16 5.88
Using this option I get roughly a speedup of 6.
See
http://wiki.mpich.org/mpich/index.php/Using_the_Hydra_Process_Manager#Process-core_Binding
for more information on these options.
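For concreteness, the two bound runs above correspond to invocations roughly
like the following (a sketch only: "stream_mpi" is a placeholder for whatever
MPI streams executable is being timed, and 16 is just this machine's core
count):

# bind each rank to a socket and map ranks by hardware thread (first bound run)
mpiexec -n 16 -bind-to socket -map-by hwthread ./stream_mpi

# bind each rank to a socket only (the run that reached a speedup of roughly 6)
mpiexec -n 16 -bind-to socket ./stream_mpi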
Barry
On Sep 16, 2014, at 10:08 AM, Katy Ghantous <[email protected]> wrote:
> Thank you! This has been extremely useful in figuring out a plan of action.
>
>
> On Mon, Sep 15, 2014 at 9:08 PM, Barry Smith <[email protected]> wrote:
>
> Based on the streams speedups below it looks like a single core can utilize
> roughly 1/2 of the memory bandwidth, leaving all the other cores only 1/2 of
> the bandwidth to utilize, so you can only expect at best a speedup of roughly
> 2 on this machine with traditional PETSc sparse solvers.
>
> To add insult to injury it appears that the threads are not being assigned
> to physical cores very well either. Under the best circumstances on this
> system one would like to see a speedup of about 2 when running with two
> processes, but it actually delivers only 1.23, and a speedup of 2 only occurs
> with 5 processes. I attribute this to the MPI implementation or the OS not
> assigning the second MPI process to the “best” core for memory bandwidth.
> Likely it should assign the second MPI process to the second CPU, but instead
> it assigns it to the first CPU as well, and only when it gets to the 5th MPI
> process does the second CPU get utilized.
>
> You can look at the documentation for your MPI’s process affinity to see
> if you can force the 2nd MPI process onto the second CPU.
>
> Barry
>
>
> np speedup
> 1 1.0
> 2 1.23
> 3 1.3
> 4 1.75
> 5 2.18
> 6 1.22
> 7 2.3
> 8 1.22
> 9 2.01
> 10 1.19
> 11 1.93
> 12 1.93
> 13 1.73
> 14 2.17
> 15 1.99
> 16 2.08
> 17 2.16
> 18 1.47
> 19 1.95
> 20 2.09
> 21 1.9
> 22 1.96
> 23 1.92
> 24 2.02
> 25 1.96
> 26 1.89
> 27 1.93
> 28 1.97
> 29 1.96
> 30 1.93
> 31 2.16
> 32 2.12
> Estimation of possible speedup of MPI programs based on Streams benchmark.
>
> On Sep 15, 2014, at 1:42 PM, Katy Ghantous <[email protected]> wrote:
>
> > Matt, thanks! I will look into that and find other ways to make the
> > computation faster.
> >
> > Barry, the benchmark reports a speedup of up to 2, but says 1 node at the
> > end. Either way I was expecting a higher speedup. Is 2 the limit for two
> > CPUs despite the multiple cores?
> >
> > Please let me know if the attached file is what you are asking for.
> > Thank you!
> >
> >
> > On Mon, Sep 15, 2014 at 8:23 PM, Barry Smith <[email protected]> wrote:
> >
> > Please send the output from running
> >
> > make streams NPMAX=32
> >
> > in the PETSc root directory.
> >
> >
> > Barry
> >
> > My guess is that it reports “one node” just because it uses the
> > “hostname” to distinguish nodes; though your machine has two CPUs, from
> > the point of view of the OS it has only a single hostname and hence it
> > reports just one “node”.
> >
> >
> > On Sep 15, 2014, at 12:45 PM, Katy Ghantous <[email protected]> wrote:
> >
> > > Hi,
> > > I am using DMDA to run TS in parallel to solve a set of N equations. I
> > > am using DMDAGetCorners in the RHSFunction, with the stencil width set
> > > to 2, to solve a set of coupled ODEs on 30 cores.
> > > The machine has 32 cores (2 physical CPUs with 2x8 cores each, at a
> > > speed of 3.4 GHz per core).
> > > However, mpiexec with more than one core shows no speedup.
> > > Also, at the configure/test stage for PETSc on that machine, there was
> > > no speedup and it reported only one node.
> > > Is there something wrong with how I configured PETSc, or is the approach
> > > inappropriate for the machine?
> > > I am not sure what files (or sections of the code) you would need to be
> > > able to answer my question.
> > >
> > > Thank you!
> >
> >
> > <scaling.log>