Dear David,
Without knowing exactly which components of deal.II you are using, the
first thing I would check is whether you are using a multithreaded BLAS
or multithreading within deal.II. You could try setting
export DEAL_II_NUM_THREADS=1
export OMP_NUM_THREADS=1
or disable multithreading when compiling deal.II (and use
serial BLAS/LAPACK libraries) and check again. The behavior you are
describing looks like a combination of some parts of the solver seeing a
good speedup and other parts seeing little to none.
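As a rough illustration of that picture, here is a minimal Python sketch that plugs the 1-core and 48-core solver times from your message (821 s and 427 s) into Amdahl's law to estimate how much of the run behaves as if it does not scale. Using Amdahl's law is an assumption on my side, not a measurement: it lumps every cause of poor scaling (serial code, oversubscribed threads, saturated memory bandwidth) into a single number, and the function name is made up for this example.

```python
# Solver timings quoted in the original message (single node).
t1 = 821.0    # seconds with 1 MPI process
t48 = 427.0   # seconds with 48 MPI processes
p = 48


def amdahl_serial_fraction(t_one, t_p, n_procs):
    """Solve t_p / t_one = s + (1 - s) / n_procs for the non-scaling fraction s.

    This is the plain Amdahl model; it is only a rough way to summarize
    two timings and does not identify the cause of the poor scaling.
    """
    ratio = t_p / t_one
    return (ratio - 1.0 / n_procs) / (1.0 - 1.0 / n_procs)


s = amdahl_serial_fraction(t1, t48, p)
print(f"estimated non-scaling fraction: {s:.2f}")
```

With these two timings the estimate comes out at roughly one half, which is only a hint about the magnitude and says nothing about the cause.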
My second suspicion would be memory bandwidth limitations within the
node, but even if you are fully memory bound you should see a speedup of
roughly 10-12x when going from 1 to 48 cores on a node (or a bit less
if the processor has full turbo frequency turned on and thus clocks
higher with 1 core loaded than with all 24 cores per socket loaded),
while you observe much less than that.
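To put numbers on this, the following Python sketch computes speedup and parallel efficiency from the timings in your message. The variable and function names are only for illustration; the data are copied from your table.

```python
# (MPI processes, nodes, solver time in seconds), copied from the quoted message.
timings = [
    (1, 1, 821.0),
    (2, 1, 608.0),
    (4, 1, 525.0),
    (8, 1, 482.0),
    (24, 1, 435.0),
    (48, 1, 427.0),
    (96, 2, 211.0),
    (192, 4, 109.0),
]


def scaling_table(rows):
    """Return (processes, nodes, speedup, efficiency) relative to the first row."""
    p_ref, _, t_ref = rows[0]
    table = []
    for p, nodes, t in rows:
        speedup = t_ref / t
        efficiency = speedup / (p / p_ref)
        table.append((p, nodes, speedup, efficiency))
    return table


# Reference: 1 process on 1 node.
table = scaling_table(timings)
for p, nodes, speedup, efficiency in table:
    print(f"{p:4d} procs, {nodes} node(s): "
          f"speedup {speedup:5.2f}, efficiency {efficiency:6.1%}")

# Same calculation, but taking the full node (48 processes) as the reference.
node_table = scaling_table(timings[5:])
```

On these numbers the 48-core speedup within one node is a bit under 2, far from 10-12, while going from 1 to 2 to 4 nodes roughly halves the time at each step.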
Best,
Martin
On 22.01.21 05:52, David Montiel Taboada wrote:
Hello,
I am using the PRISMS-PF framework (which is based on deal.II) on the
Skylake (skx) nodes (with 48 processors each) of the Stampede2 cluster.
I recently ran a series of strong scaling tests and noticed that the
intra-node performance (i.e. 1 node, 1-48 processors) scales poorly,
specifically the solver part. However, once I get past one node, the
scaling is closer to ideal (taking 1 node as a reference).
Here is the behavior I got (solver part only; in every case I used as
many MPI processes as processors):
Processors, Nodes, Solver time (s)
1, 1, 821
2, 1, 608
4, 1, 525
8, 1, 482
24, 1, 435
48, 1, 427
96, 2, 211
192, 4, 109
Does anyone know what may be the problem?
The code uses the matrix-free method and requires only the p4est and
MPI libraries, which I included as dependencies when I ran cmake to
install deal.II.
Here is the line I used:
cmake -DDEAL_II_WITH_MPI=ON -DDEAL_II_WITH_P4EST=ON
-DCMAKE_INSTALL_PREFIX=$WORK/dealii_install $WORK/dealii-9.2.0
Am I perhaps missing a flag?
By the way, the home nodes (which I used to install deal.II and
compile my code) are also Skylake, so I would expect my code to
perform well.
I do not observe the same issue elsewhere (e.g., on my local machine
or on the KNL nodes on Cori).
Any help figuring out this issue is appreciated.
Best,
David
--
The deal.II project is located at http://www.dealii.org/
For mailing list/forum options, see
https://groups.google.com/d/forum/dealii?hl=en
---
You received this message because you are subscribed to the Google
Groups "deal.II User Group" group.
To unsubscribe from this group and stop receiving emails from it, send
an email to [email protected].
To view this discussion on the web visit
https://groups.google.com/d/msgid/dealii/8bd2837d-c284-4f1e-a194-ad4a56835cb6n%40googlegroups.com.