Jie,
First I want to say: it is so much easier and more pleasant to work with
PETScWrappers in deal.II than writing PETSc code in C. Developing finite
element code based on deal.II is 1000 times more efficient than writing
from scratch!
Thank you for the kind words!
Recently I parallelized a time-dependent linear elasticity code with
PETScWrappers, based on step-40. I tested its performance on my desktop.
But I am not sure if the result is "normal".
The test case is the bending of a 3D cantilever beam. There are 262144
active cells and 839619 DoFs. I ran 2 time steps, i.e., two
assembly-and-solve cycles, on 1, 2, 4, and 8 processors, respectively.
The wall times look like this:
      | assemble (s) | solve (s) | total (s)
------+--------------+-----------+----------
n = 1 |       274.0  |     44.44 |    362.0
n = 2 |       140.50 |     27.54 |    192.0
n = 4 |        75.72 |     17.12 |    106.3
n = 8 |        64.74 |     16.62 |     92.82
The speedup from n = 1 up to n = 4 is quite obvious, but from n = 4 to
n = 8 it is insignificant; in fact, running with 8 ranks is sometimes
even slower than running with 4 ranks. My desktop has one Intel
i7-6700K CPU, which has 4 cores but 8 threads, and 16 GB of memory. I do
not quite understand the difference between a "thread" and a "rank".
Should I expect the performance to scale up to 4 or to 8 MPI ranks?
You can't expect to gain a factor of 2 when going from 4 to 8 MPI ranks
on this processor. That's because the i7-6700K has only 4 real cores, see
https://en.wikipedia.org/wiki/Intel_Core#Core_i7
which means that there are four processing units on this chip. But, each
of them presents itself as two "virtual cores", i.e., it can execute two
threads at the same time, but it really only has the resources for one
instruction at a time (sort of, not speaking precisely here). This helps
because in reality instructions often sit idle waiting for data to
arrive from memory, and in this case the physical infrastructure can
work on an instruction from the other thread. In your case, this
improves performance by 10-15%, but ultimately, you are still limited by
the fact that your processor only has four units to do floating point
addition, four units to do floating point multiplication, etc -- because
it really only has 4 cores.
Another general question: in distributed parallel applications, I copy
between non-ghosted and ghosted vectors via temporary objects all the
time. For example, I use a non-ghosted vector to store my solution, but
have to copy it to a ghosted vector when I output results or refine the
mesh. On the other hand, if I used a ghosted vector to store my
solution, I would have to copy it to a non-ghosted vector whenever I
manipulate it with PETScWrappers::VectorBase::add (for example, to
subtract time discretization terms from it).
I want to ask: is this copy operation expensive? Is there a way to avoid it?
It's almost certainly not expensive enough for you to worry about. It's
significantly cheaper to copy a vector this way than to do one
matrix-vector multiplication, for example.
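For what it's worth, the usual pattern is sketched below (not a complete
program: the locally_owned_dofs/locally_relevant_dofs index sets and
mpi_communicator are assumed to be set up as in step-40, and the header
name may differ between deal.II versions). Assigning a non-ghosted
vector to a ghosted one is what triggers the communication that fills
in the ghost entries:

```cpp
#include <deal.II/base/index_set.h>
#include <deal.II/lac/petsc_vector.h>  // petsc_parallel_vector.h in older deal.II

// Sketch only: the index sets and communicator come from the
// surrounding program, as in step-40.
void copy_example(const dealii::IndexSet &locally_owned_dofs,
                  const dealii::IndexSet &locally_relevant_dofs,
                  MPI_Comm                mpi_communicator)
{
  // Non-ghosted vector: locally owned entries only. This is the one
  // you assemble into, hand to the solver, and call add() on.
  dealii::PETScWrappers::MPI::Vector solution(locally_owned_dofs,
                                              mpi_communicator);

  // Ghosted vector: owned plus ghost entries. This is the one you
  // read from when outputting results or refining the mesh.
  dealii::PETScWrappers::MPI::Vector ghosted_solution(
    locally_owned_dofs, locally_relevant_dofs, mpi_communicator);

  // Copying non-ghosted -> ghosted updates the ghost entries. This is
  // one round of neighbor communication -- cheaper than a single
  // matrix-vector product, as noted above.
  ghosted_solution = solution;
}
```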
Best
W.
--
------------------------------------------------------------------------
Wolfgang Bangerth email: [email protected]
www: http://www.math.colostate.edu/~bangerth/
--
The deal.II project is located at http://www.dealii.org/
For mailing list/forum options, see
https://groups.google.com/d/forum/dealii?hl=en