Jie,

First I want to say that it is so much easier and more pleasant to work with PETScWrappers in deal.II than writing PETSc code in C. Developing finite element code based on deal.II is 1000 times more efficient than writing it from scratch!

Thank you for the kind words!


Recently I parallelized a time-dependent linear elasticity code with PETScWrappers, based on step-40, and tested its performance on my desktop. But I am not sure whether the results are "normal".

The test case is the bending of a 3D cantilever beam. There are 262144 active cells and 839619 DoFs. I ran 2 time steps, i.e., two assembly-and-solve cycles, on 1, 2, 4, and 8 processors, respectively. The wall times look like this:

         | assemble (s) | solve (s) | total (s)
---------+--------------+-----------+----------
n = 1    |    274.0     |   44.44   |   362.0
n = 2    |    140.5     |   27.54   |   192.0
n = 4    |     75.72    |   17.12   |   106.3
n = 8    |     64.74    |   16.62   |    92.82

The speedup from n = 1 up to n = 4 is quite obvious, but from n = 4 to n = 8 it is insignificant; in fact, running with 8 ranks is sometimes even slower than running with 4. My desktop has one Intel i7-6700K CPU, which has 4 cores but 8 threads, and 16 GB of memory. I do not quite understand the difference between a "thread" and a "rank". Should I expect the performance to scale up to 4 or to 8 MPI ranks?

You can't expect to gain a factor of 2 when going from 4 to 8 MPI ranks on this processor. That's because the i7-6700K has only 4 physical cores, see
  https://en.wikipedia.org/wiki/Intel_Core#Core_i7
which means that there are four processing units on this chip. But each of them presents itself as two "virtual cores", i.e., it can execute two threads at the same time, although it really only has the resources for one instruction at a time (roughly speaking; this is not precise). This helps because, in reality, instructions often sit idle waiting for data to arrive from memory, and during those stalls the hardware can work on an instruction from the other thread. In your case, this improves performance by 10-15%, but ultimately you are still limited by the fact that your processor only has four units to do floating point addition, four units to do floating point multiplication, etc. -- because it really only has 4 cores.


Another general question: in distributed parallel applications, I copy between non-ghosted and ghosted vectors via temporary objects all the time. For example, I use a non-ghosted vector to store my solution, but have to copy it to a ghosted vector when I output results or refine the mesh. Conversely, if I used a ghosted vector to store my solution, I would have to copy it to a non-ghosted vector in order to manipulate it with PETScWrappers::VectorBase::add (for example, to subtract time discretization terms from it).
I want to ask: is this copy operation expensive? Is there a way to avoid it?

It's almost certainly not expensive enough for you to worry about. It's significantly cheaper to copy a vector this way than to do one matrix-vector multiplication, for example.
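For reference, the copy in question is just an assignment; this sketch follows the pattern step-40 uses (the names locally_owned_dofs, locally_relevant_dofs, and mpi_communicator are placeholders for whatever your program already defines):

```cpp
// Sketch, assuming the IndexSets and communicator already exist as in step-40.
PETScWrappers::MPI::Vector solution;          // non-ghosted: used for solving
PETScWrappers::MPI::Vector ghosted_solution;  // ghosted: used for output/refinement

solution.reinit(locally_owned_dofs, mpi_communicator);
ghosted_solution.reinit(locally_owned_dofs, locally_relevant_dofs,
                        mpi_communicator);

// ... solve into 'solution' ...

// The copy: each process receives just its ghost entries from their owners.
// This is one nearest-neighbor communication, much cheaper than a mat-vec.
ghosted_solution = solution;
```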

Best
 W.

--
------------------------------------------------------------------------
Wolfgang Bangerth          email:                 [email protected]
                           www: http://www.math.colostate.edu/~bangerth/

--
The deal.II project is located at http://www.dealii.org/
For mailing list/forum options, see 
https://groups.google.com/d/forum/dealii?hl=en
--- You received this message because you are subscribed to the Google Groups "deal.II User Group" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
For more options, visit https://groups.google.com/d/optout.
