Jie,

First I want to say that it is so much easier and more pleasant to work with PETScWrappers in deal.II than writing PETSc code in C. Developing finite element code based on deal.II is 1000 times more efficient than writing it from scratch!

Thank you for the kind words!


Recently I parallelized a time-dependent linear elasticity code with PETScWrappers, based on step-40, and tested its performance on my desktop. But I am not sure whether the results are "normal".

The test case is the bending of a 3D cantilever beam. There are 262144 active cells and 839619 DoFs. I ran 2 time steps, i.e., two assembly-and-solve cycles, on 1, 2, 4, and 8 processors, respectively. The wall times look like this:

         | assemble (s) | solve (s) | total (s)
---------+--------------+-----------+----------
n = 1    |    274.0     |   44.44   |   362.0
n = 2    |    140.5     |   27.54   |   192.0
n = 4    |     75.72    |   17.12   |   106.3
n = 8    |     64.74    |   16.62   |    92.82

The speedup from n = 1 up to n = 4 is quite obvious, but from n = 4 to n = 8 it is insignificant; in fact, running with 8 ranks is sometimes even slower than running with 4. My desktop has one Intel i7-6700K CPU, which has 4 cores but 8 threads, and 16 GB of memory. I do not quite understand the difference between a "thread" and a "rank". Should I expect the performance to scale up to 4 or to 8 MPI ranks?

You can't expect to gain a factor of 2 when going from 4 to 8 MPI ranks on this processor. That's because the i7-6700K has only 4 physical cores, see
  https://en.wikipedia.org/wiki/Intel_Core#Core_i7
which means that there are four processing units on this chip. But each of them presents itself as two "virtual cores", i.e., it can execute two threads at the same time, although it really only has the resources for one instruction at a time (roughly speaking; this is not precise). This helps because, in reality, instructions often sit idle waiting for data to arrive from memory, and during those stalls the hardware can work on an instruction from the other thread. In your case, this improves performance by 10-15%, but ultimately you are still limited by the fact that your processor only has four units to do floating point addition, four units to do floating point multiplication, etc. -- because it really only has 4 cores.


Another general question: in distributed parallel applications, I copy between non-ghosted and ghosted vectors via temporary objects all the time. For example, I use a non-ghosted vector to store my solution, but have to copy it to a ghosted vector when I output results or refine the mesh. Conversely, if I used a ghosted vector to store my solution, I would have to copy it to a non-ghosted vector in order to manipulate it with PETScWrappers::VectorBase::add (for example, to subtract time discretization terms from it).
I want to ask: is this copy operation expensive? Is there a way to avoid it?

It's almost certainly not expensive enough for you to worry about. It's significantly cheaper to copy a vector this way than to do one matrix-vector multiplication, for example.
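For reference, the copy in question is just an assignment; this sketch follows the pattern step-40 uses (the names locally_owned_dofs, locally_relevant_dofs, and mpi_communicator are placeholders for whatever your program already defines):

```cpp
// Sketch, assuming the IndexSets and communicator already exist as in step-40.
PETScWrappers::MPI::Vector solution;          // non-ghosted: used for solving
PETScWrappers::MPI::Vector ghosted_solution;  // ghosted: used for output/refinement

solution.reinit(locally_owned_dofs, mpi_communicator);
ghosted_solution.reinit(locally_owned_dofs, locally_relevant_dofs,
                        mpi_communicator);

// ... solve into 'solution' ...

// The copy: each process receives just its ghost entries from their owners.
// This is one nearest-neighbor communication, much cheaper than a mat-vec.
ghosted_solution = solution;
```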

Best
 W.

--
------------------------------------------------------------------------
Wolfgang Bangerth          email:                 [email protected]
                           www: http://www.math.colostate.edu/~bangerth/

--
The deal.II project is located at http://www.dealii.org/
For mailing list/forum options, see 
https://groups.google.com/d/forum/dealii?hl=en
--- You received this message because you are subscribed to the Google Groups "deal.II User Group" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
For more options, visit https://groups.google.com/d/optout.
