Hi Wolfgang,

Thank you so much for the clear answer!
Jie

On Wednesday, December 6, 2017 at 3:14:46 PM UTC-5, Wolfgang Bangerth wrote:
>
> Jie,
>
> > First I want to say, it is so much easier and more pleasant to work
> > with PETScWrappers in deal.II than writing PETSc code in C. Developing
> > finite element code based on deal.II is 1000 times more efficient than
> > writing from scratch!
>
> Thank you for the kind words!
>
> > Recently I parallelized a time-dependent linear elasticity code with
> > PETScWrappers, based on step-40. I tested its performance on my
> > desktop, but I am not sure if the result is "normal".
> >
> > The test case is the bending of a 3D cantilever beam. There are 262144
> > active cells and 839619 dofs. I ran 2 time steps, which involve two
> > assembly and solve calls, on 1, 2, 4, and 8 processors, respectively.
> > The wall time looks like this:
> >
> >       | assemble (s) | solve (s) | total (s)
> > ---------------------------------------------
> > n = 1 |        274.0 |     44.44 |     362.0
> > n = 2 |       140.50 |     27.54 |     192.0
> > n = 4 |        75.72 |     17.12 |     106.3
> > n = 8 |        64.74 |     16.62 |     92.82
> >
> > The speedup from n = 1 up to n = 4 is quite obvious, but from n = 4
> > to n = 8 it is insignificant. Actually, sometimes running with 8 ranks
> > is even slower than running with 4 ranks. My desktop has one Intel
> > i7-6700K CPU, which has 4 cores but 8 threads, and 16 GB of memory. I
> > do not quite understand the difference between "thread" and "rank".
> > Should I expect the performance to scale up to 4 or to 8 MPI ranks?
>
> You can't expect to gain a factor of 2 when going from 4 to 8 MPI ranks
> on this processor. That's because the i7-6700K has only 4 real cores, see
>    https://en.wikipedia.org/wiki/Intel_Core#Core_i7
> which means that there are four processing units on this chip.
> But each of them presents itself as two "virtual cores", i.e., it can
> execute two threads at the same time, but it really only has the
> resources for one instruction at a time (sort of; not speaking
> precisely here). This helps because in reality instructions often sit
> idle waiting for data to arrive from memory, and in this case the
> physical infrastructure can work on an instruction from the other
> thread. In your case, this improves performance by 10-15%, but
> ultimately you are still limited by the fact that your processor only
> has four units to do floating point addition, four units to do floating
> point multiplication, etc. -- because it really only has 4 cores.
>
> > Another general question is this: in distributed parallel
> > applications, I use temporary objects to copy a non-ghosted vector to
> > a ghosted vector, or vice versa, all the time. For example, I use a
> > non-ghosted vector to store my solution, but have to copy it to a
> > ghosted vector when I output results or refine the mesh. On the other
> > hand, if I use a ghosted vector to store my solution, I have to copy
> > it to a non-ghosted vector when I manipulate it with
> > PETScWrappers::VectorBase::add (for example, subtracting time
> > discretization terms from it). I want to ask: is this copy operation
> > expensive? Is there a way to avoid it?
>
> It's almost certainly not expensive enough for you to worry about. It
> is significantly cheaper to copy a vector this way than to do one
> matrix-vector multiplication, for example.
>
> Best
>  W.
>
> --
> ------------------------------------------------------------------------
> Wolfgang Bangerth          email: [email protected]
>                            www:   http://www.math.colostate.edu/~bangerth/

--
The deal.II project is located at http://www.dealii.org/
For mailing list/forum options, see https://groups.google.com/d/forum/dealii?hl=en
---
You received this message because you are subscribed to the Google Groups "deal.II User Group" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
For more options, visit https://groups.google.com/d/optout.
