Also, cache effects. 20M DoFs on one core/thread is huge; the 37x speedup on assembly at 32 processes is probably a cache effect.
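(At 32 ranks each rank holds roughly 20M / 32 ≈ 625K DoFs, so the per-rank working set shrinks toward cache sizes.) As a minimal sketch in plain C, with the speedup numbers copied from the table quoted below (one row per processor count, first configuration only), the parallel efficiency speedup/nproc makes the superlinearity explicit; anything printed above 100% is superlinear:

#include <stdio.h>

/* Parallel efficiency = speedup / nproc, using the speedups quoted in the
   thread below (first configuration for each processor count). */
int main(void)
{
  const struct { int np; double assembly, solving; } runs[] = {
    { 1,  1.0,       1.0       },
    { 2,  1.995246,  1.898756  },
    { 4,  4.658187,  6.004539  },
    { 8,  9.380985, 16.581135  },
    {16, 18.575953, 34.483058  },
    {32, 37.140626, 70.175879  },
  };
  for (size_t i = 0; i < sizeof runs / sizeof runs[0]; i++)
    printf("np = %2d  assembly eff = %6.1f%%  solving eff = %6.1f%%\n",
           runs[i].np, 100.0 * runs[i].assembly / runs[i].np,
           100.0 * runs[i].solving / runs[i].np);
  return 0;
}

Assembly stays near 100-116%, while solving climbs past 200%, which is why the iteration count (and preconditioner cost) question below matters in addition to caching.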
On Mon, Jul 11, 2022 at 1:09 PM Matthew Knepley <[email protected]> wrote:

> On Mon, Jul 11, 2022 at 10:34 AM Ce Qin <[email protected]> wrote:
>
>> Dear all,
>>
>> I want to analyze the strong scaling of our in-house FEM code.
>> The test problem has about 20M DoFs. I ran the problem using
>> various settings. The speedups for the assembly and solving
>> procedures are as follows:
>>
>> NProcessors  NNodes  CoresPerNode    Assembly     Solving
>>           1       1             1    1.0          1.0
>>           2       1             2    1.995246     1.898756
>>                   2             1    2.121401     2.436149
>>           4       1             4    4.658187     6.004539
>>                   2             2    4.666667     5.942085
>>                   4             1    4.65272      6.101214
>>           8       2             4    9.380985    16.581135
>>                   4             2    9.308575    17.258891
>>                   8             1    9.314449    17.380612
>>          16       2             8   18.575953    34.483058
>>                   4             4   18.745129    34.854409
>>                   8             2   18.828393    36.45509
>>          32       4             8   37.140626    70.175879
>>                   8             4   37.166421    71.533865
>>
>> I don't quite understand this result. Why can we achieve a speedup of
>> about 70+ using 32 processors? Could you please help me explain this?
>>
>
> We need more data. I would start with the number of iterates that the
> solver executes. I suspect this is changing. However, it can be more
> complicated. For example, a Block-Jacobi preconditioner gets cheaper as
> the number of subdomains increases. Thus we need to know exactly what
> the solver is doing.
>
>   Thanks,
>
>      Matt
>
>> Thank you in advance.
>>
>> Best,
>> Ce
>
> --
> What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to which their
> experiments lead.
> -- Norbert Wiener
>
> https://www.cse.buffalo.edu/~knepley/
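If the linear solve goes through PETSc's KSP, one way to collect the data Matt asks for (iteration count and convergence reason per run) is sketched below. This is only a minimal, self-contained stand-in: the 1D Laplacian here replaces the real FEM operator, and PetscCall() assumes PETSc >= 3.17.

#include <petscksp.h>

int main(int argc, char **argv)
{
  Mat                A;
  Vec                x, b;
  KSP                ksp;
  PetscInt           i, n = 100, its, Istart, Iend;
  KSPConvergedReason reason;

  PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));

  /* Stand-in operator: a 1D Laplacian instead of the real FEM matrix. */
  PetscCall(MatCreate(PETSC_COMM_WORLD, &A));
  PetscCall(MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n));
  PetscCall(MatSetFromOptions(A));
  PetscCall(MatSetUp(A));
  PetscCall(MatGetOwnershipRange(A, &Istart, &Iend));
  for (i = Istart; i < Iend; i++) {
    if (i > 0)     PetscCall(MatSetValue(A, i, i - 1, -1.0, INSERT_VALUES));
    if (i < n - 1) PetscCall(MatSetValue(A, i, i + 1, -1.0, INSERT_VALUES));
    PetscCall(MatSetValue(A, i, i, 2.0, INSERT_VALUES));
  }
  PetscCall(MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY));
  PetscCall(MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY));

  PetscCall(MatCreateVecs(A, &x, &b));
  PetscCall(VecSet(b, 1.0));

  PetscCall(KSPCreate(PETSC_COMM_WORLD, &ksp));
  PetscCall(KSPSetOperators(ksp, A, A));
  PetscCall(KSPSetFromOptions(ksp));  /* picks up -ksp_type, -pc_type, ... */
  PetscCall(KSPSolve(ksp, b, x));

  /* The numbers Matt asks about: do they change with the rank count? */
  PetscCall(KSPGetIterationNumber(ksp, &its));
  PetscCall(KSPGetConvergedReason(ksp, &reason));
  PetscCall(PetscPrintf(PETSC_COMM_WORLD, "iterations %" PetscInt_FMT ", reason %d\n",
                        its, (int)reason));

  PetscCall(KSPDestroy(&ksp));
  PetscCall(VecDestroy(&x));
  PetscCall(VecDestroy(&b));
  PetscCall(MatDestroy(&A));
  PetscCall(PetscFinalize());
  return 0;
}

The same information is available without code changes by running each processor count with -ksp_monitor -ksp_converged_reason, and -ksp_view shows exactly what the preconditioner is doing (e.g. how many Block-Jacobi subdomains), which addresses Matt's second point; -log_view gives the timing breakdown for assembly vs. solve.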
