>
>
> I guess you are interested in the performance of the new algorithms on
> small problems. I will try to test a petsc example such as
> mat/examples/tests/ex96.c.
>
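(As a quick reference for what an ex96-style test exercises: the triple
product through the public MatPtAP interface. A minimal sketch, assuming
assembled MPIAIJ matrices; the function name and the fill estimate of 2.0
are illustrative, not taken from ex96 itself, and error checking is
omitted:)

  #include <petscmat.h>

  /* form the coarse operator C = P^T * A * P */
  PetscErrorCode FormCoarseOperator(Mat A, Mat P, Mat *C)
  {
    /* MAT_INITIAL_MATRIX allocates C; 2.0 is a guess for the fill ratio */
    MatPtAP(A, P, MAT_INITIAL_MATRIX, 2.0, C);
    return 0;
  }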
It's not a big deal. And the fact that they are similar on one node tells
us the kernels are similar.

>
>> And are you sure the numerics are the same with and without hypre? Hypre
>> is 15x slower. Any ideas what is going on?
>>
>
> Hypre performs pretty well when the number of processor cores is small (a
> couple of hundred). I guess the issue is related to how they handle the
> communication.
>
>
>> It might be interesting to scale this test down to a node to see if this
>> is from communication.
>>

I wonder if their symbolic setup is getting called every time. You do 50
solves, it looks like, and that should be enough to amortize a one-time
setup cost. Does PETSc do any clever scalability tricks? You just pack and
send point-to-point messages, I would think, but maybe Hypre is doing
something bad. I have seen Hypre scale out to large machines, but on
synthetic problems. So this is a realistic problem.

Can you run with -info and grep on GAMG, and send me the output (~20
lines)? You will be able to see info about each level, like the number of
equations and the average nnz/row.

> Hypre performs similarly to PETSc on a single compute node.
>
>
> Fande,
>
>
>> Again, nice work,
>> Mark
>>
>>
>> On Thu, Apr 11, 2019 at 7:08 PM Fande Kong <fdkong...@gmail.com> wrote:
>>
>>> Hi Developers,
>>>
>>> I just want to share some good news. It is known that PETSc's
>>> ptap-scalable algorithm takes too much memory for some applications
>>> because it needs to build intermediate data structures. Following Mark's
>>> suggestions, I implemented the all-at-once algorithm, which does not
>>> cache any intermediate data.
>>>
>>> I did some comparisons: the new implementation is actually scalable in
>>> terms of both memory usage and compute time, even though it is still
>>> slower than "ptap-scalable". Some memory profiling results are in the
>>> attachments. The new all-at-once implementation uses a similar amount
>>> of memory to hypre, but it is way faster than hypre.
>>>
>>> For example, for a problem with 14,893,346,880 unknowns using 10,000
>>> processor cores, here are the timing results:
>>>
>>> Hypre algorithm:
>>>
>>> MatPtAP          50 1.0 3.5353e+03 1.0 0.00e+00 0.0 1.9e+07 3.3e+04
>>> 6.0e+02 33  0  1  0 17  33  0  1  0 17      0
>>> MatPtAPSymbolic  50 1.0 2.3969e-0213.0 0.00e+00 0.0 0.0e+00 0.0e+00
>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0      0
>>> MatPtAPNumeric   50 1.0 3.5353e+03 1.0 0.00e+00 0.0 1.9e+07 3.3e+04
>>> 6.0e+02 33  0  1  0 17  33  0  1  0 17      0
>>>
>>> PETSc scalable PtAP:
>>>
>>> MatPtAP          50 1.0 1.1453e+02 1.0 2.07e+09 3.8 6.6e+07 2.0e+05
>>> 7.5e+02  2  1  4  6 20   2  1  4  6 20 129418
>>> MatPtAPSymbolic  50 1.0 5.1562e+01 1.0 0.00e+00 0.0 4.1e+07 1.4e+05
>>> 3.5e+02  1  0  3  3  9   1  0  3  3  9      0
>>> MatPtAPNumeric   50 1.0 6.3072e+01 1.0 2.07e+09 3.8 2.4e+07 3.1e+05
>>> 4.0e+02  1  1  2  4 11   1  1  2  4 11 235011
>>>
>>> New implementation of the all-at-once algorithm:
>>>
>>> MatPtAP          50 1.0 2.2153e+02 1.0 0.00e+00 0.0 1.0e+08 1.4e+05
>>> 6.0e+02  4  0  7  7 17   4  0  7  7 17      0
>>> MatPtAPSymbolic  50 1.0 1.1055e+02 1.0 0.00e+00 0.0 7.9e+07 1.2e+05
>>> 2.0e+02  2  0  5  4  6   2  0  5  4  6      0
>>> MatPtAPNumeric   50 1.0 1.1102e+02 1.0 0.00e+00 0.0 2.6e+07 2.0e+05
>>> 4.0e+02  2  0  2  3 11   2  0  2  3 11      0
>>>
>>>
>>> You can see that the all-at-once algorithm is a bit slower than
>>> ptap-scalable, but it uses much less memory.
>>>
>>>
>>> Fande
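One detail worth pulling out of the logs above: for hypre, essentially all
of the 3.5e+03 seconds sits in MatPtAPNumeric while MatPtAPSymbolic is
negligible, which is exactly what the question about a repeated symbolic
setup is probing. For reference, a minimal sketch of the two-phase
interface those log entries correspond to (public PETSc calls of this era;
the loop count mirrors the 50 solves in the logs, A and P are assumed
assembled, and error checking is omitted):

  #include <petscmat.h>

  PetscErrorCode RepeatedPtAP(Mat A, Mat P, Mat *C)
  {
    PetscInt i;

    /* one-time symbolic phase: build the sparsity pattern of C = P^T*A*P */
    MatPtAPSymbolic(A, P, 2.0, C);
    for (i = 0; i < 50; i++) {
      /* ... entries of A change between solves ... */
      MatPtAPNumeric(A, P, *C); /* per solve: recompute numerical values only */
    }
    return 0;
  }

If the symbolic work were being redone inside every numeric call, the setup
cost would never amortize, which would be consistent with hypre's numbers
here.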