[new data: I had an error in the old ex54 timings] OK, well, the blocked version of this optimization was a fun ride. I still get about 2% variance on timings, but here are the best out of a few runs of each solver with the new optimized code. I do watch the Activity Monitor, and these are single-core runs on a 4-core machine.
Scalar (ex54):
KSPSolve 1 1.0 1.3231e+00 1.0 1.40e+09 1.0 0.0e+00 0.0e+00 5.0e+00 29 72 0 0 3 33 72 0 0 3 1057

2-vector (ex55):
KSPSolve 1 1.0 2.1459e+00 1.0 3.15e+09 1.0 0.0e+00 0.0e+00 7.0e+00 50 88 0 0 4 100100 0 0100 1468

3-vector (ex56):
KSPSolve 1 1.0 9.3956e-01 1.0 1.52e+09 1.0 0.0e+00 0.0e+00 7.0e+00 26 65 0 0 5 100100 0 0100 1619

and the old version:

Scalar (ex54):
KSPSolve 1 1.0 1.3830e+00 1.0 1.53e+09 1.0 0.0e+00 0.0e+00 5.0e+00 29 74 0 0 3 33 74 0 0 3 1103

2-vector (ex55):
KSPSolve 1 1.0 2.1489e+00 1.0 3.45e+09 1.0 0.0e+00 0.0e+00 7.0e+00 50 89 0 0 4 100100 0 0100 1606

3-vector (ex56):
KSPSolve 1 1.0 9.8354e-01 1.0 1.67e+09 1.0 0.0e+00 0.0e+00 7.0e+00 27 67 0 0 5 100100 0 0100 1700

Mark

On Aug 23, 2013, at 3:21 PM, Barry Smith <[email protected]> wrote:

>
>    Run a bit bigger problem. The results shouldn't be that noisy, unless you are playing some game in another window :-).
>
>    Barry
>
> On Aug 23, 2013, at 11:52 AM, Mark F. Adams <[email protected]> wrote:
>
>> This code looks funny in MatSOR_SeqAIJ:
>>
>>   x[i] = (1-omega)*x[i] + sum*idiag[i];
>>
>> shouldn't this be:
>>
>>   x[i] = (1-omega)*x[i] + omega*sum*idiag[i];
>>
>> and I've done the first optimization (for the non-blocked version) and get this (running each many times because my Mac is a little noisy). Flop rates go down a bit and total flops go down a lot (perhaps I have a flop-counting bug?). 6 old runs, then 7 new ones:
>>
>> 12:40 dummy ~/Codes/petsc/src/ksp/ksp/examples/tutorials$ make -f make-local runex54 | grep MatSOR
>> MatSOR 60 1.0 1.3511e-01 1.0 1.49e+08 1.0 0.0e+00 0.0e+00 0.0e+00 12 34 0 0 0 14 34 0 0 0 1105
>> 12:40 dummy ~/Codes/petsc/src/ksp/ksp/examples/tutorials$ make -f make-local runex54 | grep MatSOR
>> MatSOR 60 1.0 1.3611e-01 1.0 1.49e+08 1.0 0.0e+00 0.0e+00 0.0e+00 12 34 0 0 0 14 34 0 0 0 1097
>> 12:40 dummy ~/Codes/petsc/src/ksp/ksp/examples/tutorials$ make -f make-local runex54 | grep MatSOR
>> MatSOR 60 1.0 1.3464e-01 1.0 1.49e+08 1.0 0.0e+00 0.0e+00 0.0e+00 12 34 0 0 0 14 34 0 0 0 1109
>> 12:40 dummy ~/Codes/petsc/src/ksp/ksp/examples/tutorials$ make -f make-local runex54 | grep MatSOR
>> MatSOR 60 1.0 1.3487e-01 1.0 1.49e+08 1.0 0.0e+00 0.0e+00 0.0e+00 12 34 0 0 0 14 34 0 0 0 1107
>> 12:41 dummy ~/Codes/petsc/src/ksp/ksp/examples/tutorials$ make -f make-local runex54 | grep MatSOR
>> MatSOR 60 1.0 1.3629e-01 1.0 1.49e+08 1.0 0.0e+00 0.0e+00 0.0e+00 12 34 0 0 0 15 34 0 0 0 1095
>> 12:41 dummy ~/Codes/petsc/src/ksp/ksp/examples/tutorials$ make -f make-local runex54 | grep MatSOR
>> MatSOR 60 1.0 1.3492e-01 1.0 1.49e+08 1.0 0.0e+00 0.0e+00 0.0e+00 12 34 0 0 0 14 34 0 0 0 1107
>> 12:41 dummy ~/Codes/petsc/src/ksp/ksp/examples/tutorials$ make -f make-local runex54 | grep MatSOR
>> MatSOR 60 1.0 1.2776e-01 1.0 1.16e+08 1.0 0.0e+00 0.0e+00 0.0e+00 11 29 0 0 0 14 29 0 0 0 910
>> 12:41 madams/sor-opt ~/Codes/petsc/src/ksp/ksp/examples/tutorials$ make -f make-local runex54 | grep MatSOR
>> MatSOR 60 1.0 1.2809e-01 1.0 1.16e+08 1.0 0.0e+00 0.0e+00 0.0e+00 12 29 0 0 0 14 29 0 0 0 908
>> 12:41 madams/sor-opt ~/Codes/petsc/src/ksp/ksp/examples/tutorials$ make -f make-local runex54 | grep MatSOR
>> MatSOR 60 1.0 1.2765e-01 1.0 1.16e+08 1.0 0.0e+00 0.0e+00 0.0e+00 12 29 0 0 0 14 29 0 0 0 911
>> 12:41 madams/sor-opt ~/Codes/petsc/src/ksp/ksp/examples/tutorials$ make -f make-local runex54 | grep MatSOR
>> MatSOR 60 1.0 1.3131e-01 1.0 1.16e+08 1.0 0.0e+00 0.0e+00 0.0e+00 12 29 0 0 0 14 29 0 0 0 886
>> 12:41 madams/sor-opt ~/Codes/petsc/src/ksp/ksp/examples/tutorials$ make -f make-local runex54 | grep MatSOR
>> MatSOR 60 1.0 1.2792e-01 1.0 1.16e+08 1.0 0.0e+00 0.0e+00 0.0e+00 12 29 0 0 0 14 29 0 0 0 909
>> 12:41 madams/sor-opt ~/Codes/petsc/src/ksp/ksp/examples/tutorials$ make -f make-local runex54 | grep MatSOR
>> MatSOR 60 1.0 1.2913e-01 1.0 1.16e+08 1.0 0.0e+00 0.0e+00 0.0e+00 12 29 0 0 0 14 29 0 0 0 901
>> 12:41 madams/sor-opt ~/Codes/petsc/src/ksp/ksp/examples/tutorials$ make -f make-local runex54 | grep MatSOR
>> MatSOR 60 1.0 1.2889e-01 1.0 1.16e+08 1.0 0.0e+00 0.0e+00 0.0e+00 12 29 0 0 0 14 29 0 0 0 902
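A note on the omega question above: the textbook SOR update is x_i <- (1 - omega)*x_i + omega*(b_i - sum_{j != i} a_ij*x_j)/a_ii, so whether that line is right depends on what idiag[] holds. If idiag[i] is just 1/a_ii, the omega factor really is missing; if idiag[] is already scaled by omega, the two forms agree (I have not traced which it is). A minimal dense sketch with omega written out explicitly, for comparison only and not the PETSc code path:

  /* Textbook SOR forward sweep on a dense n x n row-major matrix.
     Reference sketch only: a, b, x, omega are generic names, not the
     MatSOR_SeqAIJ data structures.  Assumes a[i*n+i] != 0. */
  static void sor_forward_sweep(int n, const double *a, const double *b,
                                double *x, double omega)
  {
    for (int i = 0; i < n; i++) {
      double sum = b[i];                       /* b_i */
      for (int j = 0; j < n; j++) {
        if (j != i) sum -= a[i*n + j]*x[j];    /* b_i - sum_{j != i} a_ij x_j */
      }
      x[i] = (1.0 - omega)*x[i] + omega*sum/a[i*n + i];
    }
  }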
>>
>> On Aug 20, 2013, at 6:40 PM, Barry Smith <[email protected]> wrote:
>>
>>>
>>> On Aug 20, 2013, at 4:39 PM, Jed Brown <[email protected]> wrote:
>>>
>>>> Hmm, less clear flops benefit. Half of these are nonzero initial guess, for which triangular storage would be worse
>>>
>>> Is it clear that triangular storage would be particularly worse? I think only a small percentage worse for a multiply. The problem with CSR for triangular solves is that for each new row one needs to jump some amount in the index and double arrays, wasting cache lines and limiting streaming. With a multiply on triangular storage there is no jumping, just streaming two different index and double arrays, and current CPUs have no problem managing 4 streams.
>>>
>>> In other words, CSR is good for multiply, bad for triangular solve; triangular storage is good for triangular solve and pretty good for multiply.
>>>
>>> Barry
>>>
>>>> (unless you add an Eisenstat optimization).
>>>>
>>>> On Aug 20, 2013 4:26 PM, "Mark F. Adams" <[email protected]> wrote:
>>>> Barry,
>>>>
>>>> These are tests using the well load-balanced problems in KSP on my Mac:
>>>>
>>>> 3D Vector (ex56)
>>>>
>>>> 8 proc:
>>>>
>>>> rich/ssor(1)
>>>>> KSPSolve 1 1.0 3.0143e-01 1.0 7.98e+07 1.1 1.0e+04 7.2e+02 8.5e+01 19 26 26 16 8 100100100100100 2047
>>>>
>>>> cheb/jac(2)
>>>>> KSPSolve 1 1.0 2.5836e-01 1.0 6.87e+07 1.1 1.4e+04 7.3e+02 8.8e+01 17 25 27 19 8 100100100100100 2053
>>>>
>>>> 1 proc:
>>>>
>>>> rich/ssor(1)
>>>>> KSPSolve 1 1.0 1.8541e-01 1.0 3.20e+08 1.0 0.0e+00 0.0e+00 0.0e+00 10 22 0 0 0 100100 0 0 0 1727
>>>>
>>>> cheb/jac(2)
>>>>> KSPSolve 1 1.0 2.4841e-01 1.0 4.65e+08 1.0 0.0e+00 0.0e+00 0.0e+00 12 24 0 0 0 100100 0 0 0 1870
>>>>
>>>> 2D Scalar (ex54) 4 procs
>>>>
>>>> cheb/jac(2)
>>>>> KSPSolve 1 1.0 2.3614e-01 1.0 3.51e+07 1.0 2.6e+03 8.7e+02 7.8e+02 91100 98 93 95 100100100100100 592
>>>>
>>>> rich/ssor(1)
>>>>> KSPSolve 1 1.0 2.0144e-01 1.0 2.29e+07 1.0 1.8e+03 8.8e+02 7.1e+02 89100 98 90 95 100100100100100 453
>>>>
>>>> I'm not sure if this matters but I wanted to get off of the funny test problem that I was using.
>>>>
>>>> Mark
>>>>
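To make the streaming argument above concrete, here is one hypothetical split layout: keep the strictly-lower and strictly-upper triangles as separate CSR arrays plus the inverse diagonal, so a half-sweep reads one contiguous index array and one contiguous value array instead of skipping over half of every AIJ row. The type and field names are invented for discussion; this is not an existing PETSc Mat format.

  /* Hypothetical split triangular storage for SSOR (made-up names, not a PETSc type). */
  typedef struct {
    int     m;                        /* number of rows                      */
    int    *li, *lj; double *la;      /* strictly lower triangle in CSR form */
    int    *ui, *uj; double *ua;      /* strictly upper triangle in CSR form */
    double *idiag;                    /* 1/a_ii                              */
  } SSORMat;

  /* Forward SOR sweep with zero initial guess: only the lower arrays are read,
     in order, with no jumps over upper-triangle entries. */
  static void ssor_forward_zero_guess(const SSORMat *A, const double *b,
                                      double *x, double omega)
  {
    for (int i = 0; i < A->m; i++) {
      double sum = b[i];
      for (int k = A->li[i]; k < A->li[i+1]; k++) sum -= A->la[k]*x[A->lj[k]];
      x[i] = omega*sum*A->idiag[i];   /* the (1-omega)*x[i] term is zero here */
    }
  }

A multiply on this layout simply streams the lower and upper index and value arrays (the 4 streams counted above), with the diagonal on top, which is the "pretty good for multiply" case.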
>>>> On Aug 17, 2013, at 5:30 PM, Barry Smith <[email protected]> wrote:
>>>>
>>>>> Mark,
>>>>>
>>>>> Why rich/eisenstat(2)? With Cheb/eisenstat you get the acceleration of Cheby on top of the SOR, but rich/eisenstat(2) just seems like some strange implementation of SOR?
>>>>>
>>>>> I am guessing you are getting your 15% potential improvement from
>>>>>
>>>>> >>> 7.63/11.3
>>>>> 0.6752212389380531
>>>>> >>> .6752212389380531*5.9800e+00
>>>>> 4.037823008849558
>>>>> >>> .63/4.63
>>>>> 0.13606911447084233
>>>>>
>>>>> Anyways I would focus on rich/ssor() smoothing.
>>>>>
>>>>> Before thinking about implementing a new data structure I think it is relatively easy to further reduce the total number of flops done with the ssor(1) smoother (and despite the current conventional wisdom that doing extra flops is a good thing in HPC :-)).
>>>>>
>>>>> Presmoothing: Note that since you are running ssor(1) with a zero initial guess, the current implementation is already good in terms of reducing total flops; it does one down solve (saving the intermediate values) and then one up solve (using those saved values).
>>>>>
>>>>> Postsmoothing: Here it is running SSOR(1) with a nonzero initial guess, and the current implementation in MatSOR_SeqAIJ() is bad because it applies the entire matrix in both the down solve and the up solve.
>>>>>
>>>>>   while (its--) {
>>>>>     if (flag & SOR_FORWARD_SWEEP || flag & SOR_LOCAL_FORWARD_SWEEP) {
>>>>>       for (i=0; i<m; i++) {
>>>>>         n   = a->i[i+1] - a->i[i];
>>>>>         idx = a->j + a->i[i];
>>>>>         v   = a->a + a->i[i];
>>>>>         sum = b[i];
>>>>>         PetscSparseDenseMinusDot(sum,x,v,idx,n);
>>>>>         x[i] = (1. - omega)*x[i] + (sum + mdiag[i]*x[i])*idiag[i];
>>>>>       }
>>>>>       ierr = PetscLogFlops(2.0*a->nz);CHKERRQ(ierr);
>>>>>     }
>>>>>     if (flag & SOR_BACKWARD_SWEEP || flag & SOR_LOCAL_BACKWARD_SWEEP) {
>>>>>       for (i=m-1; i>=0; i--) {
>>>>>         n   = a->i[i+1] - a->i[i];
>>>>>         idx = a->j + a->i[i];
>>>>>         v   = a->a + a->i[i];
>>>>>         sum = b[i];
>>>>>         PetscSparseDenseMinusDot(sum,x,v,idx,n);
>>>>>         x[i] = (1. - omega)*x[i] + (sum + mdiag[i]*x[i])*idiag[i];
>>>>>       }
>>>>>       ierr = PetscLogFlops(2.0*a->nz);CHKERRQ(ierr);
>>>>>     }
>>>>>
>>>>> (Note also that this partially explains why the flop rate in the table for MatSOR() is so good: two of the three ssor triangular solves are actually full matrix-vector products.) What it should do is apply the entire matrix in the down solve (but save the partial sums of the lower triangular part) and then do the up solve applying only 1/2 the matrix, using the accumulated values. So currently the combined pre and post smooth has 3 complete applications of the matrix (.5 and .5 from the pre smooth and 1 and 1 from the post smooth). Adding the optimization will reduce it to 2.5 applications of the matrix, giving a reduction of 1/6 = 17 percent of the smooth flops, or
>>>>>
>>>>> >>> 17.*14/33
>>>>> 7.212121212121212
>>>>>
>>>>> a 7 percent decrease in the total KSPSolve flops.
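A rough sketch of the modified sweeps described above, in plain CSR terms rather than as a patch to MatSOR_SeqAIJ(): the nonzero-initial-guess forward sweep still applies the whole row, but it stashes the strictly-lower partial sums so that the backward sweep only has to traverse the strictly-upper half. All names are made up; it assumes column indices are sorted within each row and that the position of each diagonal entry is known (as SeqAIJ keeps it).

  /* Sketch only: SSOR sweep pair with a nonzero initial guess.
     ai/aj/aa is a CSR matrix including the diagonal, adiag[i] is the
     position of the diagonal entry of row i, lsum/usum are work arrays. */
  static void ssor_sweep_pair(int m, const int *ai, const int *aj, const double *aa,
                              const int *adiag, const double *b, double *x,
                              double omega, double *lsum, double *usum)
  {
    /* Forward sweep: the whole row is applied once, lower sums are cached. */
    for (int i = 0; i < m; i++) {
      double lo = 0.0, up = 0.0;
      for (int k = ai[i]; k < adiag[i]; k++)     lo += aa[k]*x[aj[k]]; /* j < i: already updated */
      for (int k = adiag[i]+1; k < ai[i+1]; k++) up += aa[k]*x[aj[k]]; /* j > i: old values      */
      lsum[i] = lo;            /* still valid when the back-sweep reaches row i */
      x[i]    = (1.0 - omega)*x[i] + omega*(b[i] - lo - up)/aa[adiag[i]];
    }
    /* Backward sweep: only the strictly-upper half of the matrix is touched. */
    for (int i = m-1; i >= 0; i--) {
      double up = 0.0;
      for (int k = adiag[i]+1; k < ai[i+1]; k++) up += aa[k]*x[aj[k]]; /* j > i: final values    */
      usum[i] = up;            /* can be reused by the residual afterwards     */
      x[i]    = (1.0 - omega)*x[i] + omega*(b[i] - lsum[i] - up)/aa[adiag[i]];
    }
  }

This pair costs 1 + 0.5 = 1.5 applications of the matrix instead of the 1 + 1 in the current post smooth, which is the 3 -> 2.5 reduction counted above.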
>>>>>
>>>>> Next we can eliminate 1/2 of the application of the matrix in the residual computation after the smoothing if we have saved the partial sums of the upper triangular solve in the smoother. We can estimate the flops of the matrix-multiply in the residual computation as one application of the matrix and hence equal to 1/3 of the flops in the ssor computations. Eliminating 1/2 of that means eliminating the equivalent of 1/6 of the flops in the MatSOR computation, which is
>>>>>
>>>>> >>> 14./(33*6)
>>>>> 0.0707070707070707
>>>>>
>>>>> (actually the same computation as above :-(), so another 7 percent of the total flops in the KSPSolve. So in total we could save 14 percent of the flops (and lots of memory access) in the KSPSolve for the work of editing a couple of functions.
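A sketch of that residual reuse, continuing the made-up CSR names from the sweep sketch above (and not the actual MatMult_SeqAIJ()/PCMGSetResidual() plumbing): right after a backward sweep, usum[i] = sum_{j>i} a_ij*x_j still matches x, so the residual only has to traverse the lower triangle and the diagonal.

  /* Sketch only: residual r = b - A*x computed immediately after the backward
     sweep above, while the cached upper sums usum[] still match x. */
  static void residual_after_ssor(int m, const int *ai, const int *aj, const double *aa,
                                  const int *adiag, const double *b, const double *x,
                                  const double *usum, double *r)
  {
    for (int i = 0; i < m; i++) {
      double lo = 0.0;
      for (int k = ai[i]; k < adiag[i]; k++) lo += aa[k]*x[aj[k]]; /* lower half: recomputed       */
      r[i] = b[i] - lo - aa[adiag[i]]*x[i] - usum[i];              /* upper half: from the cache   */
    }
  }

The cache is only usable until x (or the matrix) changes, which is exactly the "how would it know when the values are valid?" question raised below.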
>>>>>
>>>>> At that point we could compute the flop rate of all the SSOR triangular solves and the percent of the KSPSolve() time spent in them to determine if a new data structure is warranted. Note that in reducing the total number of flops we are replacing two full matrix-vector products with 2 triangular solves, hence the flop rate will be lower, indicating that using a new data structure would actually help even more.
>>>>>
>>>>> Notes on the implementation:
>>>>>
>>>>> * modifying MatSOR_SeqAIJ() to better handle the nonzero initial guess is straightforward; edit one function
>>>>>
>>>>> * modifying the residual computation in the multigrid to reuse the upper triangular sums requires a tiny bit more thought. Basically MatSOR_SeqAIJ() would have an internal pointer to the accumulated sums for the upper triangular part and we would need to either (1) have a MatMult_SeqAIJ() implementation that would retrieve those values and use them appropriately (how would it know when the values are valid?) or (2) use PCMGSetResidual() to provide a custom residual routine that again pulled out the accumulated sums and used them. Of course we will eventually figure out how to organize it cleanly for users.
>>>>>
>>>>> Note also that even if a new custom data structure is introduced the work outlined above is not lost, since the new data structure still needs those flop-reduction optimizations. So step one, fix MatSOR_SeqAIJ() for the nonzero initial guess and get a roughly 6 percent improvement; step 2, optimize the residual computation and get another 6 percent improvement; step 3, introduce a new data structure and get another roughly 5 percent improvement in KSPSolve() (the 6, 6 and 5 are my guesstimates).
>>>>>
>>>>> I would be very interested in seeing new numbers.
>>>>>
>>>>> Final note: the poor load balancing of this problem will limit how much these optimizations will help. For a perfectly load balanced problem I think the improvements would be a lot higher in percentage (maybe twice as much? more?).
>>>>>
>>>>> Barry
>>>>>
>>>>> On Aug 17, 2013, at 3:16 PM, "Mark F. Adams" <[email protected]> wrote:
>>>>>
>>>>>> I would like to get an idea of how much benefit there would be from using a special matrix type for SOR. Here is an experiment on 128 cores of Hopper (Cray XE6), 7 point stencil, with some embedded BCs that look like higher order stencils at BCs. 32^3 subdomain on each core:
>>>>>>
>>>>>> cheb/jacobi(2)
>>>>>> KSPSolve 15 1.0 5.9800e+00 1.0 1.13e+09 3.2 6.2e+06 1.1e+03 2.8e+02 7 29 67 46 7 26100100100 76 18765
>>>>>>
>>>>>> rich/eisenstat(2)
>>>>>> KSPSolve 15 1.0 1.1563e+01 1.0 1.37e+09 3.4 5.4e+06 1.1e+03 2.8e+02 12 32 66 44 7 38100100100 76 11659
>>>>>>
>>>>>> rich/sor
>>>>>> KSPSolve 15 1.0 4.6355e+00 1.0 7.63e+08 4.5 3.2e+06 1.0e+03 3.1e+02 10 21 57 31 8 33100100100 77 15708
>>>>>>
>>>>>> Complete log files attached. The "projection" solve is the solver of interest.
>>>>>>
>>>>>> I have 2 Jacobi iterations so that it has about the same amount of work as one (s)sor. There are voids in the domain, which I believe accounts for the large differences in the number of flops per core. These were run with the same processor group (i.e., all runs were done in the same qsub script).
>>>>>>
>>>>>> This shows only about 15% potential gain. Should we conclude that there is not much to gain from an optimized data structure?
>>>>>>
>>>>>> Mark
>>>>>> <log_eis><log_jac><log_sor>
>>>>>>
>>>>>> On Aug 16, 2013, at 7:53 PM, Jed Brown <[email protected]> wrote:
>>>>>>
>>>>>>> "Mark F. Adams" <[email protected]> writes:
>>>>>>>> Some hypre papers have shown that cheb/jacobi is faster for some problems, but for me robustness trumps this for default solver parameters in PETSc.
>>>>>>>
>>>>>>> Richardson/SOR is about the best thing out there in terms of reasonable local work, low/no setup cost, and reliable convergence. Cheby/Jacobi just has trivial fine-grained parallelism, but it's not clear that buys anything on a CPU.
>>>>>>>
>>>>>>>> Jed's analysis suggests that Eisenstat's method saves almost 50% work but needs a specialized matrix to get good flop rates. Something to think about doing …
>>>>>>>
>>>>>>> Mine was too sloppy, Barry got it right. Eisenstat is for Cheby/SOR, however, and doesn't do anything for Richardson.
>>>>>>>
>>>>>>> To speed Richardson/SSOR up with a new matrix format, I think we have to cache the action of the lower triangular part in the forward sweep so that the back-sweep can use it, and vice-versa. With full caching and triangular residual optimization, I think this brings 2 SSOR iterations of the down-smoother plus a residual to 2.5 work units in the down-smooth (zero initial guess) and 3 work units in the up-smooth (nonzero initial guess). (This is a strong smoother and frequently, one SSOR would be enough.)
>>>>>>
>>>>>
>>>>
>>>
>>
>
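One way the "new matrix format" idea above might carry its cache, combining the split-triangle layout sketched earlier in this thread with the half-row sums and an explicit validity flag. Again a discussion sketch with invented names, not a proposed PETSc Mat implementation:

  /* Hypothetical cached SSOR format: split triangles plus the half-row sums
     that the sweeps and the residual want to reuse.  Invented names. */
  typedef struct {
    int     m;
    int    *li, *lj; double *la;       /* strictly lower triangle (CSR)                    */
    int    *ui, *uj; double *ua;       /* strictly upper triangle (CSR)                    */
    double *idiag;                     /* 1/a_ii                                           */
    double *lsum, *usum;               /* cached sum_{j<i} a_ij x_j and sum_{j>i} a_ij x_j */
    int     lsum_valid, usum_valid;    /* flags: which cache matches the current x         */
  } SSORMatCached;

A forward sweep would consume usum (when valid) and refill lsum; a backward sweep the reverse; the residual after the up-smooth consumes usum. Anything else that touches x (the coarse-grid correction, a Krylov update) has to clear both flags, which is the bookkeeping behind the "how would it know when the values are valid?" question above.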
