Run a bit bigger problem. The results shouldn't be that noisy, unless you are playing some game in another window :-).
Barry

On Aug 23, 2013, at 11:52 AM, Mark F. Adams <[email protected]> wrote:

> This code looks funny in MatSOR_SeqAIJ:
>
>    x[i] = (1-omega)*x[i] + sum*idiag[i];
>
> shouldn't this be:
>
>    x[i] = (1-omega)*x[i] + omega*sum*idiag[i];
>
> I've done the first optimization (for the non-blocked version) and get this
> (running each many times because my Mac is a little noisy). Flop rates go
> down a bit and total flops go down a lot (perhaps I have a flop-counting
> bug?). 6 old runs, then 7 new ones:
>
> 12:40 dummy ~/Codes/petsc/src/ksp/ksp/examples/tutorials$ make -f make-local runex54 | grep MatSOR
> MatSOR            60 1.0 1.3511e-01 1.0 1.49e+08 1.0 0.0e+00 0.0e+00 0.0e+00 12 34  0  0  0  14 34  0  0  0  1105
> 12:40 dummy ~/Codes/petsc/src/ksp/ksp/examples/tutorials$ make -f make-local runex54 | grep MatSOR
> MatSOR            60 1.0 1.3611e-01 1.0 1.49e+08 1.0 0.0e+00 0.0e+00 0.0e+00 12 34  0  0  0  14 34  0  0  0  1097
> 12:40 dummy ~/Codes/petsc/src/ksp/ksp/examples/tutorials$ make -f make-local runex54 | grep MatSOR
> MatSOR            60 1.0 1.3464e-01 1.0 1.49e+08 1.0 0.0e+00 0.0e+00 0.0e+00 12 34  0  0  0  14 34  0  0  0  1109
> 12:40 dummy ~/Codes/petsc/src/ksp/ksp/examples/tutorials$ make -f make-local runex54 | grep MatSOR
> MatSOR            60 1.0 1.3487e-01 1.0 1.49e+08 1.0 0.0e+00 0.0e+00 0.0e+00 12 34  0  0  0  14 34  0  0  0  1107
> 12:41 dummy ~/Codes/petsc/src/ksp/ksp/examples/tutorials$ make -f make-local runex54 | grep MatSOR
> MatSOR            60 1.0 1.3629e-01 1.0 1.49e+08 1.0 0.0e+00 0.0e+00 0.0e+00 12 34  0  0  0  15 34  0  0  0  1095
> 12:41 dummy ~/Codes/petsc/src/ksp/ksp/examples/tutorials$ make -f make-local runex54 | grep MatSOR
> MatSOR            60 1.0 1.3492e-01 1.0 1.49e+08 1.0 0.0e+00 0.0e+00 0.0e+00 12 34  0  0  0  14 34  0  0  0  1107
> 12:41 dummy ~/Codes/petsc/src/ksp/ksp/examples/tutorials$ make -f make-local runex54 | grep MatSOR
> MatSOR            60 1.0 1.2776e-01 1.0 1.16e+08 1.0 0.0e+00 0.0e+00 0.0e+00 11 29  0  0  0  14 29  0  0  0   910
> 12:41 madams/sor-opt ~/Codes/petsc/src/ksp/ksp/examples/tutorials$ make -f make-local runex54 | grep MatSOR
> MatSOR            60 1.0 1.2809e-01 1.0 1.16e+08 1.0 0.0e+00 0.0e+00 0.0e+00 12 29  0  0  0  14 29  0  0  0   908
> 12:41 madams/sor-opt ~/Codes/petsc/src/ksp/ksp/examples/tutorials$ make -f make-local runex54 | grep MatSOR
> MatSOR            60 1.0 1.2765e-01 1.0 1.16e+08 1.0 0.0e+00 0.0e+00 0.0e+00 12 29  0  0  0  14 29  0  0  0   911
> 12:41 madams/sor-opt ~/Codes/petsc/src/ksp/ksp/examples/tutorials$ make -f make-local runex54 | grep MatSOR
> MatSOR            60 1.0 1.3131e-01 1.0 1.16e+08 1.0 0.0e+00 0.0e+00 0.0e+00 12 29  0  0  0  14 29  0  0  0   886
> 12:41 madams/sor-opt ~/Codes/petsc/src/ksp/ksp/examples/tutorials$ make -f make-local runex54 | grep MatSOR
> MatSOR            60 1.0 1.2792e-01 1.0 1.16e+08 1.0 0.0e+00 0.0e+00 0.0e+00 12 29  0  0  0  14 29  0  0  0   909
> 12:41 madams/sor-opt ~/Codes/petsc/src/ksp/ksp/examples/tutorials$ make -f make-local runex54 | grep MatSOR
> MatSOR            60 1.0 1.2913e-01 1.0 1.16e+08 1.0 0.0e+00 0.0e+00 0.0e+00 12 29  0  0  0  14 29  0  0  0   901
> 12:41 madams/sor-opt ~/Codes/petsc/src/ksp/ksp/examples/tutorials$ make -f make-local runex54 | grep MatSOR
> MatSOR            60 1.0 1.2889e-01 1.0 1.16e+08 1.0 0.0e+00 0.0e+00 0.0e+00 12 29  0  0  0  14 29  0  0  0   902
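For reference on where the relaxation factor belongs, here is a minimal sketch of the textbook pointwise SOR update, written as a hypothetical dense helper (the function and variable names are illustrative; this is not the PETSc sparse code, and idiag[i] stands for 1/a_ii as in MatSOR_SeqAIJ). In this form omega scales the whole Gauss-Seidel value, which is the reading suggested in the question above; for omega = 1 the two variants quoted above coincide, so any difference only shows up for omega != 1.

  /* Hypothetical dense example of one forward SOR sweep, for reference only.
     a is an n-by-n row-major matrix, idiag[i] = 1/a[i*n+i]. */
  static void sor_forward_sweep_dense(int n, const double *a, const double *idiag,
                                      const double *b, double *x, double omega)
  {
    for (int i = 0; i < n; i++) {
      double sum = b[i];
      for (int j = 0; j < n; j++) {
        if (j != i) sum -= a[i*n + j]*x[j];   /* x[j] already updated for j < i */
      }
      /* omega blends the Gauss-Seidel value sum/a_ii with the previous iterate */
      x[i] = (1.0 - omega)*x[i] + omega*sum*idiag[i];
    }
  }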
>
> On Aug 20, 2013, at 6:40 PM, Barry Smith <[email protected]> wrote:
>
>> On Aug 20, 2013, at 4:39 PM, Jed Brown <[email protected]> wrote:
>>
>>> Hmm, less clear flops benefit.  Half of these are nonzero initial guess,
>>> for which triangular storage would be worse.
>>
>> Is it clear that triangular storage would be particularly worse? I think
>> only a small percentage worse for a multiply.  The problem with CSR for
>> triangular solves is that for each new row one needs to jump some amount in
>> the index and double arrays, wasting cache lines and limiting streaming.
>> With a multiply on triangular storage there is no jumping, just streaming
>> two different index and double arrays, and current CPUs have no problem
>> managing 4 streams.
>>
>> In other words: CSR is good for multiply, bad for triangular solve;
>> triangular storage is good for triangular solve and pretty good for
>> multiply.
>>
>> Barry
>>
>>> (unless you add an Eisenstat optimization).
>>>
>>> On Aug 20, 2013 4:26 PM, "Mark F. Adams" <[email protected]> wrote:
>>> Barry,
>>>
>>> These are tests using the well load balanced problems in KSP on my Mac:
>>>
>>> 3D Vector (ex56)
>>>
>>> 8 proc:
>>>
>>> rich/ssor(1)
>>>> KSPSolve     1 1.0 3.0143e-01 1.0 7.98e+07 1.1 1.0e+04 7.2e+02 8.5e+01 19 26 26 16  8 100100100100100  2047
>>>
>>> cheb/jac(2)
>>>> KSPSolve     1 1.0 2.5836e-01 1.0 6.87e+07 1.1 1.4e+04 7.3e+02 8.8e+01 17 25 27 19  8 100100100100100  2053
>>>
>>> 1 proc:
>>>
>>> rich/ssor(1)
>>>> KSPSolve     1 1.0 1.8541e-01 1.0 3.20e+08 1.0 0.0e+00 0.0e+00 0.0e+00 10 22  0  0  0 100100  0  0  0  1727
>>>
>>> cheb/jac(2)
>>>> KSPSolve     1 1.0 2.4841e-01 1.0 4.65e+08 1.0 0.0e+00 0.0e+00 0.0e+00 12 24  0  0  0 100100  0  0  0  1870
>>>
>>> 2D Scalar (ex54), 4 procs:
>>>
>>> cheb/jac(2)
>>>> KSPSolve     1 1.0 2.3614e-01 1.0 3.51e+07 1.0 2.6e+03 8.7e+02 7.8e+02 91100 98 93 95 100100100100100   592
>>>
>>> rich/ssor(1)
>>>> KSPSolve     1 1.0 2.0144e-01 1.0 2.29e+07 1.0 1.8e+03 8.8e+02 7.1e+02 89100 98 90 95 100100100100100   453
>>>
>>> I'm not sure if this matters, but I wanted to get off of the funny test
>>> problem that I was using.
>>>
>>> Mark
>>>
>>> On Aug 17, 2013, at 5:30 PM, Barry Smith <[email protected]> wrote:
>>>
>>>> Mark,
>>>>
>>>> Why rich/eisenstat(2)?  With cheb/eisenstat you get the acceleration of
>>>> Cheby on top of the SOR, but rich/eisenstat(2) just seems like some
>>>> strange implementation of SOR.
>>>>
>>>> I am guessing you are getting your 15% potential improvement from
>>>>
>>>>>>> 7.63/11.3
>>>> 0.6752212389380531
>>>>>>> .6752212389380531*5.9800e+00
>>>> 4.037823008849558
>>>>>>> .63/4.63
>>>> 0.13606911447084233
>>>>
>>>> Anyways I would focus on rich/ssor() smoothing.
>>>>
>>>> Before thinking about implementing a new data structure, I think it is
>>>> relatively easy to further reduce the total number of flops done with the
>>>> ssor(1) smoother [and this despite the current conventional wisdom that
>>>> doing extra flops is a good thing in HPC :-)].
>>>>
>>>> Presmoothing: Note that since you are running ssor(1) with a zero initial
>>>> guess, the current implementation is already good in terms of reducing
>>>> total flops; it does one down solve (saving the intermediate values) and
>>>> then one up solve (using those saved values).
>>>>
>>>> Postsmoothing: Here it is running SSOR(1) with a nonzero initial guess,
>>>> and the current implementation in MatSOR_SeqAIJ() is bad because it
>>>> applies the entire matrix in both the down solve and the up solve:
>>>>
>>>>   while (its--) {
>>>>     if (flag & SOR_FORWARD_SWEEP || flag & SOR_LOCAL_FORWARD_SWEEP) {
>>>>       for (i=0; i<m; i++) {
>>>>         n    = a->i[i+1] - a->i[i];
>>>>         idx  = a->j + a->i[i];
>>>>         v    = a->a + a->i[i];
>>>>         sum  = b[i];
>>>>         PetscSparseDenseMinusDot(sum,x,v,idx,n);
>>>>         x[i] = (1. - omega)*x[i] + (sum + mdiag[i]*x[i])*idiag[i];
>>>>       }
>>>>       ierr = PetscLogFlops(2.0*a->nz);CHKERRQ(ierr);
>>>>     }
>>>>     if (flag & SOR_BACKWARD_SWEEP || flag & SOR_LOCAL_BACKWARD_SWEEP) {
>>>>       for (i=m-1; i>=0; i--) {
>>>>         n    = a->i[i+1] - a->i[i];
>>>>         idx  = a->j + a->i[i];
>>>>         v    = a->a + a->i[i];
>>>>         sum  = b[i];
>>>>         PetscSparseDenseMinusDot(sum,x,v,idx,n);
>>>>         x[i] = (1. - omega)*x[i] + (sum + mdiag[i]*x[i])*idiag[i];
>>>>       }
>>>>       ierr = PetscLogFlops(2.0*a->nz);CHKERRQ(ierr);
>>>>     }
>>>>   }
>>>>
>>>> (Note also that this partially explains why the flop rate in the table
>>>> for MatSOR() is so good: two of the three ssor triangular solves are
>>>> actually full matrix-vector products.)  What it should do is apply the
>>>> entire matrix in the down solve (but save the partial sums of the lower
>>>> triangular part) and then do the upper solve applying only 1/2 the
>>>> matrix, using the accumulated values.  So currently the combined pre and
>>>> post smooth has 3 complete applications of the matrix (.5 and .5 from the
>>>> pre smooth and 1 and 1 from the post smooth).  Adding the optimization
>>>> will reduce it to 2.5 applications of the matrix, giving a reduction of
>>>> 1/6 = 17 percent of the smooth flops, or
>>>>
>>>>>>> 17.*14/33
>>>> 7.212121212121212
>>>>
>>>> a 7 percent decrease in the total KSPSolve flops.
>>>>
>>>> Next we can eliminate 1/2 of the application of the matrix in the
>>>> residual computation after the smoothing if we have saved the partial
>>>> sums of the upper triangular solve in the smoother.  We can estimate the
>>>> flops of the matrix multiply in the residual computation as one
>>>> application of the matrix and hence equal to 1/3 of the flops in the ssor
>>>> computations.  Eliminating 1/2 of that means eliminating the equivalent
>>>> of 1/6 of the flops in the MatSOR computation, which is
>>>>
>>>>>>> 14./(33*6)
>>>> 0.0707070707070707
>>>>
>>>> (actually the same computation as above :-), so another 7 percent of the
>>>> total flops in the KSPSolve.  So in total we could save 14 percent of the
>>>> flops (and lots of memory access) in the KSPSolve at the cost of editing
>>>> a couple of functions.
>>>>
>>>> At that point we could compute the flop rate of all the SSOR triangular
>>>> solves and the percentage of the KSPSolve() time spent in them to
>>>> determine whether a new data structure is warranted.  Note that in
>>>> reducing the total number of flops we are replacing two full
>>>> matrix-vector products with two triangular solves, hence the flop rate
>>>> will be lower, indicating that a new data structure would actually help
>>>> even more.
>>>>
>>>> Notes on the implementation:
>>>>
>>>> * Modifying MatSOR_SeqAIJ() to better handle the nonzero initial guess is
>>>>   straightforward; edit one function.
>>>>
>>>> * Modifying the residual computation in the multigrid to reuse the upper
>>>>   triangular sums requires a tiny bit more thought.  Basically
>>>>   MatSOR_SeqAIJ() would have an internal pointer to the accumulated sums
>>>>   for the upper triangular part and we would need to either (1) have a
>>>>   MatMult_SeqAIJ() implementation that would retrieve those values and
>>>>   use them appropriately (how would it know when the values are valid?)
>>>>   or (2) use PCMGSetResidual() to provide a custom residual routine that
>>>>   again pulled out the accumulated sums and used them.  Of course we will
>>>>   eventually figure out how to organize it cleanly for users.
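To make the flop accounting above concrete, here is a minimal sketch of the nonzero-initial-guess SSOR sweep with the cached lower-triangular partial sums described above. It is written against plain CSR arrays with sorted column indices and a known diagonal position; the helper name, the scratch array t, and the placement of omega (textbook form, per the question at the top of the thread) are assumptions for illustration, not the current MatSOR_SeqAIJ code.

  #include <petscsys.h>

  /* Sketch only: forward sweep applies each full row once (one work unit) and
     caches t[i] = b[i] - sum_{j<i} a_ij x_j; the backward sweep then touches
     only the upper triangle (half a work unit).  ai/aj/aa are CSR row starts,
     column indices, and values; diag[i] is the position of the diagonal entry
     of row i in aj/aa; idiag[i] = 1/a_ii; t is caller-provided scratch of
     length m. */
  static void SSORApplyNonzeroGuess(PetscInt m, const PetscInt *ai, const PetscInt *aj,
                                    const PetscScalar *aa, const PetscInt *diag,
                                    const PetscScalar *idiag, const PetscScalar *b,
                                    PetscScalar *x, PetscScalar *t, PetscReal omega)
  {
    PetscInt i, k;

    /* forward sweep: full row applied, lower-triangular partial sum cached */
    for (i = 0; i < m; i++) {
      PetscScalar lsum = b[i], usum = 0.0;
      for (k = ai[i]; k < diag[i]; k++)       lsum -= aa[k]*x[aj[k]];  /* j < i, updated x */
      for (k = diag[i] + 1; k < ai[i+1]; k++) usum += aa[k]*x[aj[k]];  /* j > i, old x     */
      x[i] = (1.0 - omega)*x[i] + omega*(lsum - usum)*idiag[i];
      t[i] = lsum;                      /* = b[i] - sum_{j<i} a_ij x_j, reused below */
    }

    /* backward sweep: only the upper triangle is applied; rows j < i still
       hold their forward-sweep values, so t[i] is still valid here */
    for (i = m - 1; i >= 0; i--) {
      PetscScalar sum = t[i];
      for (k = diag[i] + 1; k < ai[i+1]; k++) sum -= aa[k]*x[aj[k]];
      x[i] = (1.0 - omega)*x[i] + omega*sum*idiag[i];
    }
  }

Caching the analogous upper-triangular sums during the final backward sweep is what would then let the following residual computation skip half of the MatMult, as in the second optimization described above.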
>>>>
>>>> Note also that even if a new custom data structure is introduced, the
>>>> work outlined above is not lost, since the new data structure still needs
>>>> those flop-reduction optimizations.  So step 1, fix MatSOR_SeqAIJ() for
>>>> nonzero initial guess and get a roughly 6 percent improvement; step 2,
>>>> optimize the residual computation and get another 6 percent improvement;
>>>> step 3, introduce a new data structure and get another roughly 5 percent
>>>> improvement in KSPSolve() (the 6, 6 and 5 are my guesstimates).
>>>>
>>>> I would be very interested in seeing new numbers.
>>>>
>>>> Final note: the severe load imbalance of this problem will limit how much
>>>> these optimizations will help.  For a perfectly load balanced problem I
>>>> think these optimizations would be a lot higher in percentage (maybe
>>>> twice as much? more?).
>>>>
>>>> Barry
>>>>
>>>> On Aug 17, 2013, at 3:16 PM, "Mark F. Adams" <[email protected]> wrote:
>>>>
>>>>> I would like to get an idea of how much benefit there would be in using
>>>>> a special matrix type for SOR.  Here is an experiment on 128 cores of
>>>>> Hopper (Cray XE6), 7-point stencil, with some embedded BCs that look
>>>>> like higher-order stencils at the BCs, and a 32^3 subdomain on each core:
>>>>>
>>>>> cheb/jacobi(2)
>>>>> KSPSolve    15 1.0 5.9800e+00 1.0 1.13e+09 3.2 6.2e+06 1.1e+03 2.8e+02  7 29 67 46  7  26100100100 76 18765
>>>>>
>>>>> rich/eisenstat(2)
>>>>> KSPSolve    15 1.0 1.1563e+01 1.0 1.37e+09 3.4 5.4e+06 1.1e+03 2.8e+02 12 32 66 44  7  38100100100 76 11659
>>>>>
>>>>> rich/sor
>>>>> KSPSolve    15 1.0 4.6355e+00 1.0 7.63e+08 4.5 3.2e+06 1.0e+03 3.1e+02 10 21 57 31  8  33100100100 77 15708
>>>>>
>>>>> Complete log files attached.  The "projection" solve is the solver of
>>>>> interest.
>>>>>
>>>>> I use 2 Jacobi iterations so that it has about the same amount of work
>>>>> as one (s)sor.  There are voids in the domain, which I believe accounts
>>>>> for the large differences in the number of flops per core.  These were
>>>>> run with the same processor group (i.e., all runs done in the same qsub
>>>>> script).
>>>>>
>>>>> This shows only about a 15% potential gain.  Should we conclude that
>>>>> there is not much to gain from an optimized data structure?
>>>>>
>>>>> Mark
>>>>>
>>>>> <log_eis><log_jac><log_sor>
>>>>>
>>>>> On Aug 16, 2013, at 7:53 PM, Jed Brown <[email protected]> wrote:
>>>>>
>>>>>> "Mark F. Adams" <[email protected]> writes:
>>>>>>> Some hypre papers have shown that cheb/jacobi is faster for some
>>>>>>> problems, but for me robustness trumps this for default solver
>>>>>>> parameters in PETSc.
>>>>>>
>>>>>> Richardson/SOR is about the best thing out there in terms of reasonable
>>>>>> local work, low/no setup cost, and reliable convergence.  Cheby/Jacobi
>>>>>> just has trivial fine-grained parallelism, but it's not clear that buys
>>>>>> anything on a CPU.
>>>>>>
>>>>>>> Jed's analysis suggests that Eisenstat's method saves almost 50% of
>>>>>>> the work but needs a specialized matrix to get good flop rates.
>>>>>>> Something to think about doing …
>>>>>>
>>>>>> Mine was too sloppy; Barry got it right.  Eisenstat is for Cheby/SOR,
>>>>>> however, and doesn't do anything for Richardson.
>>>>>>
>>>>>> To speed Richardson/SSOR up with a new matrix format, I think we have
>>>>>> to cache the action of the lower triangular part in the forward sweep
>>>>>> so that the back-sweep can use it, and vice-versa.
>>>>>> With full caching and the triangular residual optimization, I think
>>>>>> this brings 2 SSOR iterations of the down-smoother plus a residual to
>>>>>> 2.5 work units in the down-smooth (zero initial guess) and 3 work units
>>>>>> in the up-smooth (nonzero initial guess).  (This is a strong smoother,
>>>>>> and frequently one SSOR would be enough.)
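One way to picture what such a new matrix format might hold (hypothetical type and field names; not an existing PETSc matrix type): keep the strictly lower and strictly upper triangles as separate CSR-like streams together with the inverted diagonal, plus scratch for the partial sums cached between sweeps.

  #include <petscsys.h>

  /* Hypothetical split-triangular storage for an SSOR-oriented matrix type.
     Names and layout are illustrative only. */
  typedef struct {
    PetscInt     m;            /* number of local rows                          */
    PetscInt    *li, *lj;      /* CSR row starts / column indices, strictly lower L */
    PetscScalar *la;           /* values of L                                   */
    PetscInt    *ui, *uj;      /* CSR row starts / column indices, strictly upper U */
    PetscScalar *ua;           /* values of U                                   */
    PetscScalar *idiag;        /* 1/a_ii                                        */
    PetscScalar *lsum, *usum;  /* partial sums cached by the forward/backward sweeps */
  } Mat_SplitSOR;

With this layout a multiply streams the L and U index/value arrays in order, with no per-row jumping between the lower and upper halves of a row (the "pretty good for multiply" case above), while each sweep touches only one triangle plus the sums cached by the other, and a residual computation can reuse those same cached sums.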
