On Aug 23, 2013, at 11:52 AM, Mark F. Adams <[email protected]> wrote:
> This code looks funny in MatSOR_SeqAIJ:
>
> x[i] = (1-omega)*x[i] + sum*idiag[i];
>
> shouldn't this be:
>
> x[i] = (1-omega)*x[i] + omega*sum*idiag[i];
Mark,
The idiag computed in MatInvertDiagonal_SeqAIJ() already has the omega
stuff in it. I believe the code is correct.
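
For reference, a minimal sketch of the idea (illustration only, not the actual
PETSc source, which also handles a diagonal shift and caches the result): with
omega folded into idiag, (1 - omega)*x[i] + sum*idiag[i] is just the textbook
SOR update (1 - omega)*x[i] + omega*sum/a_ii.

#include <petscsys.h>   /* PetscInt, PetscScalar */

/* Hypothetical helper, sketch only: build the omega-scaled inverse of the
   diagonal; aa[] are the matrix values and adiag[i] is assumed to hold the
   position of the diagonal entry of row i. */
static void BuildScaledInverseDiagonal(PetscInt m,const PetscScalar *aa,
                                       const PetscInt *adiag,PetscScalar omega,
                                       PetscScalar *idiag)
{
  PetscInt i;
  for (i=0; i<m; i++) idiag[i] = omega/aa[adiag[i]];   /* omega / a_ii */
}
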
Barry
>
> and I've done the first optimization (for the non-blocked version) and get
> this (running each many times because my Mac is a little noisy). Flop rates
> go down a bit and total flops go down a lot (perhaps I have a flop counting
> bug?). 6 old runs, then 7 new ones:
>
> 12:40 dummy ~/Codes/petsc/src/ksp/ksp/examples/tutorials$ make -f make-local runex54 | grep MatSOR
> MatSOR 60 1.0 1.3511e-01 1.0 1.49e+08 1.0 0.0e+00 0.0e+00 0.0e+00 12 34 0 0 0 14 34 0 0 0 1105
> 12:40 dummy ~/Codes/petsc/src/ksp/ksp/examples/tutorials$ make -f make-local runex54 | grep MatSOR
> MatSOR 60 1.0 1.3611e-01 1.0 1.49e+08 1.0 0.0e+00 0.0e+00 0.0e+00 12 34 0 0 0 14 34 0 0 0 1097
> 12:40 dummy ~/Codes/petsc/src/ksp/ksp/examples/tutorials$ make -f make-local runex54 | grep MatSOR
> MatSOR 60 1.0 1.3464e-01 1.0 1.49e+08 1.0 0.0e+00 0.0e+00 0.0e+00 12 34 0 0 0 14 34 0 0 0 1109
> 12:40 dummy ~/Codes/petsc/src/ksp/ksp/examples/tutorials$ make -f make-local runex54 | grep MatSOR
> MatSOR 60 1.0 1.3487e-01 1.0 1.49e+08 1.0 0.0e+00 0.0e+00 0.0e+00 12 34 0 0 0 14 34 0 0 0 1107
> 12:41 dummy ~/Codes/petsc/src/ksp/ksp/examples/tutorials$ make -f make-local runex54 | grep MatSOR
> MatSOR 60 1.0 1.3629e-01 1.0 1.49e+08 1.0 0.0e+00 0.0e+00 0.0e+00 12 34 0 0 0 15 34 0 0 0 1095
> 12:41 dummy ~/Codes/petsc/src/ksp/ksp/examples/tutorials$ make -f make-local runex54 | grep MatSOR
> MatSOR 60 1.0 1.3492e-01 1.0 1.49e+08 1.0 0.0e+00 0.0e+00 0.0e+00 12 34 0 0 0 14 34 0 0 0 1107
> 12:41 dummy ~/Codes/petsc/src/ksp/ksp/examples/tutorials$ make -f make-local runex54 | grep MatSOR
> MatSOR 60 1.0 1.2776e-01 1.0 1.16e+08 1.0 0.0e+00 0.0e+00 0.0e+00 11 29 0 0 0 14 29 0 0 0 910
> 12:41 madams/sor-opt ~/Codes/petsc/src/ksp/ksp/examples/tutorials$ make -f make-local runex54 | grep MatSOR
> MatSOR 60 1.0 1.2809e-01 1.0 1.16e+08 1.0 0.0e+00 0.0e+00 0.0e+00 12 29 0 0 0 14 29 0 0 0 908
> 12:41 madams/sor-opt ~/Codes/petsc/src/ksp/ksp/examples/tutorials$ make -f make-local runex54 | grep MatSOR
> MatSOR 60 1.0 1.2765e-01 1.0 1.16e+08 1.0 0.0e+00 0.0e+00 0.0e+00 12 29 0 0 0 14 29 0 0 0 911
> 12:41 madams/sor-opt ~/Codes/petsc/src/ksp/ksp/examples/tutorials$ make -f make-local runex54 | grep MatSOR
> MatSOR 60 1.0 1.3131e-01 1.0 1.16e+08 1.0 0.0e+00 0.0e+00 0.0e+00 12 29 0 0 0 14 29 0 0 0 886
> 12:41 madams/sor-opt ~/Codes/petsc/src/ksp/ksp/examples/tutorials$ make -f make-local runex54 | grep MatSOR
> MatSOR 60 1.0 1.2792e-01 1.0 1.16e+08 1.0 0.0e+00 0.0e+00 0.0e+00 12 29 0 0 0 14 29 0 0 0 909
> 12:41 madams/sor-opt ~/Codes/petsc/src/ksp/ksp/examples/tutorials$ make -f make-local runex54 | grep MatSOR
> MatSOR 60 1.0 1.2913e-01 1.0 1.16e+08 1.0 0.0e+00 0.0e+00 0.0e+00 12 29 0 0 0 14 29 0 0 0 901
> 12:41 madams/sor-opt ~/Codes/petsc/src/ksp/ksp/examples/tutorials$ make -f make-local runex54 | grep MatSOR
> MatSOR 60 1.0 1.2889e-01 1.0 1.16e+08 1.0 0.0e+00 0.0e+00 0.0e+00 12 29 0 0 0 14 29 0 0 0 902
>
>
> On Aug 20, 2013, at 6:40 PM, Barry Smith <[email protected]> wrote:
>
>>
>> On Aug 20, 2013, at 4:39 PM, Jed Brown <[email protected]> wrote:
>>
>>> Hmm, less clear flops benefit. Half of these are nonzero initial guess, for
>>> which triangular storage would be worse
>>
>> Is it clear that triangular storage would be particularly worse? I think
>> only a small percentage worse for a multiply. The problem with CSR for
>> triangular solves is that for each new row one needs to jump some amount in
>> the index and double arrays, wasting cache lines and limiting streaming.
>> With a multiply using triangular storage there is no jumping, just
>> streaming two different index and double arrays, and current CPUs have no
>> problem managing 4 streams.
>>
>> In other words CSR good for multiply, bad for triangular solve; triangular
>> storage good for triangular solve and pretty good for multiply.
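>>
>> As a rough sketch of the point (hypothetical storage layout, not an
>> existing PETSc format): keep the strictly lower and strictly upper parts in
>> their own CSR-style arrays plus the diagonal, and the multiply simply
>> streams both index/value array pairs with no jumping:
>>
>> #include <petscsys.h>   /* PetscInt, PetscScalar */
>>
>> /* Sketch only: y = (L + D + U) x with the strictly lower part in li/lj/la,
>>    the strictly upper part in ui/uj/ua, and the diagonal in diag[]; each
>>    part is traversed contiguously. */
>> static void MatMultTriangularStorage(PetscInt m,
>>               const PetscInt *li,const PetscInt *lj,const PetscScalar *la,
>>               const PetscInt *ui,const PetscInt *uj,const PetscScalar *ua,
>>               const PetscScalar *diag,const PetscScalar *x,PetscScalar *y)
>> {
>>   PetscInt i,k;
>>   for (i=0; i<m; i++) {
>>     PetscScalar sum = diag[i]*x[i];
>>     for (k=li[i]; k<li[i+1]; k++) sum += la[k]*x[lj[k]];  /* stream L */
>>     for (k=ui[i]; k<ui[i+1]; k++) sum += ua[k]*x[uj[k]];  /* stream U */
>>     y[i] = sum;
>>   }
>> }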
>>
>> Barry
>>
>>> (unless you add an Eisenstat optimization).
>>>
>>> On Aug 20, 2013 4:26 PM, "Mark F. Adams" <[email protected]> wrote:
>>> Barry,
>>>
>>> These are tests using the well load balanced problems in KSP on my Mac:
>>>
>>> 3D Vector (ex56)
>>>
>>> 8 proc:
>>>
>>> rich/ssor(1)
>>>> KSPSolve 1 1.0 3.0143e-01 1.0 7.98e+07 1.1 1.0e+04 7.2e+02 8.5e+01 19 26 26 16 8 100100100100100 2047
>>>
>>> cheb/jac(2)
>>>> KSPSolve 1 1.0 2.5836e-01 1.0 6.87e+07 1.1 1.4e+04 7.3e+02 8.8e+01 17 25 27 19 8 100100100100100 2053
>>>
>>> 1 proc:
>>>
>>> rich/ssor(1)
>>>> KSPSolve 1 1.0 1.8541e-01 1.0 3.20e+08 1.0 0.0e+00 0.0e+00 0.0e+00 10 22 0 0 0 100100 0 0 0 1727
>>>
>>> cheb/jac(2)
>>>> KSPSolve 1 1.0 2.4841e-01 1.0 4.65e+08 1.0 0.0e+00 0.0e+00 0.0e+00 12 24 0 0 0 100100 0 0 0 1870
>>>
>>> 2D Scalar (ex54) 4 procs
>>>
>>> cheb/jac(2)
>>>> KSPSolve 1 1.0 2.3614e-01 1.0 3.51e+07 1.0 2.6e+03 8.7e+02 7.8e+02 91100 98 93 95 100100100100100 592
>>>
>>> rich/ssor(1)
>>>> KSPSolve 1 1.0 2.0144e-01 1.0 2.29e+07 1.0 1.8e+03 8.8e+02 7.1e+02 89100 98 90 95 100100100100100 453
>>>
>>> I'm not sure if this matters but I wanted to get off of the funny test
>>> problem that I was using.
>>>
>>> Mark
>>>
>>> On Aug 17, 2013, at 5:30 PM, Barry Smith <[email protected]> wrote:
>>>
>>>>
>>>> Mark,
>>>>
>>>> Why rich/eisenstat(2)? With cheb/eisenstat you get the acceleration of
>>>> Chebyshev on top of the SOR, but rich/eisenstat(2) just seems like some
>>>> strange implementation of SOR?
>>>>
>>>> I am guessing you are getting your 15% potential improvement from
>>>>
>>>>>>> 7.63/11.3
>>>> 0.6752212389380531
>>>>>>> .6752212389380531*5.9800e+00
>>>> 4.037823008849558
>>>>>>> .63/4.63
>>>> 0.13606911447084233
>>>>
>>>> Anyways I would focus on rich/ssor() smoothing.
>>>>
>>>> Before thinking about implementing a new data structure I think it is
>>>> relatively easy to further reduce the total number of flops done with the
>>>> ssor(1) smoother [and this despite the current conventional wisdom that
>>>> doing extra flops is a good thing in HPC :-)].
>>>>
>>>> Presmoothing: Note that since you are running ssor(1) and zero initial
>>>> guess the current implementation is already good in terms of reducing
>>>> total flops; it does one down solve (saving the intermediate values) and
>>>> then one up solve (using those saved values).
>>>>
>>>> Postsmoothing: Here it is running SSOR(1) with a nonzero initial guess;
>>>> the current implementation in MatSOR_SeqAIJ() is bad because it applies
>>>> the entire matrix in both the down solve and the up solve.
>>>>
>>>> while (its--) {
>>>>   if (flag & SOR_FORWARD_SWEEP || flag & SOR_LOCAL_FORWARD_SWEEP) {
>>>>     for (i=0; i<m; i++) {
>>>>       n   = a->i[i+1] - a->i[i];
>>>>       idx = a->j + a->i[i];
>>>>       v   = a->a + a->i[i];
>>>>       sum = b[i];
>>>>       PetscSparseDenseMinusDot(sum,x,v,idx,n);
>>>>       x[i] = (1. - omega)*x[i] + (sum + mdiag[i]*x[i])*idiag[i];
>>>>     }
>>>>     ierr = PetscLogFlops(2.0*a->nz);CHKERRQ(ierr);
>>>>   }
>>>>   if (flag & SOR_BACKWARD_SWEEP || flag & SOR_LOCAL_BACKWARD_SWEEP) {
>>>>     for (i=m-1; i>=0; i--) {
>>>>       n   = a->i[i+1] - a->i[i];
>>>>       idx = a->j + a->i[i];
>>>>       v   = a->a + a->i[i];
>>>>       sum = b[i];
>>>>       PetscSparseDenseMinusDot(sum,x,v,idx,n);
>>>>       x[i] = (1. - omega)*x[i] + (sum + mdiag[i]*x[i])*idiag[i];
>>>>     }
>>>>     ierr = PetscLogFlops(2.0*a->nz);CHKERRQ(ierr);
>>>>   }
>>>>
>>>> (Note also that this partially explains why the flop rate in the table
>>>> for MatSOR() is so good: two of the three ssor triangular solves are
>>>> actually full matrix-vector products.) What it should do is apply the
>>>> entire matrix in the down solve (but save the partial sums of the lower
>>>> triangular part) and then do the up solve applying only 1/2 the matrix,
>>>> using the accumulated values. So currently the combined pre and post
>>>> smooth has 3 complete applications of the matrix (.5 and .5 from the pre
>>>> smooth and 1 and 1 from the post smooth). Adding the optimization will
>>>> reduce it to 2.5 applications of the matrix, giving a reduction of
>>>> 1/6 = 17 percent of the smooth flops, or
>>>>
>>>>>>> 17.*14/33
>>>> 7.212121212121212
>>>>
>>>> a 7 percent decrease in the total KSPSolve flops.
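>>>>
>>>> As a sketch of what the optimized nonzero-initial-guess SSOR could look
>>>> like (illustration only, not the PETSc code; it reuses the variables of
>>>> the fragment above plus a loop index k, a scalar lsum, and a hypothetical
>>>> work array t[], and assumes sorted column indices with the diagonal
>>>> positions in a->diag[]): the down sweep still applies the full row but
>>>> stashes the lower-triangular partial sum, and the up sweep then applies
>>>> only 1/2 the matrix:
>>>>
>>>>   for (i=0; i<m; i++) {                      /* down sweep: full row */
>>>>     sum  = b[i];
>>>>     lsum = 0.0;
>>>>     for (k=a->i[i]; k<a->diag[i]; k++)     lsum -= a->a[k]*x[a->j[k]];
>>>>     for (k=a->diag[i]+1; k<a->i[i+1]; k++) sum  -= a->a[k]*x[a->j[k]];
>>>>     t[i] = lsum;         /* the up sweep needs exactly this lower sum */
>>>>     x[i] = (1. - omega)*x[i] + (sum + lsum)*idiag[i];
>>>>   }
>>>>   for (i=m-1; i>=0; i--) {                /* up sweep: upper half only */
>>>>     sum = b[i] + t[i];                    /* lower part already summed */
>>>>     for (k=a->diag[i]+1; k<a->i[i+1]; k++) sum -= a->a[k]*x[a->j[k]];
>>>>     x[i] = (1. - omega)*x[i] + sum*idiag[i];
>>>>   }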
>>>>
>>>> Next we can eliminate 1/2 of the application of the matrix in the
>>>> residual computation after the smoothing if we have saved the partial
>>>> sums of the upper triangular solve in the smoother. We can estimate the
>>>> flops of the matrix multiply in the residual computation as one
>>>> application of the matrix and hence equal to 1/3 of the flops in the ssor
>>>> computations. Eliminating 1/2 of that means eliminating the equivalent of
>>>> 1/6 of the flops in the MatSOR computation, which is
>>>>
>>>>>>> 14./(33*6)
>>>> 0.0707070707070707 (actually the same computation as above :-(
>>>>
>>>> so another 7 percent of the total flops in the KSPSolve. So in total we
>>>> could save 14 percent of the flops (and lots of memory access) in the
>>>> KSPSolve at the cost of editing a couple of functions.
>>>>
>>>> At that point we could compute the flop rate of all the SSOR triangular
>>>> solves and the percentage of the KSPSolve() time spent in them to
>>>> determine if a new data structure is warranted. Note that in reducing the
>>>> total number of flops we are replacing two full matrix-vector products
>>>> with 2 triangular solves, hence the flop rate will be lower, indicating
>>>> that using a new data structure will actually help more.
>>>>
>>>> Notes on the implementation:
>>>>
>>>> * modifying MatSOR_SeqAIJ() to better handle the nonzero initial guess
>>>> is straightforward; edit one function
>>>>
>>>> * modifying the residual computation in the multigrid to reuse the upper
>>>> triangular sums requires a tiny bit more thought. Basically
>>>> MatSOR_SeqAIJ() would have an internal pointer to the accumulated sums
>>>> for the upper triangular part and we would need to either (1) have a
>>>> MatMult_SeqAIJ() implementation that would retrieve those values and use
>>>> them appropriately (how would it know when the values are valid?) or (2)
>>>> use PCMGSetResidual() to provide a custom residual routine that again
>>>> pulled out the accumulated sums and used them. Of course we will
>>>> eventually figure out how to organize it cleanly for users.
>>>>
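>>>> For option (2), a rough sketch of the residual kernel (hypothetical
>>>> names, illustration only): if the up sweep also stashes
>>>> u[i] = sum_{j>i} a_ij*x_j, then the residual needs only the diagonal and
>>>> the lower triangle. A routine with the PCMG residual calling sequence
>>>> built around this kernel would then be registered per level with
>>>> PCMGSetResidual() (exact calling sequence to be checked against the
>>>> PETSc version at hand).
>>>>
>>>> #include <petscsys.h>   /* PetscInt, PetscScalar */
>>>>
>>>> /* Sketch only: r = b - A x reusing cached strictly-upper sums u[];
>>>>    ai/aj/aa are the CSR arrays and adiag[i] is assumed to hold the
>>>>    position of the diagonal entry of row i. */
>>>> static void ResidualReuseUpper(PetscInt m,const PetscInt *ai,
>>>>                                const PetscInt *aj,const PetscScalar *aa,
>>>>                                const PetscInt *adiag,const PetscScalar *u,
>>>>                                const PetscScalar *b,const PetscScalar *x,
>>>>                                PetscScalar *r)
>>>> {
>>>>   PetscInt i,k;
>>>>   for (i=0; i<m; i++) {
>>>>     PetscScalar sum = b[i] - u[i];                 /* cached U*x part */
>>>>     for (k=ai[i]; k<=adiag[i]; k++) sum -= aa[k]*x[aj[k]];  /* L and D */
>>>>     r[i] = sum;
>>>>   }
>>>> }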
>>>>
>>>> Note also that even if a new custom data structure is introduced the work
>>>> outlined above is not lost, since the new data structure still needs
>>>> those flop reduction optimizations. So step 1, fix MatSOR_SeqAIJ() for
>>>> the nonzero initial guess, get a roughly 6 percent improvement; step 2,
>>>> optimize the residual computation, get another 6 percent improvement;
>>>> step 3, introduce a new data structure and get another roughly 5 percent
>>>> improvement in KSPSolve() (the 6, 6 and 5 are my guesstimates).
>>>>
>>>> I would be very interested in seeing new numbers.
>>>>
>>>> Final note: the very poor load balancing of this problem will limit how
>>>> much these optimizations will help. For a perfectly load balanced problem
>>>> I think the gain from these optimizations would be a lot higher in
>>>> percentage terms (maybe twice as much? more?).
>>>>
>>>>
>>>> Barry
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Aug 17, 2013, at 3:16 PM, "Mark F. Adams" <[email protected]> wrote:
>>>>
>>>>> I would like to get an idea of how much benefit there would be from
>>>>> using a special matrix type for SOR. Here is an experiment on 128 cores
>>>>> of Hopper (Cray XE6), 7 point stencil, with some embedded BCs that look
>>>>> like higher order stencils at the BCs. 32^3 subdomain on each core:
>>>>>
>>>>> cheb/jacobi(2)
>>>>> KSPSolve 15 1.0 5.9800e+00 1.0 1.13e+09 3.2 6.2e+06 1.1e+03 2.8e+02 7 29 67 46 7 26100100100 76 18765
>>>>>
>>>>>
>>>>> rich/eisenstat(2)
>>>>> KSPSolve 15 1.0 1.1563e+01 1.0 1.37e+09 3.4 5.4e+06 1.1e+03 2.8e+02 12 32 66 44 7 38100100100 76 11659
>>>>>
>>>>>
>>>>> rich/sor
>>>>> KSPSolve 15 1.0 4.6355e+00 1.0 7.63e+08 4.5 3.2e+06 1.0e+03 3.1e+02 10 21 57 31 8 33100100100 77 15708
>>>>>
>>>>>
>>>>> Complete log files attached. The "projection" solve is the solver of
>>>>> interest.
>>>>>
>>>>> I have 2 Jacobi so that it has about the same amount of work as one
>>>>> (s)sor. There are voids in the domain, which I believe accounts for the
>>>>> large differences in the number of flops per core. These were run with
>>>>> the same processor group (i.e., all runs done in the same qsub script).
>>>>>
>>>>> This shows only about 15% potential gain. Should we conclude that there
>>>>> is not much to gain from an optimized data structure?
>>>>>
>>>>> Mark
>>>>> <log_eis><log_jac><log_sor>
>>>>>
>>>>>
>>>>> On Aug 16, 2013, at 7:53 PM, Jed Brown <[email protected]> wrote:
>>>>>
>>>>>> "Mark F. Adams" <[email protected]> writes:
>>>>>>> Some hypre papers have shown that cheb/jacobi is faster for some
>>>>>>> problems but for me robustness trumps this for default solver
>>>>>>> parameters in PETSc.
>>>>>>
>>>>>> Richardson/SOR is about the best thing out there in terms of reasonable
>>>>>> local work, low/no setup cost, and reliable convergence. Cheby/Jacobi
>>>>>> just has trivial fine-grained parallelism, but it's not clear that buys
>>>>>> anything on a CPU.
>>>>>>
>>>>>>> Jed's analysis suggests that Eisenstat's method saves almost 50% work
>>>>>>> but needs a specialized matrix to get good flop rates. Something to
>>>>>>> think about doing …
>>>>>>
>>>>>> Mine was too sloppy, Barry got it right. Eisenstat is for Cheby/SOR,
>>>>>> however, and doesn't do anything for Richardson.
>>>>>>
>>>>>> To speed Richardson/SSOR up with a new matrix format, I think we have to
>>>>>> cache the action of the lower triangular part in the forward sweep so
>>>>>> that the back-sweep can use it, and vice-versa. With full caching and
>>>>>> triangular residual optimization, I think this brings 2 SSOR iterations
>>>>>> of the down-smoother plus a residual to 2.5 work units in the
>>>>>> down-smooth (zero initial guess) and 3 work units in the up-smooth
>>>>>> (nonzero initial guess). (This is a strong smoother and frequently, one
>>>>>> SSOR would be enough.)
>>>>>
>>>>
>>>
>>
>