On Aug 28, 2013, at 11:35 AM, "Mark F. Adams" <[email protected]> wrote:
> Barry,
>
> The next step is caching the upper part of the Mat-Vec for residuals.
>
> How should this be engineered?
>
> 1) Should we add an array like ssor_work or hack and reuse it?
It would be ssor_work, but we could perhaps give it a more indicative name,
one that indicates it holds the partial row "sums".
>
> 2) Can we just assume that if there is zero initial guess that we want to
> cache it?
I would just always cache the partial sums when doing SSOR smoothing. I don't
think the cost is that high. So basically all the kernels would have something
like
t[i] = sum;
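For concreteness, a minimal sketch of what the final up-sweep kernel could look
like with that caching; the name t, and the use of a->diag[i] to locate the
diagonal entry of each row, are placeholders of mine, not a decided interface:

  for (i=m-1; i>=0; i--) {
    /* strictly upper part of row i; x[j], j>i, already hold the values they
       will have when the smoother returns */
    n   = a->i[i+1] - (a->diag[i] + 1);
    idx = a->j + a->diag[i] + 1;
    v   = a->a + a->diag[i] + 1;
    sum = b[i];
    PetscSparseDenseMinusDot(sum,x,v,idx,n);   /* sum = b[i] - U_i.x */
    t[i] = sum;                                /* cached for the residual */
    /* lower triangular part plus diagonal of row i */
    n   = a->diag[i] + 1 - a->i[i];
    idx = a->j + a->i[i];
    v   = a->a + a->i[i];
    PetscSparseDenseMinusDot(sum,x,v,idx,n);   /* sum -= (L_i + D_i).x */
    x[i] = (1. - omega)*x[i] + (sum + mdiag[i]*x[i])*idiag[i];
  }

The residual that follows then only has to finish the lower-plus-diagonal half
of each row against t[i] instead of applying the whole matrix.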
>
> 3) How are we going to intercept the residual call? We only use this for MG
> so we could modify the V-cycle to call a special residual method or check the
> matrix to see if has the upper part stashed…
PCMGSetResidual() already exists, so that is how you would provide
the "special residual method".
>
> 4) And I made a pull request for the first step of SOR optimization. I will
> need to build on that … should I remove my pull request and keep working in
> the same branch or can this be pushed (to next?) so that I can create a new
> branch from this?
I'd like to see your stuff merged into next and then you make a new branch
at that point to optimize the residual part.
Barry
>
> Mark
>
> On Aug 27, 2013, at 11:04 PM, Barry Smith <[email protected]> wrote:
>
>>
>> So about a 4.5 percent decrease in time with a 9 percent decrease in flops
>> used. And pretty much on target with what was predicted. Note that if we
>> just counted the decreased number of memory loads (and ignored the worse memory
>> access patterns introduced) we would have gotten the same over-optimistic
>> prediction that we get with the flop reduction. Nice tiny little case study
>> of worse memory access patterns slowing down the improvement. Presumably
>> you'd see a roughly similar decrease with the "improved" residual
>> computation.
>>
>>
>> Thanks
>>
>> Barry
>>
>> A few percent here, a few percent there, pretty soon you're beating hypre
>> boomeramg :-)
>>
>>
>> On Aug 27, 2013, at 9:01 PM, "Mark F. Adams" <[email protected]> wrote:
>>
>>> [new data: I had an error in the old ex54 timings]
>>>
>>> OK, well the blocked version of this optimization was a fun ride. I still
>>> get about 2% variance on timings but here are the best out of a few runs of
>>> each solver with the new optimized code. I do watch the activity monitor,
>>> and these are one-core runs on a four-core machine.
>>>
>>> Scalar (ex54):
>>> KSPSolve 1 1.0 1.3231e+00 1.0 1.40e+09 1.0 0.0e+00 0.0e+00
>>> 5.0e+00 29 72 0 0 3 33 72 0 0 3 1057
>>> 2-vector (ex55)
>>> KSPSolve 1 1.0 2.1459e+00 1.0 3.15e+09 1.0 0.0e+00 0.0e+00
>>> 7.0e+00 50 88 0 0 4 100100 0 0100 1468
>>> 3-vector (ex56)
>>> KSPSolve 1 1.0 9.3956e-01 1.0 1.52e+09 1.0 0.0e+00 0.0e+00
>>> 7.0e+00 26 65 0 0 5 100100 0 0100 1619
>>>
>>> and the old version:
>>> Scalar (ex54):
>>> KSPSolve 1 1.0 1.3830e+00 1.0 1.53e+09 1.0 0.0e+00 0.0e+00
>>> 5.0e+00 29 74 0 0 3 33 74 0 0 3 1103
>>> 2-vector (ex55)
>>> KSPSolve 1 1.0 2.1489e+00 1.0 3.45e+09 1.0 0.0e+00 0.0e+00
>>> 7.0e+00 50 89 0 0 4 100100 0 0100 1606
>>> 3-vector (ex56)
>>> KSPSolve 1 1.0 9.8354e-01 1.0 1.67e+09 1.0 0.0e+00 0.0e+00
>>> 7.0e+00 27 67 0 0 5 100100 0 0100 1700
>>>
>>> Mark
>>>
>>> On Aug 23, 2013, at 3:21 PM, Barry Smith <[email protected]> wrote:
>>>
>>>>
>>>> Run a bit bigger problem. The results shouldn't be that noisy; unless you
>>>> are playing some game in another window :-).
>>>>
>>>> Barry
>>>>
>>>> On Aug 23, 2013, at 11:52 AM, Mark F. Adams <[email protected]> wrote:
>>>>
>>>>> This code looks funny in MatSOR_SeqAIJ:
>>>>>
>>>>> x[i] = (1-omega)*x[i] + sum*idiag[i];
>>>>>
>>>>> shouldn't this be:
>>>>>
>>>>> x[i] = (1-omega)*x[i] + omega*sum*idiag[i];
>>>>>
>>>>> and I've done the first optimization (for the non-blocked version) and
>>>>> get this (running each many times because my Mac is a little noisy). Flop
>>>>> rates go down a bit and total flops go down a lot (perhaps I have a flop
>>>>> counting bug?). 6 old runs, then 7 new ones:
>>>>>
>>>>> 12:40 dummy ~/Codes/petsc/src/ksp/ksp/examples/tutorials$ make -f
>>>>> make-local runex54 | grep MatSOR
>>>>> MatSOR 60 1.0 1.3511e-01 1.0 1.49e+08 1.0 0.0e+00 0.0e+00
>>>>> 0.0e+00 12 34 0 0 0 14 34 0 0 0 1105
>>>>> 12:40 dummy ~/Codes/petsc/src/ksp/ksp/examples/tutorials$ make -f
>>>>> make-local runex54 | grep MatSOR
>>>>> MatSOR 60 1.0 1.3611e-01 1.0 1.49e+08 1.0 0.0e+00 0.0e+00
>>>>> 0.0e+00 12 34 0 0 0 14 34 0 0 0 1097
>>>>> 12:40 dummy ~/Codes/petsc/src/ksp/ksp/examples/tutorials$ make -f
>>>>> make-local runex54 | grep MatSOR
>>>>> MatSOR 60 1.0 1.3464e-01 1.0 1.49e+08 1.0 0.0e+00 0.0e+00
>>>>> 0.0e+00 12 34 0 0 0 14 34 0 0 0 1109
>>>>> 12:40 dummy ~/Codes/petsc/src/ksp/ksp/examples/tutorials$ make -f
>>>>> make-local runex54 | grep MatSOR
>>>>> MatSOR 60 1.0 1.3487e-01 1.0 1.49e+08 1.0 0.0e+00 0.0e+00
>>>>> 0.0e+00 12 34 0 0 0 14 34 0 0 0 1107
>>>>> 12:41 dummy ~/Codes/petsc/src/ksp/ksp/examples/tutorials$ make -f
>>>>> make-local runex54 | grep MatSOR
>>>>> MatSOR 60 1.0 1.3629e-01 1.0 1.49e+08 1.0 0.0e+00 0.0e+00
>>>>> 0.0e+00 12 34 0 0 0 15 34 0 0 0 1095
>>>>> 12:41 dummy ~/Codes/petsc/src/ksp/ksp/examples/tutorials$ make -f
>>>>> make-local runex54 | grep MatSOR
>>>>> MatSOR 60 1.0 1.3492e-01 1.0 1.49e+08 1.0 0.0e+00 0.0e+00
>>>>> 0.0e+00 12 34 0 0 0 14 34 0 0 0 1107
>>>>> 12:41 dummy ~/Codes/petsc/src/ksp/ksp/examples/tutorials$ make -f
>>>>> make-local runex54 | grep MatSOR
>>>>> MatSOR 60 1.0 1.2776e-01 1.0 1.16e+08 1.0 0.0e+00 0.0e+00
>>>>> 0.0e+00 11 29 0 0 0 14 29 0 0 0 910
>>>>> 12:41 madams/sor-opt ~/Codes/petsc/src/ksp/ksp/examples/tutorials$ make
>>>>> -f make-local runex54 | grep MatSOR
>>>>> MatSOR 60 1.0 1.2809e-01 1.0 1.16e+08 1.0 0.0e+00 0.0e+00
>>>>> 0.0e+00 12 29 0 0 0 14 29 0 0 0 908
>>>>> 12:41 madams/sor-opt ~/Codes/petsc/src/ksp/ksp/examples/tutorials$ make
>>>>> -f make-local runex54 | grep MatSOR
>>>>> MatSOR 60 1.0 1.2765e-01 1.0 1.16e+08 1.0 0.0e+00 0.0e+00
>>>>> 0.0e+00 12 29 0 0 0 14 29 0 0 0 911
>>>>> 12:41 madams/sor-opt ~/Codes/petsc/src/ksp/ksp/examples/tutorials$ make
>>>>> -f make-local runex54 | grep MatSOR
>>>>> MatSOR 60 1.0 1.3131e-01 1.0 1.16e+08 1.0 0.0e+00 0.0e+00
>>>>> 0.0e+00 12 29 0 0 0 14 29 0 0 0 886
>>>>> 12:41 madams/sor-opt ~/Codes/petsc/src/ksp/ksp/examples/tutorials$ make
>>>>> -f make-local runex54 | grep MatSOR
>>>>> MatSOR 60 1.0 1.2792e-01 1.0 1.16e+08 1.0 0.0e+00 0.0e+00
>>>>> 0.0e+00 12 29 0 0 0 14 29 0 0 0 909
>>>>> 12:41 madams/sor-opt ~/Codes/petsc/src/ksp/ksp/examples/tutorials$ make
>>>>> -f make-local runex54 | grep MatSOR
>>>>> MatSOR 60 1.0 1.2913e-01 1.0 1.16e+08 1.0 0.0e+00 0.0e+00
>>>>> 0.0e+00 12 29 0 0 0 14 29 0 0 0 901
>>>>> 12:41 madams/sor-opt ~/Codes/petsc/src/ksp/ksp/examples/tutorials$ make
>>>>> -f make-local runex54 | grep MatSOR
>>>>> MatSOR 60 1.0 1.2889e-01 1.0 1.16e+08 1.0 0.0e+00 0.0e+00
>>>>> 0.0e+00 12 29 0 0 0 14 29 0 0 0 902
>>>>>
>>>>>
>>>>> On Aug 20, 2013, at 6:40 PM, Barry Smith <[email protected]> wrote:
>>>>>
>>>>>>
>>>>>> On Aug 20, 2013, at 4:39 PM, Jed Brown <[email protected]> wrote:
>>>>>>
>>>>>>> Hmm, less clear flops benefit. Half of these are nonzero initial guess,
>>>>>>> for which triangular storage would be worse
>>>>>>
>>>>>> Is it clear that triangular storage would be particularly worse? I think
>>>>>> it is only a small percentage worse for a multiply. The problem with CSR for
>>>>>> triangular solves is that for each new row one needs to jump some amount in
>>>>>> the index and double arrays, wasting cache lines and limiting streaming.
>>>>>> With a multiply using triangular storage there is no jumping, just
>>>>>> streaming two different index and double arrays, and current CPUs have no
>>>>>> problem managing 4 streams.
>>>>>>
>>>>>> In other words CSR good for multiply, bad for triangular solve;
>>>>>> triangular storage good for triangular solve and pretty good for
>>>>>> multiply.
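>>>>>>
>>>>>> As a rough sketch of that layout (the struct and names here are
>>>>>> hypothetical, just to show the streaming pattern):
>>>>>>
>>>>>> #include <petscsys.h>
>>>>>>
>>>>>> /* Hypothetical triangular storage: the strictly lower and strictly upper
>>>>>>    parts each kept as their own CSR structure, plus the diagonal. */
>>>>>> typedef struct {
>>>>>>   PetscInt     m;           /* number of rows */
>>>>>>   PetscInt    *li,*lj;      /* row starts / column indices, strictly lower part */
>>>>>>   PetscInt    *ui,*uj;      /* row starts / column indices, strictly upper part */
>>>>>>   PetscScalar *la,*ua,*d;   /* values: lower part, upper part, diagonal */
>>>>>> } TriAIJ;
>>>>>>
>>>>>> /* y = (L + D + U) x: every index and value array is read straight through,
>>>>>>    front to back, so the multiply is 4 forward streams (lj,la,uj,ua) plus x
>>>>>>    and y, with none of the jumping a CSR triangular solve has to do. */
>>>>>> static void TriMult(const TriAIJ *a,const PetscScalar *x,PetscScalar *y)
>>>>>> {
>>>>>>   PetscInt i,k;
>>>>>>   for (i=0; i<a->m; i++) {
>>>>>>     PetscScalar sum = a->d[i]*x[i];
>>>>>>     for (k=a->li[i]; k<a->li[i+1]; k++) sum += a->la[k]*x[a->lj[k]];
>>>>>>     for (k=a->ui[i]; k<a->ui[i+1]; k++) sum += a->ua[k]*x[a->uj[k]];
>>>>>>     y[i] = sum;
>>>>>>   }
>>>>>> }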
>>>>>>
>>>>>> Barry
>>>>>>
>>>>>>
>>>>>>> (unless you add an Eisenstat optimization).
>>>>>>>
>>>>>>> On Aug 20, 2013 4:26 PM, "Mark F. Adams" <[email protected]> wrote:
>>>>>>> Barry,
>>>>>>>
>>>>>>> These are tests using the well load balanced problems in KSP on my Mac:
>>>>>>>
>>>>>>> 3D Vector (ex56)
>>>>>>>
>>>>>>> 8 proc:
>>>>>>>
>>>>>>> rich/ssor(1)
>>>>>>>> KSPSolve 1 1.0 3.0143e-01 1.0 7.98e+07 1.1 1.0e+04
>>>>>>>> 7.2e+02 8.5e+01 19 26 26 16 8 100100100100100 2047
>>>>>>>
>>>>>>> cheb/jac(2)
>>>>>>>> KSPSolve 1 1.0 2.5836e-01 1.0 6.87e+07 1.1 1.4e+04
>>>>>>>> 7.3e+02 8.8e+01 17 25 27 19 8 100100100100100 2053
>>>>>>>
>>>>>>> 1 proc:
>>>>>>>
>>>>>>> rich/ssor(1)
>>>>>>>> KSPSolve 1 1.0 1.8541e-01 1.0 3.20e+08 1.0 0.0e+00
>>>>>>>> 0.0e+00 0.0e+00 10 22 0 0 0 100100 0 0 0 1727
>>>>>>>
>>>>>>> cheb/jac(2)
>>>>>>>> KSPSolve 1 1.0 2.4841e-01 1.0 4.65e+08 1.0 0.0e+00
>>>>>>>> 0.0e+00 0.0e+00 12 24 0 0 0 100100 0 0 0 1870
>>>>>>>
>>>>>>> 2D Scalar (ex54) 4 procs
>>>>>>>
>>>>>>> cheb/jac(2)
>>>>>>>> KSPSolve 1 1.0 2.3614e-01 1.0 3.51e+07 1.0 2.6e+03
>>>>>>>> 8.7e+02 7.8e+02 91100 98 93 95 100100100100100 592
>>>>>>>
>>>>>>> rich/ssor(1)
>>>>>>>> KSPSolve 1 1.0 2.0144e-01 1.0 2.29e+07 1.0 1.8e+03
>>>>>>>> 8.8e+02 7.1e+02 89100 98 90 95 100100100100100 453
>>>>>>>
>>>>>>> I'm not sure if this matters but I wanted to get off of the funny test
>>>>>>> problem that I was using.
>>>>>>>
>>>>>>> Mark
>>>>>>>
>>>>>>> On Aug 17, 2013, at 5:30 PM, Barry Smith <[email protected]> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> Mark,
>>>>>>>>
>>>>>>>> Why rich/eisenstat(2)? With cheb/eisenstat you get the acceleration
>>>>>>>> of Chebyshev on top of the SOR, but rich/eisenstat(2) just seems like some
>>>>>>>> strange implementation of SOR.
>>>>>>>>
>>>>>>>> I am guessing you are getting your 15% potential improvement from
>>>>>>>>
>>>>>>>>>>> 7.63/11.3
>>>>>>>> 0.6752212389380531
>>>>>>>>>>> .6752212389380531*5.9800e+00
>>>>>>>> 4.037823008849558
>>>>>>>>>>> .63/4.63
>>>>>>>> 0.13606911447084233
>>>>>>>>
>>>>>>>> Anyways I would focus on rich/ssor() smoothing.
>>>>>>>>
>>>>>>>> Before thinking about implementing a new data structure I think it is
>>>>>>>> relatively easy to further reduce the total number of flops done with the
>>>>>>>> ssor(1) smoother [and this despite the current conventional wisdom that
>>>>>>>> doing extra flops is a good thing in HPC :-)].
>>>>>>>>
>>>>>>>> Presmoothing: Note that since you are running ssor(1) with a zero initial
>>>>>>>> guess, the current implementation is already good in terms of reducing
>>>>>>>> total flops; it does one down solve (saving the intermediate values)
>>>>>>>> and then one up solve (using those saved values).
>>>>>>>>
>>>>>>>> Postsmoothing: Here it is running SSOR(1) with a nonzero initial guess,
>>>>>>>> and the current implementation in MatSOR_SeqAIJ() is bad because it
>>>>>>>> applies the entire matrix in both the down solve and the up solve.
>>>>>>>>
>>>>>>>> while (its--) {
>>>>>>>> if (flag & SOR_FORWARD_SWEEP || flag & SOR_LOCAL_FORWARD_SWEEP) {
>>>>>>>> for (i=0; i<m; i++) {
>>>>>>>> n = a->i[i+1] - a->i[i];
>>>>>>>> idx = a->j + a->i[i];
>>>>>>>> v = a->a + a->i[i];
>>>>>>>> sum = b[i];
>>>>>>>> PetscSparseDenseMinusDot(sum,x,v,idx,n);
>>>>>>>> x[i] = (1. - omega)*x[i] + (sum + mdiag[i]*x[i])*idiag[i];
>>>>>>>> }
>>>>>>>> ierr = PetscLogFlops(2.0*a->nz);CHKERRQ(ierr);
>>>>>>>> }
>>>>>>>> if (flag & SOR_BACKWARD_SWEEP || flag & SOR_LOCAL_BACKWARD_SWEEP) {
>>>>>>>> for (i=m-1; i>=0; i--) {
>>>>>>>> n = a->i[i+1] - a->i[i];
>>>>>>>> idx = a->j + a->i[i];
>>>>>>>> v = a->a + a->i[i];
>>>>>>>> sum = b[i];
>>>>>>>> PetscSparseDenseMinusDot(sum,x,v,idx,n);
>>>>>>>> x[i] = (1. - omega)*x[i] + (sum + mdiag[i]*x[i])*idiag[i];
>>>>>>>> }
>>>>>>>> ierr = PetscLogFlops(2.0*a->nz);CHKERRQ(ierr);
>>>>>>>> }
>>>>>>>>
>>>>>>>> (Note also that this partially explains why the flop rate in the table
>>>>>>>> for MatSOR() is so good; two of the three ssor triangular solves are
>>>>>>>> actually full matrix-vector products.) What it should do is apply the
>>>>>>>> entire matrix in the down solve (but save the partial sums of the
>>>>>>>> lower triangular part) and then do the upper solve applying
>>>>>>>> only 1/2 the matrix and using the accumulated values. So currently
>>>>>>>> the combined pre and post smooth has 3 complete applications of the
>>>>>>>> matrix (.5 and .5 from the pre smooth and 1 and 1 from the post smooth).
>>>>>>>> Adding the optimization will reduce it to 2.5 applications of the
>>>>>>>> matrix, giving a reduction of 1/6 = 17 percent of the smooth flops, or
>>>>>>>>
>>>>>>>>>>> 17.*14/33
>>>>>>>> 7.212121212121212
>>>>>>>>
>>>>>>>> 7 percent decrease in the total KSPSolve flops
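>>>>>>>>
>>>>>>>> A minimal sketch of that optimized post-smooth kernel, with a work array w
>>>>>>>> and an a->diag[i] that locates the diagonal entry of row i (both assumptions
>>>>>>>> of mine), and with the flop logging left out:
>>>>>>>>
>>>>>>>>   /* forward (down) sweep: apply the whole row, but save the lower-triangular
>>>>>>>>      partial sums so the up sweep does not have to redo them */
>>>>>>>>   for (i=0; i<m; i++) {
>>>>>>>>     n   = a->diag[i] - a->i[i];              /* strictly lower part of row i */
>>>>>>>>     idx = a->j + a->i[i];
>>>>>>>>     v   = a->a + a->i[i];
>>>>>>>>     sum = b[i];
>>>>>>>>     PetscSparseDenseMinusDot(sum,x,v,idx,n); /* sum = b[i] - L_i.x */
>>>>>>>>     w[i] = sum;                              /* x[j], j<i, do not change again before the up sweep reaches row i */
>>>>>>>>     n   = a->i[i+1] - a->diag[i];            /* diagonal plus strictly upper part */
>>>>>>>>     idx = a->j + a->diag[i];
>>>>>>>>     v   = a->a + a->diag[i];
>>>>>>>>     PetscSparseDenseMinusDot(sum,x,v,idx,n);
>>>>>>>>     x[i] = (1. - omega)*x[i] + (sum + mdiag[i]*x[i])*idiag[i];
>>>>>>>>   }
>>>>>>>>   /* backward (up) sweep: only the diagonal plus strictly upper half of each row */
>>>>>>>>   for (i=m-1; i>=0; i--) {
>>>>>>>>     n   = a->i[i+1] - a->diag[i];
>>>>>>>>     idx = a->j + a->diag[i];
>>>>>>>>     v   = a->a + a->diag[i];
>>>>>>>>     sum = w[i];                              /* b[i] - L_i.x, cached above */
>>>>>>>>     PetscSparseDenseMinusDot(sum,x,v,idx,n); /* sum -= (D_i + U_i).x */
>>>>>>>>     x[i] = (1. - omega)*x[i] + (sum + mdiag[i]*x[i])*idiag[i];
>>>>>>>>   }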
>>>>>>>>
>>>>>>>> Next we can eliminate 1/2 of the application of the matrix in the
>>>>>>>> residual computation after the smoothing if we have saved the partial
>>>>>>>> sums of the upper triangular solve in the smoother. We can estimate
>>>>>>>> the flops of the matrix-multiply in the residual computation as one
>>>>>>>> application of the matrix and hence equal to 1/3 of the flops in the
>>>>>>>> ssor computations. Eliminating 1/2 of that means eliminating the
>>>>>>>> equivalent of 1/6 of the flops in the MatSOR computation which is
>>>>>>>>
>>>>>>>>>>> 14./(33*6)
>>>>>>>> 0.0707070707070707 (actually the same computation as above :-(
>>>>>>>>
>>>>>>>> so another 7 percent of the total flops in the KSPSolve. So in total
>>>>>>>> we could save 14 percent of the flops (and lots of memory access) in
>>>>>>>> the KSPSolve at the cost of editing a couple of functions.
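>>>>>>>>
>>>>>>>> A sketch of that residual loop, assuming the final up sweep stashed
>>>>>>>> t[i] = b[i] - U_i.x (U being the strictly upper part, evaluated at the
>>>>>>>> final x) and that a->diag[i] locates the diagonal entry of row i:
>>>>>>>>
>>>>>>>>   for (i=0; i<m; i++) {
>>>>>>>>     n   = a->diag[i] + 1 - a->i[i];          /* lower triangular part plus diagonal */
>>>>>>>>     idx = a->j + a->i[i];
>>>>>>>>     v   = a->a + a->i[i];
>>>>>>>>     sum = t[i];                              /* b[i] - U_i.x, cached by the up sweep */
>>>>>>>>     PetscSparseDenseMinusDot(sum,x,v,idx,n); /* sum -= (L_i + D_i).x */
>>>>>>>>     r[i] = sum;                              /* r = b - A x, touching only half the matrix */
>>>>>>>>   }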
>>>>>>>>
>>>>>>>> At that point we could compute the flop rate of all the SSOR
>>>>>>>> triangular solves and the percentage of the KSPSolve() time spent in
>>>>>>>> them to determine if a new data structure is warranted. Note that
>>>>>>>> in reducing the total number of flops we are replacing two full
>>>>>>>> matrix-vector products with 2 triangular solves, hence the flop rate
>>>>>>>> will be lower, indicating that using a new data structure will actually
>>>>>>>> help more.
>>>>>>>>
>>>>>>>> Notes on the implementation:
>>>>>>>>
>>>>>>>> * modifying MatSOR_SeqAIJ() to better handle the nonzero initial
>>>>>>>> guess is straightforward; edit one function
>>>>>>>>
>>>>>>>> * modifying the residual computation in the multigrid to reuse the
>>>>>>>> upper triangular sums requires a tiny bit more thought. Basically
>>>>>>>> MatSOR_SeqAIJ() would have an internal pointer to the accumulated
>>>>>>>> sums for the upper triangular part and we would need to either (1)
>>>>>>>> have a MatMult_SeqAIJ() implementation that would retrieve those
>>>>>>>> values and use them appropriately (how would it know when the values
>>>>>>>> are valid?) or (2) use the PCMGSetResidual() to provide a custom
>>>>>>>> residual routine that again pulled out the accumulated sums and used
>>>>>>>> them. Of course we will eventually figure out how to organize it
>>>>>>>> cleanly for users.
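>>>>>>>>
>>>>>>>> A sketch of what option (2) could look like; CachedSSORResidual() and
>>>>>>>> Amat_level are placeholders, and the body just marks where the cached-sum
>>>>>>>> version would slot in:
>>>>>>>>
>>>>>>>> #include <petscksp.h>
>>>>>>>>
>>>>>>>> /* Custom residual for levels smoothed with SSOR.  The real version would
>>>>>>>>    pull the accumulated upper-triangular sums out of the matrix (when they
>>>>>>>>    are known to be current for this x) and finish only the lower-plus-
>>>>>>>>    diagonal half of each row instead of calling MatMult(). */
>>>>>>>> static PetscErrorCode CachedSSORResidual(Mat A,Vec b,Vec x,Vec r)
>>>>>>>> {
>>>>>>>>   PetscErrorCode ierr;
>>>>>>>>
>>>>>>>>   PetscFunctionBegin;
>>>>>>>>   ierr = MatMult(A,x,r);CHKERRQ(ierr);    /* placeholder for the half-matrix version */
>>>>>>>>   ierr = VecAYPX(r,-1.0,b);CHKERRQ(ierr); /* r = b - A x */
>>>>>>>>   PetscFunctionReturn(0);
>>>>>>>> }
>>>>>>>>
>>>>>>>> /* registration on level l of the PCMG preconditioner pc */
>>>>>>>> ierr = PCMGSetResidual(pc,l,CachedSSORResidual,Amat_level);CHKERRQ(ierr);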
>>>>>>>>
>>>>>>>>
>>>>>>>> Note also that even if a new custom data structure is introduced, the
>>>>>>>> work outlined above is not lost since the new data structure still
>>>>>>>> needs those flop reduction optimizations. So step 1, fix
>>>>>>>> MatSOR_SeqAIJ() for nonzero initial guess and get a roughly 6 percent
>>>>>>>> improvement; step 2, optimize the residual computation and get another 6
>>>>>>>> percent improvement; step 3, introduce a new data structure and get
>>>>>>>> another roughly 5 percent improvement in KSPSolve() (the 6, 6 and 5
>>>>>>>> are my guesstimates).
>>>>>>>>
>>>>>>>> I would be very interested in seeing new numbers.
>>>>>>>>
>>>>>>>> Final note: the very poor load balancing of this problem will limit how
>>>>>>>> much these optimizations will help. For a perfectly load balanced
>>>>>>>> problem I think these optimizations would be a lot higher in percentage
>>>>>>>> terms (maybe twice as much? more?).
>>>>>>>>
>>>>>>>>
>>>>>>>> Barry
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Aug 17, 2013, at 3:16 PM, "Mark F. Adams" <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> I would like to get an idea of how much benefit there would be with
>>>>>>>>> using a special matrix type for SOR. Here is an experiment on 128
>>>>>>>>> cores of Hopper (Cray XE6), 7-point stencil, with some embedded BCs
>>>>>>>>> that look like higher-order stencils at the BCs. 32^3 subdomain on each
>>>>>>>>> core:
>>>>>>>>>
>>>>>>>>> cheb/jacobi(2)
>>>>>>>>> KSPSolve 15 1.0 5.9800e+00 1.0 1.13e+09 3.2 6.2e+06
>>>>>>>>> 1.1e+03 2.8e+02 7 29 67 46 7 26100100100 76 18765
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> rich/eisenstat(2)
>>>>>>>>> KSPSolve 15 1.0 1.1563e+01 1.0 1.37e+09 3.4 5.4e+06
>>>>>>>>> 1.1e+03 2.8e+02 12 32 66 44 7 38100100100 76 11659
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> rich/sor
>>>>>>>>> KSPSolve 15 1.0 4.6355e+00 1.0 7.63e+08 4.5 3.2e+06
>>>>>>>>> 1.0e+03 3.1e+02 10 21 57 31 8 33100100100 77 15708
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Complete log files attached. The "projection" solve is the solver of
>>>>>>>>> interest.
>>>>>>>>>
>>>>>>>>> I have 2 Jacobi iterations so that it has about the same amount of work
>>>>>>>>> as one (s)sor. There are voids in the domain, which I believe accounts for
>>>>>>>>> the large differences in the number of flops per core. These were
>>>>>>>>> run with the same processor group (i.e., all runs done in the same
>>>>>>>>> qsub script).
>>>>>>>>>
>>>>>>>>> This shows only about 15% potential gain. Should we conclude that
>>>>>>>>> there is not much to gain from an optimized data structure?
>>>>>>>>>
>>>>>>>>> Mark
>>>>>>>>> <log_eis><log_jac><log_sor>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Aug 16, 2013, at 7:53 PM, Jed Brown <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> "Mark F. Adams" <[email protected]> writes:
>>>>>>>>>>> Some hypre papers have shown that cheb/jacobi is faster for some
>>>>>>>>>>> problems but for me robustness trumps this for default solver
>>>>>>>>>>> parameters in PETSc.
>>>>>>>>>>
>>>>>>>>>> Richardson/SOR is about the best thing out there in terms of reasonable
>>>>>>>>>> local work, low/no setup cost, and reliable convergence. Cheby/Jacobi
>>>>>>>>>> just has trivial fine-grained parallelism, but it's not clear that buys
>>>>>>>>>> anything on a CPU.
>>>>>>>>>>
>>>>>>>>>>> Jed's analysis suggests that Eisenstat's method saves almost 50% work
>>>>>>>>>>> but needs a specialized matrix to get good flop rates. Something to
>>>>>>>>>>> think about doing …
>>>>>>>>>>
>>>>>>>>>> Mine was too sloppy, Barry got it right. Eisenstat is for Cheby/SOR,
>>>>>>>>>> however, and doesn't do anything for Richardson.
>>>>>>>>>>
>>>>>>>>>> To speed Richardson/SSOR up with a new matrix format, I think we have to
>>>>>>>>>> cache the action of the lower triangular part in the forward sweep so
>>>>>>>>>> that the back-sweep can use it, and vice-versa. With full caching and
>>>>>>>>>> triangular residual optimization, I think this brings 2 SSOR iterations
>>>>>>>>>> of the down-smoother plus a residual to 2.5 work units in the
>>>>>>>>>> down-smooth (zero initial guess) and 3 work units in the up-smooth
>>>>>>>>>> (nonzero initial guess). (This is a strong smoother and frequently, one
>>>>>>>>>> SSOR would be enough.)
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>