On 14. nov. 2014 16:38, Arne Morten Kvarving wrote:
On 14/11/14 15:33, Jørgen Kvalsvik wrote:
I have an update and a few questions regarding my project. For an
introduction and details, please see
http://www.opm-project.org/pipermail/opm/2014-October/000664.html

This will be another wall-of-text, but please bear with me.

Ok, so I have more or less completed my fallback CSR SparseMatrix
representation, fixed a few bugs, and started testing the
IncompFlowSolverHybrid (through upscale-benchmark-relperm) with Petsc
as a backend. So far I've accomplished the following:

* Performance
An issue that was brought up after the announcement was performance.
Using my SparseMatrixBuilder, now modified to rely on std::map instead
of std::vector, I am able to remove all allocation code from the
IncompFlowSolverHybrid (a sketch of the idea follows below the diff
stat). I consider this a win: it significantly simplifies setting up
sparse matrices and leaks fewer implementation details, since we no
longer have to work around Dune::BCRSMatrix's somewhat clumsy
interface (sorry, Markus!).

$ git log --stat IncompFlowSolverHybrid.hpp
opm/porsol/mimetic/IncompFlowSolverHybrid.hpp
 1 file changed, 79 insertions(+), 462 deletions(-)
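
To illustrate the idea (these names are invented for this mail, the
real SparseMatrixBuilder in my branch looks different): entries are
accumulated in an ordered map keyed on (row, col), so duplicates
coalesce and no nonzero pattern is needed up front; compress() then
emits the flat CSR arrays in one pass.

#include <map>
#include <utility>
#include <vector>

class MapCSRBuilder {
public:
    void add(int row, int col, double value) {
        entries_[std::make_pair(row, col)] += value;
    }

    // rows is the number of matrix rows; ia/ja/sa are the 0-based CSR output.
    void compress(int rows, std::vector<int>& ia,
                  std::vector<int>& ja, std::vector<double>& sa) const {
        ia.assign(rows + 1, 0);
        ja.clear();
        sa.clear();
        typedef std::map<std::pair<int, int>, double>::const_iterator It;
        for (It it = entries_.begin(); it != entries_.end(); ++it) {
            ++ia[it->first.first + 1];      // count entries per row
            ja.push_back(it->first.second); // map order is already CSR order
            sa.push_back(it->second);
        }
        for (int r = 0; r < rows; ++r)
            ia[r + 1] += ia[r];             // prefix sum -> row pointers
    }

private:
    std::map<std::pair<int, int>, double> entries_;
};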

In addition, by feeding this construction method into
opm/core/linalg/LinearSolverIstl (which converts flat-array CSR into
Dune matrices) I am able to reduce the running time of
upscale-benchmark-relperm. I attribute this to more efficient
allocation and instantiation of the matrices, as the BCRSMatrix can
now be built row-wise instead of in the less efficient random mode.
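
For reference, this is roughly what the row-wise build looks like in
Dune-istl terms. It is only an illustration of the build mode, not the
actual conversion code in LinearSolverIstl (the real code uses
different types and block sizes).

#include <dune/common/fmatrix.hh>
#include <dune/istl/bcrsmatrix.hh>

typedef Dune::FieldMatrix<double, 1, 1> Block;
typedef Dune::BCRSMatrix<Block>         Matrix;

// Build a BCRSMatrix from flat CSR arrays (ia, ja, sa) in row_wise mode:
// the sparsity pattern is declared one finished row at a time, so no
// per-entry reallocation happens, unlike the random build mode.
Matrix fromCSR(int rows, int nnz,
               const int* ia, const int* ja, const double* sa)
{
    Matrix A(rows, rows, nnz, Matrix::row_wise);
    for (Matrix::CreateIterator row = A.createbegin();
         row != A.createend(); ++row) {
        for (int k = ia[row.index()]; k < ia[row.index() + 1]; ++k) {
            row.insert(ja[k]);
        }
    }
    // The pattern is now fixed; a second sweep fills in the values.
    for (int r = 0; r < rows; ++r) {
        for (int k = ia[r]; k < ia[r + 1]; ++k) {
            A[r][ja[k]][0][0] = sa[k];
        }
    }
    return A;
}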

Running the benchmark with this new implementation using Dune-istl
with CG/ILU(0) on my Intel [email protected] I get the following output:

Wallclock timing:
Input- and grid processing: 2.65775 sec
Upscaling:                  143.47 sec
Total wallclock time:       146.128 sec  (2 min 26.1279 sec)

Do the numbers look ok? The original, upstream code gives the following:
Wallclock timing:
Input- and grid processing: 2.75897 sec
Upscaling:                  171.677 sec
Total wallclock time:       174.436 sec  (2 min 54.4357 sec)
my machine is a bit beefier but the numbers seem sane:

model name      : Intel(R) Xeon(R) CPU E5-2643 0 @ 3.30GHz
cache size      : 10240 KB

Wallclock timing:
Input- and grid processing: 14.0491 sec
Upscaling:                  97.9011 sec
Total wallclock time:       111.95 sec  (1 min 51.9501 sec)



Note that this code is almost drop-in compatible with Petsc as the
linear solver backend.

* Petsc compatibility
The second thing I've accomplished is running the benchmark with Petsc
as the linear solver. Petsc support has yet to be merged into opm-core
upstream and still has a few issues that need to be resolved before
that can happen, but it is on the way. The benchmark reports correct
results, but I still experience some performance issues, which I hope
will be discussed in this thread. I consider it a win already, simply
because it proves that it is possible to support multiple solvers with
a simple interface, possibly with performance improvements to boot!

Running the same benchmark with CG/ILU on petsc:
Wallclock timing:
Input- and grid processing: 5.40389 sec

why did this double?
Most likely an anomaly. I can run more tests later to confirm.


Upscaling: 445.309 sec
Total wallclock time:       450.713 sec  (7 min 30.7128 sec)

Which brings me to the questions:
Petsc obviously performs a LOT worse than Dune. I ran the benchmark
under callgrind, which revealed that it spends ~48% of its time inside
petsc's PCApply. Another 43% is spent in KSP_MatMult.

first; dune-istl *is* fast. in some of the more challenging stuff i
investigated it was about 30% faster than petsc. but this is far beyond
that.

the numbers do look sane as far as 50% being spent on MxV (~ N^2) and
50% being spent on the preconditioner (substitution phase is also ~ N^2).

it would be interesting to see the convergence history,
-ksp_monitor_true_residual for the petsc one.
I can test with this later.
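
For completeness, the monitor can also be switched on from code if the
binary doesn't forward its command line to petsc. A minimal sketch
(the KSP handle is whatever the backend created; newer petsc releases
add a PetscOptions argument to PetscOptionsSetValue):

#include <petscksp.h>

// Switch on the true-residual monitor for an already configured KSP so
// each iteration prints both the preconditioned and true residual norms.
void enableResidualMonitor(KSP ksp)
{
    PetscOptionsSetValue("-ksp_monitor_true_residual", NULL);
    KSPSetFromOptions(ksp);  // re-read the options database before KSPSolve()
}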

did you do any renumbering for the ilu in either setups?
Not that I'm aware of. For the Dune setup I use the code provided by
opm/core/linalg/LinearSolverIstl. Petsc uses the code from my Petsc
branch https://github.com/jorgekva/opm-core/tree/linsolver_petsc plus
a non-pushed patch that fixes matrix construction. It uses
MatCreateSeqAIJWithArrays for that.
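
For context, the construction boils down to something like this (a
sketch, not the patch itself):

#include <petscmat.h>

// Wrap an existing flat CSR structure in a petsc Mat without copying.
// The arrays must be 0-based and must outlive the Mat, since petsc does
// not take ownership of them.
Mat wrapCSR(PetscInt rows, PetscInt cols,
            PetscInt* ia, PetscInt* ja, PetscScalar* sa)
{
    Mat A;
    MatCreateSeqAIJWithArrays(PETSC_COMM_SELF, rows, cols, ia, ja, sa, &A);
    // No reordering is applied here, which matches the "matrix ordering:
    // natural" line in the -ksp_view output further down.
    return A;
}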



Unfortunately I'm not familiar enough with linear methods to judge
whether this is reasonable or not, so I ask here. Have I configured
petsc wrong, or is this to be expected? Inspecting the callgrind
output of the Dune run leads me to think that this is ok, because it
spends approx. 37% of its time in SeqILU0::apply.

It's worth mentioning that both Dune and Petsc use a comparable number
of iterations - the variance between them is at most on the order of
50 iterations, something that can probably be attributed to

hmm, i'd say 50 iterations is a lot on a properly configured setup. it
does smell slightly like tolerance differences.
This could very well be.


Petsc taking a lot more parameters. They also both produce correct
output. Using -ksp_view in petsc gives:

KSP Object: 1 MPI processes
  type: cg
  maximum iterations=100000
  tolerances:  relative=1e-12, absolute=1e-05, divergence=100000

1e-12 is quite strict. did you use the same tolerance for istl? 1e-5 as
the absolute on the other hand might be a bit high (not the problem here
since it certainly didn't stop iterating).
Yes. This is the tolerance set in upscale-benchmark-relperm.
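
If it helps the comparison, the relevant knobs on the petsc side can be
set in one call. A sketch with hypothetical variable names: istl's CG
takes a single relative reduction, while petsc distinguishes relative,
absolute and divergence tolerances and, per the -ksp_view output, tests
the PRECONDITIONED residual norm by default.

#include <petscksp.h>

// Make the petsc stopping criterion comparable to the istl one.
// 'reduction' is the single relative tolerance istl's CG uses.
void matchTolerances(KSP ksp, double reduction, int maxiter)
{
    KSPSetTolerances(ksp, reduction, PETSC_DEFAULT, PETSC_DEFAULT, maxiter);
    // Optionally test the same (unpreconditioned) norm that istl reports,
    // instead of the PRECONDITIONED norm shown in -ksp_view.
    KSPSetNormType(ksp, KSP_NORM_UNPRECONDITIONED);
}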


  left preconditioning
  using nonzero initial guess

did you knoll? does istl start out with knoll as well?

I'm afraid I don't know what knoll is, so I wouldn't know.
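
(Presumably this refers to petsc's Knoll trick, where the initial guess
is taken as the preconditioner applied to the right-hand side; it is
separate from the "nonzero initial guess" flag shown in the output
above. A sketch of the two related switches, for reference:)

#include <petscksp.h>

// Two separate switches: the "nonzero initial guess" flag seen in the
// -ksp_view output just tells KSP not to zero x before solving, while
// the Knoll trick uses PCApply on the right-hand side as the start vector.
void configureInitialGuess(KSP ksp)
{
    KSPSetInitialGuessNonzero(ksp, PETSC_TRUE); // matches the output above
    KSPSetInitialGuessKnoll(ksp, PETSC_TRUE);   // the Knoll trick (-ksp_knoll)
}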


  using PRECONDITIONED norm type for convergence test
PC Object: 1 MPI processes
  type: ilu
    ILU: out-of-place factorization
    0 levels of fill
    tolerance for zero pivot 2.22045e-14
    using diagonal shift on blocks to prevent zero pivot [INBLOCKS]
    matrix ordering: natural
    factor fill ratio given 1, needed 1
      Factored matrix follows:
        Matrix Object:         1 MPI processes
          type: seqaij
          rows=102697, cols=102697
          package used to perform factorization: petsc
          total: nonzeros=1030113, allocated nonzeros=1030113
          total number of mallocs used during MatSetValues calls =0
            not using I-node routines
  linear system matrix = precond matrix:
  Matrix Object:   1 MPI processes
    type: seqaij
    rows=102697, cols=102697
    total: nonzeros=1030113, allocated nonzeros=0
    total number of mallocs used during MatSetValues calls =0
      not using I-node routines
KSP Iterations 145, Final Residual 9.31856e-06
petsc didn't pick up on the block size. should it have?

I don't think so - should it? My understanding is that the matrix
representation shouldn't be blockwise, just plain CSR.

This seems reasonable for a typical application. Does it look
mis-configured somehow? Or is it just that Dune is THAT much faster?
If so I am -very- impressed.


Alright, next question:
The problem with unifying several solvers is that they all take
different parameters, name options and methods differently, etc. There
are plenty of ways to deal with this, including:

#1: Lowest common denominator. We provide a specific feature set that
we support and say that our implementation only allows specific
computations. This has the benefit of providing a simple interface
that allows substituting solvers with ease. The drawback is obviously
that some configuration opportunities are discarded.

#2: Use a dynamic configuration method (such as ParameterGroup) that
basically forwards options to the solver. The main drawback here is
that the solvers really aren't unified at all, as every call usually
must be special-cased for each solver. On the other hand, it exposes
the full power of the underlying solver.

#3: A hybrid. A well-defined, supported interface and set of
operations, with an "unsafe & unportable" feature that allows direct
configuration (a sketch follows below). This sort of breaks
encapsulation, but if it is documented as unsafe and only used for
"emergencies" then I think it could be fine.
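
To make #3 concrete, a rough sketch of what such an interface could
look like. All names here are invented for illustration - this is not
the existing LinearSolverInterface.

#include <string>

struct SolverReport {
    bool   converged;
    int    iterations;
    double residual;
};

class UnifiedLinearSolver {
public:
    virtual ~UnifiedLinearSolver() {}

    // Option #3, part one: the supported, portable subset.
    virtual void setRelativeTolerance(double tol) = 0;
    virtual void setMaxIterations(int maxiter) = 0;
    virtual SolverReport solve(int size, int nonzeros,
                               const int* ia, const int* ja,
                               const double* sa,
                               const double* rhs, double* solution) = 0;

    // Option #3, part two: the "unsafe & unportable" escape hatch.
    // A petsc backend could push this straight into its options
    // database; an istl backend might ignore it or map what it can.
    virtual void setBackendOption(const std::string& key,
                                  const std::string& value) = 0;
};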

With option #1 or #3, some standardized mechanism for translating
between our option "language" and the target solver's option settings
is needed. LinearSolverInterface currently doesn't support this
directly - it does allow options to be passed through ParameterGroup,
but it is not very well defined exactly what the options should look
like. I personally don't like that, because it is impossible to
verify statically.

The real question is: which solution does the community think is worth
going for? If LinearSolver* is to be used then it does require a
little more work. Personally I prefer solution #3, but I'd love some
community feedback on that.
my opinion: 3.

1 is silly and offers no benefits, 2 is hard on end users, 3 makes every
party happy.
1st, 2nd, 3rd & sold.


in particular, pulling in petsc must be for the advanced functionality.
there is no sense pulling in all that for the basics. then again,
everything in petsc can be done through the command line, so as long
as that interface is intact, that might suffice for that particular
situation.
In its current state it isn't, but this is the longer-term goal.
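
To sketch how the command-line style interface could be kept intact
even when options come from our own configuration (illustrative only;
newer petsc releases prepend a PetscOptions handle to these calls):

#include <petscsys.h>
#include <petscksp.h>

// Forward a petsc-style option string into the options database and let
// the KSP re-read it, so the usual command-line knobs keep working even
// when the options come from elsewhere (e.g. a ParameterGroup).
void forwardPetscOptions(KSP ksp, const char* options)
{
    // e.g. "-ksp_type cg -pc_type ilu -ksp_monitor_true_residual"
    PetscOptionsInsertString(options);
    KSPSetFromOptions(ksp);
}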
