Lately there have been several comments on the list about the
selection of an appropriate solver. This issue is of real
concern to us as well, because we need, if at all possible, to
speed up our FiPy phase field code. It currently takes between
hours and days to equilibrate our system.
One attempt to satisfy this "need for speed" was to use Trilinos,
which was reported by some folks to be significantly faster than
PySparse. This has not been our experience. We reported our
failure to this list about six weeks ago and received some very
helpful comments. (The previous exchange is
attached as "A4-Previous_Correspondence".)
Wheeler in particular suggested several lines of investigation,
and we have tried to follow up on as many as we could. In an
effort to be as thorough as possible, this report has grown to
a rather embarrassing length, which may make it unsuitable for
a general posting -- indeed, may make it inappropriate for any
posting.
Under these circumstances, we hoped that perhaps you would
do us a favor and look through this note and let us know how we
might better present the issue and the materials gathered.
#######################
# 0.0) The problem #
#######################
On our system, pysparse is about 30% faster on every single
problem for which we have tried to use trilinos. Following D.
Wheeler's suggestion to set up a more rigorous test environment,
we have concentrated on just one solver, LinearLUSolver, which is
available for both trilinos and pysparse.
Granted, LinearLU may not be the best solver to use (Wheeler
suggests GMRES). Regardless, if anyone has example code (using
any set of solvers) in which Trilinos solves a problem more
quickly than PySparse, we would love to study it.
We hope this report will
o demonstrate beyond a reasonable doubt that trilinos is indeed
slower than pysparse (on our system)
o encourage others to use the attached test code, or publish their
own efforts in this area, and report on the behaviour of their
FiPy installations, concerning pysparse and trilinos
o solicit ideas for further tests and opinions as to how to improve
the situation
#######################
# 0.1) Conclusions #
#######################
o The trilinos LinearLUSolver is slower than pysparse's on our
  machine, under all circumstances. Many controlled experiments
  with this solver demonstrate this. In no case, even with
  different solvers, has trilinos been faster, though we haven't
  rigorously tested any others to the same extent as this one.
  [Section 1]
o We hope that others will run the test program A1-Xmesh1D.py
and report their experience. [See attached python code, "A1-
Xmesh1D.py"] Additionally, if anyone from the community has
another example program that can serve to test a pair of
pysparse/trilinos solvers, we would be overjoyed to give it
a test drive.
o We don't know the best way to organize the profiler output
in order to best compare the performance between solvers.
Ideas about how better to make use of the profiler would be
most welcome.
o Our very limited understanding of the profiler output makes us
  wonder whether one cause of the slowdown is that the trilinos
  implementation within FiPy incurs significant additional
  overhead before the trilinos solvers themselves are reached.
  [Section 1.3]
#####################################
# 1.0) Demonstrating The Problem #
#####################################
Attachment A1 (A1-Xmesh1D.py) is a modified version of an
example program, taken from the FiPy 2.0.2 release. You will
find the original in
[Your_FiPy_Distribution]/examples/elphf/diffusion/mesh1D.py
(In earlier releases, there is a nearly identical example in
../examples/elphf/diffusion/input1D.py.)
We modified the code to parameterize some of the features that
may affect execution speed and to re-organize things in a way
that would isolate the calls to the solver. We feel confident
that we have changed nothing that affects actual operation of
the solver itself.
####################################################
# 1.1) Pysparse vs. Trilinos: The Nominal Case #
####################################################
You can find some details concerning the source code in section
2). But first, the results. We collected timing data using the
time() method to evaluate the relative performance of the
pysparse and trilinos implementations of the LinearLUSolver, in
as near-to-real-life conditions as possible. [Section 3.0
provides a few justifications for the use of time() as a
performance measurement tool.]
As covered in Section 2, we did go to some lengths, as D.
Wheeler suggested, to make sure this is as meaningful a side-
by-side comparison as possible. For example, we explicitly
set the solver's tolerance and iterations parameters before
calling the equation's solve method. We did *_not_* look
deeper into the code for other adjustments -- that is, we
didn't delve into Sandia's PyTrilinos code to try and fiddle
with any other parameters.
For this particular test, we ran our Xmesh1D test code 100
times, for both pysparse and trilinos. Here is what we found.
(iterations=10; tolerance=1e-10)
Solver      mean exec time   std
================================
trilinos    1.263 sec        0.0028
pysparse    0.930 sec        0.0044
Just for grins, we set the iterations and tolerance to different
values and ran the experiment again.
(iterations=1; tolerance=1e-5)
Solver      mean exec time   std
================================
trilinos    1.070 sec        0.0070
pysparse    0.864 sec        0.0029
(Interestingly, setting the tolerance to 0.1 also converged,
in this particular example.)
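Concretely, the harness amounts to bracketing the solve loop with
time() calls and aggregating mean and standard deviation over the
runs. A minimal sketch of that aggregation (the lambda below is a
stand-in workload, not the real Xmesh1D solve):

```python
import time
from statistics import mean, stdev

def timed_run(solve_once, runs=100):
    """Wall-clock each call to solve_once() and return (mean, std)."""
    samples = []
    for _ in range(runs):
        t0 = time.time()
        solve_once()          # in Xmesh1D this is the equilibrium loop
        samples.append(time.time() - t0)
    return mean(samples), stdev(samples)

# stand-in for the real solver call, just to make the sketch runnable
avg, sd = timed_run(lambda: sum(x * x for x in range(10000)))
```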
######################################################
# 1.2) Pysparse vs. Trilinos: Number Of Elements #
######################################################
One of the parameters to Xmesh1D is the number of mesh points,
hard-wired in the original mesh1D example to be 400. Here are
some time()ing results for experiments with more elements.
num         pysparse    trilinos
elem        time (s)    time (s)
================================
 1000        1.2969      1.6674
 2000        1.8354      2.4659
 3000        2.4140      3.2246
 4000        2.9871      4.0160
 5000        3.5379      4.8778
 6000        4.1537      5.7003
 7000        4.7952      6.4856
 8000        5.4087      7.3058
 9000        6.0327      8.2038
10000        6.7551      9.0214
[All these tests were run with loop=45, iterations=10, and
tolerance=1e-10. We used loop=45 (instead of 40) because the
10,000-element case does not converge with fewer passes through
the equilibrium loop.]
Trilinos is considerably slower in these tests, as it has
been in every test we have conducted, including all the
"uncontrolled", more informal attempts to use it.
####################################################
# 1.3) Pysparse vs. Trilinos: Profiler Results #
####################################################
We collected numerous statistics files from cProfile. How best
to view and interpret these results is a bit of a mystery, but
here are some preliminary attempts. Suggestions for more
meaningful analyses than the ones appearing below would be much
appreciated.
In addition to the call-it-100x and the number-of-elements
tests already mentioned, we also tried varying the number of
times through the "equilibrium loop". This value was changed
not to achieve a "better" solution, but to tease out the
difference between the pysparse and trilinos solvers.
Attached are two files providing details from cProfile for
an experiment in which the equilibrium loop executed 80 times
(rather than the nominal 40). As the document names suggest,
one contains the results of the trilinos test and one the
pysparse test. In both we used similar restrictions so that
the two files can be viewed (literally) side-by-side, to see
the differences. As can be seen in the included cProfile
output, these filters were applied
print_stats('fipy','rilin',8) [the view from the top]
or
print_stats('fipy','sparse',8)
and
print_stats('{',20) [underlying C and FORTRAN routines]
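For readers unfamiliar with these filters: pstats applies each
restriction in turn; a string is treated as a regex matched
against the file:lineno(function) entry, and a trailing integer
caps the number of lines printed. A self-contained illustration,
profiling a throwaway function rather than Xmesh1D:

```python
import cProfile
import io
import pstats

def work():
    # throwaway workload, standing in for the Xmesh1D solve loop
    return sum(i * i for i in range(50000))

profiler = cProfile.Profile()
profiler.enable()
work()
profiler.disable()

stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream).sort_stats('cumulative')
# 'work' is a regex restriction on the function entries; 8 caps line count
stats.print_stats('work', 8)
output = stream.getvalue()
```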
One immediately notices in the "fipy" comparisons that a
considerable amount of time is spent on the trilinos side making
calls that seem to have no analog in pysparse (like
_numpyToTrilinosVector). Also, where the analogous tool is
employed, much more time may be required (addAt, for example).
In the case of the solver itself, twice as much time is spent in
trilinos. Interestingly, trilinos takes less time in a few of
the cases... just not the big-ticket items.
Another feature of these printouts that may be relevant to the
time difference is the number of distinct functions and of
function calls:
            functions   function calls
======================================
trilinos       358         881,970
pysparse       282         708,097
The effect of these additional calls is likely showing up
in the "print_stats('{',20)" display. Note the difference
in time required for "numpy.core.multiarray.array" and
"method 'view'"; a likely explanation being that there
were many more calls to these routines in the trilinos
case, compared with pysparse.
Even knowing that there is no such thing as an "average"
function call, it is difficult to avoid drawing a connection
between the observations that pysparse makes roughly 20% fewer
calls and runs roughly 30% faster.
[It looks like FiPy implements much of its own version of
pysparse: in the profile there are *_no_* direct references to
python code in "site-packages/pysparse"; all the pysparse calls
are to files within fipy -- except of course to the pysparse
libraries "superlu" and "spmatrix".]
As before, we see here as well the difference in the solve
time. If the valid comparison is between pysparse's built-in
method "solve" (0.040) and trilinos's _Amesos.BaseSolver_Solve
(0.376), this is a big difference indeed (even adding in the
0.104 sec for pysparse's factorize).
This table shows the full results for the through-the-loop
test.
loop   Py_time   Tril_time   Py_calls   Tril_calls
--------------------------------------------------
 40    0.9207     1.2606      351377      415970
 60    1.3964     2.0112      529737      648970
 80    1.8653     2.7946      708097      881970
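The excess on the trilinos side can be computed directly from
this table: at each loop count, trilinos issues roughly 18-25%
more calls and takes roughly 37-50% more time.

```python
# (loop, py_time, tril_time, py_calls, tril_calls) from the table above
rows = [
    (40, 0.9207, 1.2606, 351377, 415970),
    (60, 1.3964, 2.0112, 529737, 648970),
    (80, 1.8653, 2.7946, 708097, 881970),
]
call_excess = [tc / pc - 1.0 for _, _, _, pc, tc in rows]  # extra calls
time_excess = [tt / pt - 1.0 for _, pt, tt, _, _ in rows]  # extra wall time
# call excess: ~18%, ~23%, ~25%; time excess: ~37%, ~44%, ~50%
```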
##################################
# 2.0) Some Test Code Details #
##################################
We would very much appreciate feedback from anyone with a
Trilinos/PyTrilinos installation who wishes to test the
performance of their system using Xmesh1D. The code calls only
one solver (LinearLUSolver), but it can pick up either the
pysparse or the trilinos implementation through the use of a
command-line flag, as illustrated a few lines below.
A typical invocation of the test code, Xmesh1D.py, looks like
one of these examples
# loop tol iters elem profile solveMode
./Xmesh1D.py 40 1e-10 10 400 1
or
./Xmesh1D.py 40 1e-10 10 400 1 --Trilinos
# loop tol iters elem profile solveMode
A quick look at the source will probably explain these
parameters best of all, but briefly:
Loop the number of times "solve" is to be called, in order
to "iterate the problem to equilibrium", as the com-
ment in the original says. 40 was hard-wired into
the mesh1D diffusion example and is certainly the
"correct" value for solving the equation.
Tol the tolerance parameter passed along to solve. By
default, this is 1e-10 for both pysparse and trilinos.
Iters the iterations parameter passed along to solve. The
pysparse default is 10; for trilinos it is 5. We set it
explicitly to ensure a meaningful comparison.
Elem the number of mesh points. 400 was hard-wired in the
original code.
Profile whether or not to run the solve loop under the profiler
(we use cProfile in the Xmesh1D.py source). A "0"
means *_not_* to employ the profiler; a "1" causes the
profiler to be used and the output to be captured to a
file whose name is automatically concocted from the
parameters, in a way that will be obvious from the
example below.
SolveMode
by default, FiPy will run against its pysparse
implementation; appending "--Trilinos" causes trilinos
to be used.
Entering the command above should cause the program's routine
output to appear on standard out and the profile statistics
to be written to a file named
mesh1D_TRIL_040_1e-10_010_00400.prf
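For the curious, the name-building can be sketched in a few
lines. The "PY" tag used below for the pysparse case is our
illustrative guess at a counterpart to "TRIL"; check the
Xmesh1D.py source for the exact string it uses.

```python
def profile_name(loop, tol, iters, elem, trilinos):
    """Build a profile filename like mesh1D_TRIL_040_1e-10_010_00400.prf.

    The "PY" tag for the pysparse case is an assumption; the field
    widths follow the example name in the text.
    """
    tag = "TRIL" if trilinos else "PY"
    return "mesh1D_%s_%03d_%s_%03d_%05d.prf" % (tag, loop, tol, iters, elem)
```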
You can of course use "tee" to capture standard out
./Xmesh1D.py 40 1e-10 10 400 1 --Trilinos | tee -a tril.dat
##############################
# 2.1) A Complete Example #
##############################
A bash script like this
#!/usr/bin/env bash
# timing info only
# loop tol iters elem profile solveMode
for i in {1..100}
do
  ./Xmesh1D.py 40 1e-10 10 400 0 | tee -a pX100.dat
  ./Xmesh1D.py 40 1e-10 10 400 0 --Trilinos | tee -a tX100.dat
done
should run the program 100 times, accumulating the results in
the respective .dat files. No profile data is collected. Unix
utilities will then let you extract the times from these output
files. For example,
grep time= pX100.dat | cut -d " " -f 3 | cut -d "=" -f 2 > p100
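If you prefer to stay in Python rather than chain grep and cut,
the same extraction might look like this (assuming, as the
pipeline above does, that the program prints fields of the form
time=<seconds>):

```python
def extract_times(path):
    """Pull the numeric value out of every time= field in a .dat file."""
    times = []
    with open(path) as f:
        for line in f:
            for field in line.split():
                if field.startswith("time="):
                    times.append(float(field.split("=", 1)[1]))
    return times
```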
#############################
# 3.0) Time() Management #
#############################
The use of time() to gain timing information for a routine is,
of course, problematic. But then, so are all the other options.
In our tests we did gather profile statistics, but the profiler
causes its own difficulties of interpretation. We believe that
time(), under some circumstances, and in particular with respect
to the way the Xmesh1D example has been constructed and run,
provides meaningful performance data.
On an otherwise lightly loaded system, time() has proven a very
stable measure of overall Xmesh1D performance. We determined
this in two ways.
1) We time()d the cProfile calls. The reports from cProfile
   for CPU time and the elapsed-time figure from time()
   typically agree to within a few thousandths of a second
   (and were *_always_* within 0.02 sec) for the many
   comparisons we made. In other words, re-arranging the
   example so that the timing loop encloses very little except
   the solver code, and conducting the tests on a 4-processor
   machine when the computer was otherwise unutilized, means
   that the time()ing results directly measure differences in
   performance of the implementations under investigation.
2) time() is usually quite stable. We conducted some time()
   tests hundreds of times and looked at the scatter in the
   results. The results of one such test were reported in 1.1)
   above.
(iterations=10; tolerance=1e-10)
Solver      mean exec time   std
================================
trilinos    1.263 sec        0.0028
pysparse    0.930 sec        0.0044
So, yes, there is an inevitable, if probabilistic, inaccuracy in
time(). But these effects are quite small (considerably less
than 1%, for an effect on the order of 30%), and mitigations
exist. In other words, running the tests multiple times, or
using "larger" examples on a relatively unutilized system,
renders the impact of this inaccuracy negligible.
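For the 100-run test quoted above, the relative scatter works
out directly, and both solvers land well under the 1% figure:

```python
# mean and std from the 100-run test in section 1.1
runs = {"trilinos": (1.263, 0.0028), "pysparse": (0.930, 0.0044)}
# relative scatter std/mean: trilinos ~0.22%, pysparse ~0.47%
scatter = {name: sd / avg for name, (avg, sd) in runs.items()}
```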
<A2-Py_profile.txt><A1-Xmesh1D.py><A3-Tril_profile.txt>