Lately there have been several comments on the list about the
selection of an appropriate solver. This issue is of real
concern to us as well, because we need, if at all possible, to
speed up our FiPy phase field code. It currently takes between
hours and days to equilibrate our system.
One attempt to satisfy this "need for speed" was to use Trilinos,
which was reported by some folks to be significantly faster than
PySparse. This has not been our experience. We reported our
failure to this list about six weeks ago and received some very
helpful comments. (The previous exchange is
attached as "A4-Previous_Correspondence".)
Wheeler in particular suggested several lines of investigation,
and we have tried to follow up on as many as we could. In an
effort to be as thorough as possible, this report has grown to
a rather embarrassing length, which may make it unsuitable for
a general posting -- indeed, may make it inappropriate for any
posting.
Under these circumstances, we hoped that perhaps you would
do us a favor and look through this note and let us know how we
might better present the issue and the materials gathered.
#######################
# 0.0) The problem #
#######################
On our system, pysparse is about 30% faster on every single
problem for which we have tried to use trilinos. Following D.
Wheeler's suggestion to set up a more rigorous test environment,
we have concentrated on just one solver, LinearLUSolver, which is
available for both trilinos and pysparse.
Granted, LinearLU may not be the best solver to use (Wheeler
suggests GMRES). Regardless, if anyone has example code (using
any set of solvers) in which Trilinos solves a problem more
quickly than PySparse, we would love to study it.
We hope this report will
o demonstrate beyond a reasonable doubt that trilinos is indeed
slower than pysparse (on our system)
o encourage others to use the attached test code, or publish their
own efforts in this area, and report on the behaviour of their
FiPy installations, concerning pysparse and trilinos
o solicit ideas for further tests and opinions as to how to improve
the situation
#######################
# 0.1) Conclusions #
#######################
o The trilinos LinearLUSolver is slower than pysparse's on our
  machine, under all circumstances. Many controlled experiments
  with this solver demonstrate this. In no case, even with
  different solvers, has trilinos been faster, though we haven't
  rigorously tested any others to the same extent as this one.
  [Section 1]
o We hope that others will run the test program A1-Xmesh1D.py
and report their experience. [See attached python code, "A1-
Xmesh1D.py"] Additionally, if anyone from the community has
another example program that can serve to test a pair of
pysparse/trilinos solvers, we would be overjoyed to give it
a test drive.
o We don't know the best way to organize the profiler output
in order to best compare the performance between solvers.
Ideas about how better to make use of the profiler would be
most welcome.
o Our very limited understanding of the profiler output makes us
  wonder whether one cause of the slowdown is that the trilinos
  implementation within FiPy incurs significant additional
  overhead before the trilinos solvers themselves are reached.
  [Section 1.3]
#####################################
# 1.0) Demonstrating The Problem #
#####################################
Attachment A1 (A1-Xmesh1D.py) is a modified version of an
example program, taken from the FiPy 2.0.2 release. You will
find the original in
[Your_FiPy_Distribution]/examples/elphf/diffusion/mesh1D.py
(In earlier releases, there is a nearly identical example in
../examples/elphf/diffusion/input1D.py.)
We modified the code to parameterize some of the features that
may affect execution speed and to re-organize things in a way
that would isolate the calls to the solver. We feel confident
that we have changed nothing that affects actual operation of
the solver itself.
####################################################
# 1.1) Pysparse vs. Trilinos: The Nominal Case #
####################################################
You can find some details concerning the source code in section
2). But first, the results. We collected timing data using the
time() method to evaluate the relative performance of the
pysparse and trilinos implementations of the LinearLUSolver, in
as near-to-real-life conditions as possible. [Section 3.0
provides a few justifications for the use of time() as a
performance measurement tool.]
As covered in Section 2, we did go to some lengths, as D.
Wheeler suggested, to make sure this is as meaningful a side-
by-side comparison as possible. For example, we explicitly
set the solver's tolerance and iterations parameters before
calling the equation's solve method. We did *_not_* look
deeper into the code for other adjustments -- that is, we
didn't delve into Sandia's PyTrilinos code to try and fiddle
with any other parameters.
For this particular test, we ran our Xmesh1D test code 100
times, for both pysparse and trilinos. Here is what we found.
(iterations=10; tolerance=1e-10)
Solver      mean exec time   std
================================
trilinos    1.263 sec        0.0028
pysparse    0.930 sec        0.0044
Just for grins, we set the iterations and tolerance to different
values and ran the experiment again.
(iterations=1; tolerance=1e-5)
Solver      mean exec time   std
================================
trilinos    1.070 sec        0.0070
pysparse    0.864 sec        0.0029
(Interestingly, setting the tolerance to 0.1 also converged,
in this particular example.)
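Concretely, the harness amounts to bracketing the solve loop with
time() calls and aggregating mean and standard deviation over the
runs. A minimal sketch of that aggregation (the lambda below is a
stand-in workload, not the real Xmesh1D solve):

```python
import time
from statistics import mean, stdev

def timed_run(solve_once, runs=100):
    """Wall-clock each call to solve_once() and return (mean, std)."""
    samples = []
    for _ in range(runs):
        t0 = time.time()
        solve_once()          # in Xmesh1D this is the equilibrium loop
        samples.append(time.time() - t0)
    return mean(samples), stdev(samples)

# stand-in for the real solver call, just to make the sketch runnable
avg, sd = timed_run(lambda: sum(x * x for x in range(10000)))
```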
######################################################
# 1.2) Pysparse vs. Trilinos: Number Of Elements #
######################################################
One of the parameters to Xmesh1D is the number of mesh points,
hard-wired in the original mesh1D example to be 400. Here are
some time()ing results for experiments with more elements.
num         pysparse    trilinos
elem        time (s)    time (s)
================================
 1000        1.2969      1.6674
 2000        1.8354      2.4659
 3000        2.4140      3.2246
 4000        2.9871      4.0160
 5000        3.5379      4.8778
 6000        4.1537      5.7003
 7000        4.7952      6.4856
 8000        5.4087      7.3058
 9000        6.0327      8.2038
10000        6.7551      9.0214
[All these tests were run with loop=45, iterations=10, and
tolerance=1e-10. We used loop=45 (instead of 40) because the
10,000-element case does not converge with fewer passes through
the equilibrium loop.]
Trilinos is considerably slower in these tests, as it has
been in every test we have conducted, including all the
"uncontrolled", more informal attempts to use it.
####################################################
# 1.3) Pysparse vs. Trilinos: Profiler Results #
####################################################
We collected numerous statistics files from cProfile. How best
to view and interpret these results is a bit of a mystery, but
here are some preliminary attempts. Suggestions for more
meaningful analyses than the ones appearing below would be much
appreciated.
In addition to the call-it-100x and the number-of-elements
tests already mentioned, we also tried varying the number of
times through the "equilibrium loop". This value was changed
not to achieve a "better" solution, but to tease out the
difference between the pysparse and trilinos solvers.
Attached are two files providing details from cProfile for
an experiment in which the equilibrium loop executed 80 times
(rather than the nominal 40). As the document names suggest,
one contains the results of the trilinos test and one the
pysparse test. In both we used similar restrictions so that
the two files can be viewed (literally) side-by-side, to see
the differences. As can be seen in the included cProfile
output, these filters were applied
print_stats('fipy','rilin',8) [the view from the top]
or
print_stats('fipy','sparse',8)
and
print_stats('{',20) [underlying C and FORTRAN routines]
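For readers unfamiliar with these filters: pstats applies each
restriction in turn; a string is treated as a regex matched
against the file:lineno(function) entry, and a trailing integer
caps the number of lines printed. A self-contained illustration,
profiling a throwaway function rather than Xmesh1D:

```python
import cProfile
import io
import pstats

def work():
    # throwaway workload, standing in for the Xmesh1D solve loop
    return sum(i * i for i in range(50000))

profiler = cProfile.Profile()
profiler.enable()
work()
profiler.disable()

stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream).sort_stats('cumulative')
# 'work' is a regex restriction on the function entries; 8 caps line count
stats.print_stats('work', 8)
output = stream.getvalue()
```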
One immediately notices in the "fipy" comparisons that a
considerable amount of time is spent on the trilinos side making
calls that seem to have no analog in pysparse (like
_numpyToTrilinosVector). Also, where the analogous tool is
employed, much more time may be required (addAt, for example).
In the case of the solver itself, twice as much time is spent in
trilinos. Interestingly, trilinos takes less time in a few of
the cases... just not the big-ticket items.
Another feature of these printouts that may be relevant to the
time difference is the number of distinct functions and of
function calls:
            functions   function calls
======================================
trilinos       358         881,970
pysparse       282         708,097
The effect of these additional calls is likely showing up
in the "print_stats('{',20)" display. Note the difference
in time required for "numpy.core.multiarray.array" and
"method 'view'"; a likely explanation being that there
were many more calls to these routines in the trilinos
case, compared with pysparse.
Even knowing that there is no such thing as an "average"
function call, it is difficult to avoid drawing a connection
between the observations that pysparse makes roughly 20% fewer
calls and runs roughly 30% faster.
[It looks like FiPy implements much of its own version of
pysparse: in the profile there are *_no_* direct references to
python code in "site-packages/pysparse"; all the pysparse calls
are to files within fipy -- except of course to the pysparse
libraries "superlu" and "spmatrix".]
As before, we see here as well the difference in the solve
time. If the valid comparison is between pysparse's built-in
method "solve" (0.040) and trilinos's _Amesos.BaseSolver_Solve
(0.376), this is a big difference indeed (even adding in the
0.104 sec for pysparse's factorize).
This table shows the full results for the through-the-loop
test.
loop   Py_time   Tril_time   Py_calls   Tril_calls
--------------------------------------------------
 40    0.9207     1.2606      351377      415970
 60    1.3964     2.0112      529737      648970
 80    1.8653     2.7946      708097      881970
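The excess on the trilinos side can be computed directly from
this table: at each loop count, trilinos issues roughly 18-25%
more calls and takes roughly 37-50% more time.

```python
# (loop, py_time, tril_time, py_calls, tril_calls) from the table above
rows = [
    (40, 0.9207, 1.2606, 351377, 415970),
    (60, 1.3964, 2.0112, 529737, 648970),
    (80, 1.8653, 2.7946, 708097, 881970),
]
call_excess = [tc / pc - 1.0 for _, _, _, pc, tc in rows]  # extra calls
time_excess = [tt / pt - 1.0 for _, pt, tt, _, _ in rows]  # extra wall time
# call excess: ~18%, ~23%, ~25%; time excess: ~37%, ~44%, ~50%
```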
##################################
# 2.0) Some Test Code Details #
##################################
We would very much appreciate feedback from anyone with a
Trilinos/PyTrilinos installation who wishes to test the
performance of their system using Xmesh1D. The code calls only
one solver (LinearLUSolver), but it can pick up either the
pysparse or the trilinos implementation through the use of a
command-line flag, as illustrated a few lines below.
A typical invocation of the test code, Xmesh1D.py, looks like
one of these examples
# loop tol iters elem profile solveMode
./Xmesh1D.py 40 1e-10 10 400 1
or
./Xmesh1D.py 40 1e-10 10 400 1 --Trilinos
# loop tol iters elem profile solveMode
A quick look at the source will probably explain these
parameters best of all, but briefly:
Loop the number of times "solve" is to be called, in order
to "iterate the problem to equilibrium", as the com-
ment in the original says. 40 was hard-wired into
the mesh1D diffusion example and is certainly the
"correct" value for solving the equation.
Tol the tolerance parameter passed along to solve. By
default, this is 1e-10 for both pysparse and trilinos.
Iters the iterations parameter passed along to solve. The
pysparse default is 10; for trilinos it is 5. We set it
explicitly to ensure a meaningful comparison.
Elem the number of mesh points. 400 was hard-wired in the
original code.
Profile whether or not to run the solve loop under the profiler
(we use cProfile in the Xmesh1D.py source). A "0"
means *_not_* to employ the profiler; a "1" causes the
profiler to be used and the output to be captured to a
file whose name is automatically concocted from the
parameters, in a way that will be obvious from the
example below.
SolveMode
by default, FiPy will run against its pysparse
implementation; appending "--Trilinos" causes trilinos
to be used.
Entering the command above should cause the program's routine
output to appear on standard out and the profile statistics
to be written to a file named
mesh1D_TRIL_040_1e-10_010_00400.prf
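For the curious, the name-building can be sketched in a few
lines. The "PY" tag used below for the pysparse case is our
illustrative guess at a counterpart to "TRIL"; check the
Xmesh1D.py source for the exact string it uses.

```python
def profile_name(loop, tol, iters, elem, trilinos):
    """Build a profile filename like mesh1D_TRIL_040_1e-10_010_00400.prf.

    The "PY" tag for the pysparse case is an assumption; the field
    widths follow the example name in the text.
    """
    tag = "TRIL" if trilinos else "PY"
    return "mesh1D_%s_%03d_%s_%03d_%05d.prf" % (tag, loop, tol, iters, elem)
```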
You can of course use "tee" to capture standard out
./Xmesh1D.py 40 1e-10 10 400 1 --Trilinos | tee -a tril.dat
##############################
# 2.1) A Complete Example #
##############################
A bash script like this
#!/usr/bin/env bash
# timing info only
# loop tol iters elem profile solveMode
for i in {1..100}
do
  ./Xmesh1D.py 40 1e-10 10 400 0 | tee -a pX100.dat
  ./Xmesh1D.py 40 1e-10 10 400 0 --Trilinos | tee -a tX100.dat
done
should run the program 100 times, accumulating the results in
the respective .dat files. No profile data is collected. Unix
utilities will then let you extract the times from these output
files. For example,
grep time= pX100.dat | cut -d " " -f 3 | cut -d "=" -f 2 > p100
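If you prefer to stay in Python rather than chain grep and cut,
the same extraction might look like this (assuming, as the
pipeline above does, that the program prints fields of the form
time=<seconds>):

```python
def extract_times(path):
    """Pull the numeric value out of every time= field in a .dat file."""
    times = []
    with open(path) as f:
        for line in f:
            for field in line.split():
                if field.startswith("time="):
                    times.append(float(field.split("=", 1)[1]))
    return times
```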
#############################
# 3.0) Time() Management #
#############################
The use of time() to gain timing information for a routine is,
of course, problematic. But then, so are all the other options.
In our tests we did gather profile statistics, but the profiler
causes its own difficulties of interpretation. We believe that
time(), under some circumstances, and in particular with respect
to the way the Xmesh1D example has been constructed and run,
provides meaningful performance data.
On an otherwise lightly loaded system, time() has proven a very
stable measure of overall Xmesh1D performance. We determined
this in two ways.
1) We time()d the cProfile calls. The reports from cProfile
   for CPU time and the elapsed-time figure from time()
   typically agree to within a few thousandths of a second
   (and were *_always_* within 0.02 sec) for the many
   comparisons we made. In other words, re-arranging the
   example so that the timing loop encloses very little except
   the solver code, and conducting the tests on a 4-processor
   machine when the computer was otherwise unutilized, means
   that the time()ing results directly measure differences in
   performance of the implementations under investigation.
2) time() is usually quite stable. We conducted some time()
   tests hundreds of times and looked at the scatter in the
   results. The results of one such test were reported in 1.1)
   above.
(iterations=10; tolerance=1e-10)
Solver      mean exec time   std
================================
trilinos    1.263 sec        0.0028
pysparse    0.930 sec        0.0044
So, yes, there is an inevitable, if probabilistic, inaccuracy in
time(). But these effects are quite small (considerably less
than 1%, for an effect on the order of 30%), and mitigations
exist. In other words, running the tests multiple times, or
using "larger" examples on a relatively unutilized system,
renders the impact of this inaccuracy negligible.
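For the 100-run test quoted above, the relative scatter works
out directly, and both solvers land well under the 1% figure:

```python
# mean and std from the 100-run test in section 1.1
runs = {"trilinos": (1.263, 0.0028), "pysparse": (0.930, 0.0044)}
# relative scatter std/mean: trilinos ~0.22%, pysparse ~0.47%
scatter = {name: sd / avg for name, (avg, sd) in runs.items()}
```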
<A2-Py_profile.txt><A1-Xmesh1D.py><A3-Tril_profile.txt>