Re: [Libmesh-users] small doc and efficiency update for ex 3 and 4

edgar Fri, 18 Jun 2021 19:54:56 -0700

On 2021-06-18 21:45, John Peterson wrote:

On Thu, Jun 10, 2021 at 5:55 PM edgar <edgar...@cryptolab.net> wrote:

On 2021-06-10 19:27, John Peterson wrote:
> I recorded the "Active time" for the "Matrix Assembly Performance"
> PerfLog in introduction_ex4 running "./example-opt -d 3 -n 40" for both
> the original codepath and your proposed change, averaging the results
> over 5 runs. The results were:
>
> Original code, "./example-opt -d 3 -n 40"
> import numpy as np
> np.mean([3.91801, 3.93206, 3.94358, 3.97729, 3.90512]) = 3.93
>
> Patch, "./example-opt -d 3 -n 40"
> import numpy as np
> np.mean([4.10462, 4.06232, 3.95176, 3.92786, 3.97992]) = 4.00
>
> so I'd say the original code path is marginally (but still
> statistically significantly) faster, although keep in mind that matrix
> assembly is only about 21% of the total time for this example while the
> solve is about 71%.
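
For anyone who wants to re-check the averages quoted above and get a rough feel for how large the difference is relative to the run-to-run scatter, here is a minimal sketch using only the Python standard library. The variable and function names are made up for illustration, the unit (seconds) is assumed, and the choice of Welch's t statistic is an assumption, not necessarily the test used in the quoted message.

```python
from math import sqrt
from statistics import mean, variance

# "Active time" values for "Matrix Assembly Performance" as quoted above
# (unit assumed to be seconds).
original = [3.91801, 3.93206, 3.94358, 3.97729, 3.90512]
patch = [4.10462, 4.06232, 3.95176, 3.92786, 3.97992]


def welch_t(a, b):
    """Welch's t statistic for two samples with possibly unequal variances."""
    # Standard error of the difference of the two sample means.
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(b) - mean(a)) / se


mean_original = mean(original)
mean_patch = mean(patch)
t_stat = welch_t(original, patch)

print(f"original: {mean_original:.4f}")
print(f"patch:    {mean_patch:.4f}")
print(f"Welch t:  {t_stat:.3f}")
```

The statistic still has to be compared against a t distribution with the Welch-Satterthwaite degrees of freedom before drawing any conclusion; with only 5 runs per case, more samples would probably help.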

Super interesting. I am sending you my benchmarks. I must say that I had
initially run only 2 benchmarks, and both came out faster with the
modifications. Now, I found that
- The original code is more efficient with `-n 40'
- The modified code is more efficient with `-n 15' and `mpirun -np 4'
- I ran the 5-test trial several times, and sometimes the original
  code was more efficient with `-n 15', but the first and second runs with
  the modified code were always faster (my computer heating up?)

The gains are really marginal in any case. It would be interesting to
run with -O3... (I just did [1]).

It seems that the differences are now a little bit more substantial, and
that the modified code would be faster. I hope not to have made any
mistakes.

The code and the benchmarks are in the attached file.
- examples
  |- introduction
     |- ex4                          (original code)
        |- output_*_.txt.bz2         (running -n 40 with -O2)
        |- output_15_*_.txt.bz2      (running -n 15 with -O2)
        |- output_40_O3_*_.txt.bz2   (running -n 40 with -O3)
     |- ex4_mod                      (modified code)
        |- output_*_.txt.bz2         (running -n 40 with -O2)
        |- output_15_*_.txt.bz2      (running -n 15 with -O2)
        |- output_40_O3_*_.txt.bz2   (running -n 40 with -O3)


[1] I manually compiled like this (added -O3 instead of -O2; disregard
the CCFLAGS et al):

     $ mpicxx -std=gnu++17 -DNDEBUG -march=amdfam10 -O3

Your compiler flags are definitely far more advanced/aggressive than mine,
which are just on the default of -O2. However, I think what we should
conclude from your results is that there is something slower than it needs
to be with DenseMatrix::resize(), not that we should move the DenseMatrix
creation/destruction inside the loop over elements. What I tried (see
attached patch or the "dense_matrix_resize_no_virtual" branch in my fork)
is avoiding the virtual function call to DenseMatrix::zero() which is
currently made from DenseMatrix::resize(). In my testing, this change did
not seem to make much of a difference but I'm curious about what you would
get with your compiler args, this patch, and the unpatched ex4.

I will surely test it. I will have more time next week. Sorry for the delay.



_______________________________________________
Libmesh-users mailing list
Libmesh-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/libmesh-users
