https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79930

--- Comment #12 from Adam Hirst <adam at aphirst dot karoo.co.uk> ---
Created attachment 40940
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=40940&action=edit
call graph of my "real" application

Thanks Thomas,

My "real" application is of course not using random numbers for the NU and NV,
but I will bear in mind the point about generating large chunks for the future.
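
For the record, the chunked approach is just the array form of the intrinsic; a
minimal sketch (chunk sizes are placeholders, and I'm assuming NU/NV are arrays
of parameter values as in the reduced test case):

    program random_chunk
      implicit none
      integer, parameter :: dp = kind(1.0d0)
      real(dp) :: NU(1000), NV(1000)   ! placeholder chunk sizes

      ! One call per array fills the whole chunk; no per-element loop.
      call random_number(NU)
      call random_number(NV)

      print *, NU(1), NV(1)
    end program random_chunk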

I noticed too that with enough optimisation flags the timed loop was optimised
away entirely, giving an execution time of 0 seconds. I worked around it by
writing all the results into an array, evaluating the second "timing" variable,
and then asking for user input to specify which result(s) to print.
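
A minimal sketch of that workaround (the kernel and array size here are
stand-ins, not my real code): storing every result and printing a user-selected
element keeps the computation observable, so the optimiser can't discard the
timed loop.

    program timing_workaround
      implicit none
      integer, parameter :: dp = kind(1.0d0)
      integer, parameter :: n = 100000
      real(dp) :: results(n), t0, t1
      integer  :: i, idx

      call cpu_time(t0)
      do i = 1, n
         results(i) = sqrt(real(i, dp))   ! stand-in for the real kernel
      end do
      call cpu_time(t1)
      print *, 'elapsed (s):', t1 - t0

      ! Asking which element to print makes the results "live",
      ! so the loop above can't be optimised away.
      read *, idx
      idx = max(1, min(n, idx))
      print *, results(idx)
    end program timing_workaround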

In my "real" application, the Tensor P (or D, whatever I'm calling it this
week) is a 4x4 segment of a larger 'array' of Type(Vector), whose elements keep
varying (they're the control points of a B-Spline surface, and I'm more-or-less
doing shape optimisation on that surface).
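
So in each iteration the routine just gets handed an array section; roughly
like this, with a hypothetical Vector type and made-up dimensions:

    module control_net
      implicit none
      integer, parameter :: dp = kind(1.0d0)

      ! Hypothetical stand-in for the Type(Vector) in my code.
      type :: Vector
         real(dp) :: x, y, z
      end type Vector
    end module control_net

    program segment_demo
      use control_net
      implicit none
      type(Vector) :: CP(10,10)   ! control net of the surface (size made up)
      type(Vector) :: D(4,4)
      integer :: i, j

      CP = Vector(0.0_dp, 0.0_dp, 0.0_dp)
      i = 3; j = 5                ! segment origin; this is what keeps varying
      D = CP(i:i+3, j:j+3)        ! the 4x4 block handed to TensorProduct
      print *, D(1,1)%x
    end program segment_demo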

The whole reason I was looking into this in the first place is that gprof
(along with useful plots by gprof2dot, one of which is attached) consistently
shows that it is this TensorProduct routine which BY FAR dominates the runtime.
So my options are either (i) make it faster, or (ii) call it less often (which
is more a matter of algorithm design, and is a TODO for later investigation).

In any case, switching my TensorProduct routine to the one where the matmul()
and dot_product() are computed separately (though with no further array
temporaries; see one of my earlier comments in this thread, and the sketch
below) yielded the best speed-up in my "real" application. Not as drastic as in
the reduced test case, but still much more than a factor of two faster, whether
building with -O2 or -Ofast -flto.
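
For concreteness, a sketch of the two forms using a plain real 4x4 tensor (the
actual Type(Vector) version applies the same split per component; names here
are illustrative):

    module tensor_demo
      implicit none
      integer, parameter :: dp = kind(1.0d0)
    contains

      ! Fused form: nesting matmul() inside dot_product().
      pure function TP_fused(U, D, V) result(w)
        real(dp), intent(in) :: U(4), V(4), D(4,4)
        real(dp) :: w
        w = dot_product(U, matmul(D, V))
      end function TP_fused

      ! Split form: the matmul() result lands in one named
      ! intermediate, then dot_product() consumes it, with no
      ! further array temporaries beyond t itself.
      pure function TP_split(U, D, V) result(w)
        real(dp), intent(in) :: U(4), V(4), D(4,4)
        real(dp) :: t(4)
        real(dp) :: w
        t = matmul(D, V)
        w = dot_product(U, t)
      end function TP_split

    end module tensor_demo

The split variant is the one that gave the speed-up above; the only change is
that the matmul() result gets a name instead of living inside the
dot_product() argument.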
