https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88533

--- Comment #3 from Harald Anlauf <anlauf at gmx dot de> ---
(In reply to Thomas Koenig from comment #2)
> Strange.
> 
> I ran the code (with the data) a few times on my Ryzen 7 home
> system.
> 
> Here are some timings (run 10 times):
[snipped]
> Is it possible that the data size of the problem is just at
> the edge of cache size, so that (depending on what else happens
> on the system) it is possible to either get a lot of cache misses
> or not?

The matrix has ~ 250k real elements (~ 1 MB), and the metadata is
roughly the same size, so everything should easily fit into L3,
but probably not into L2.  (My machine reports 6 MB of cache,
presumably L3.)

I do not see significant runtime variations on my system
(I usually averaged over 3 runs and made sure the machine was
otherwise idle), and certainly nothing of the magnitude you found.

I perused Intel's vectorization report (it inlines heavily and
appears to see the actual arguments) and found that the code can be
further optimized by declaring the dummy array arguments of subroutine
csc_times_vector as CONTIGUOUS, i.e. by adding after line 11:

  CONTIGUOUS :: a, ia, ja, x, y

This strongly reduces the runtime, e.g.:

baseline + -funroll-loops + CONTIGUOUS :

7: 0.74
8: 0.77
9: 0.73

(Now *that* is really good!  The same level as PGI on my machine.)

baseline + -funroll-loops -fcheck=bounds + CONTIGUOUS :

7: 1.50
8: 1.36
9: 1.63

(Note the drop in runtime for gcc-8 here.)

baseline + -funroll-loops -fno-tree-ch -fcheck=bounds + CONTIGUOUS :

7: 1.52
8: 1.51
9: 1.52
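
For reference, a minimal sketch of what such a routine might look like
with the attribute added (argument names follow the declaration above;
the interface and loop body are my assumption, the actual code is in
the attached testcase):

```fortran
! Hypothetical sketch of a CSC (compressed sparse column) matrix-vector
! product; only the CONTIGUOUS line is taken from the comment above.
subroutine csc_times_vector (n, a, ia, ja, x, y)
  implicit none
  integer, intent(in)  :: n
  real,    intent(in)  :: a(:)    ! nonzero values
  integer, intent(in)  :: ia(:)   ! row indices of the nonzeros
  integer, intent(in)  :: ja(:)   ! column pointers, size n+1
  real,    intent(in)  :: x(:)
  real,    intent(out) :: y(:)
  contiguous :: a, ia, ja, x, y   ! tells the compiler no strided views
  integer :: j, k
  y = 0.
  do j = 1, n                     ! loop over columns
     do k = ja(j), ja(j+1)-1     ! nonzeros of column j
        y(ia(k)) = y(ia(k)) + a(k) * x(j)
     end do
  end do
end subroutine csc_times_vector
```

With assumed-shape dummies the compiler must otherwise allow for
non-unit strides; CONTIGUOUS lets it assume unit stride and vectorize
the inner loop more aggressively.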
