https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79930
--- Comment #11 from Thomas Koenig <tkoenig at gcc dot gnu.org> ---
A couple of points:

First, the slow random number generation. While I do not understand
why using the loop the way you do makes things slower with
optimization, it is _much_ faster to generate random numbers in large
chunks, as in

    call random_number(NU)
    call random_number(NV)

Second, the optimization. With current trunk, you have to add
statements to make sure that the optimizers do not notice you don't
actually use your results :-)

I added

    s_total = 0.0_dp
    ...
    do i = 1, i_max
       tp = TP_SUM(NU(:,i), P(1:4,1:4), NV(:,i))
       s_total = s_total + sum(tp%vec)
    end do
    ...
    print *,s_total

to the test cases so that the tests don't suddenly use zero CPU
seconds.

Third, you really have to look to what you are doing with your
specific test cases, together with LTO and data analysis.

Looking at your test case, your Tensor P is always the same. I don't
know if this is representative of your problem or not. It has a huge
effect on speed, because your routines are completely inlined (and
unrolled) with -flto -Ofast. Not having to reload the data for P makes
things much faster. Compare:

ig25@linux-d6cw:~/Krempel/Tensor> gfortran -march=native -Ofast -fno-inline tp_array_2.f90
ig25@linux-d6cw:~/Krempel/Tensor> ./a.out
 This code variant uses intrinsic arrays to represent the contents of Type(Vect3D).
 Random Numbers, time: 1.41199994
 Using SUM, time: 0.888000011
 Using MATMUL (L), time: 0.812000036
 Using MATMUL (R), time: 0.895999908
 2415021069.9784665
ig25@linux-d6cw:~/Krempel/Tensor> gfortran -march=native -Ofast -flto tp_array_2.f90
ig25@linux-d6cw:~/Krempel/Tensor> ./a.out
 This code variant uses intrinsic arrays to represent the contents of Type(Vect3D).
 Random Numbers, time: 1.41199994
 Using SUM, time: 0.747999907
 Using MATMUL (L), time: 0.132000208
 Using MATMUL (R), time: 0.135999918
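
For reference, the first two points in the comment (filling whole arrays with a single random_number call, and accumulating a result that is printed so the optimizer cannot discard the timed loop) can be combined into a small stand-alone sketch. It is not the reporter's test case: the program name, the array shapes (4 x i_max), the value of i_max and the bilinear form standing in for TP_SUM are all assumptions made for illustration.

```fortran
program chunked_random_demo
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: i_max = 1000000      ! hypothetical problem size
  real(dp), allocatable :: nu(:,:), nv(:,:)  ! assumed shape: 4 x i_max
  real(dp) :: p(4,4), s_total
  real :: t0, t1
  integer :: i

  allocate (nu(4,i_max), nv(4,i_max))

  ! One call per array fills every element at once, instead of
  ! calling random_number inside a loop.
  call cpu_time(t0)
  call random_number(nu)
  call random_number(nv)
  call random_number(p)
  call cpu_time(t1)
  print *, 'Random Numbers, time: ', t1 - t0

  ! Accumulate something from every iteration so the work is used.
  s_total = 0.0_dp
  call cpu_time(t0)
  do i = 1, i_max
     ! Stand-in for TP_SUM (assumption): the scalar nu^T * P * nv.
     s_total = s_total + dot_product(nu(:,i), matmul(p, nv(:,i)))
  end do
  call cpu_time(t1)
  print *, 'Using MATMUL, time: ', t1 - t0

  ! Printing the sum keeps the optimizer from removing the loop above.
  print *, s_total
end program chunked_random_demo
```

Building this once with -Ofast -fno-inline and once with -Ofast -flto, as in the comparison above, should show whether inlining matters for a given compiler version. The timings will not match the ones quoted in the comment, since the kernel here is a simplified stand-in.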