https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97343
Bug ID: 97343
Summary: AVX2 vectorizer generates extremely strange and slow
code for AoSoA complex dot product
Product: gcc
Version: 10.2.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: already5chosen at yahoo dot com
Target Milestone: ---
Let's continue our complex dot product series started here
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96854
This time I have no code generation bugs for your pleasure, just "interesting"
optimization issues.
All examples below, unless stated otherwise, were compiled with gcc 10.2
for x86-64 with the following sets of flags:
set1: -Wall -mavx2 -mfma -march=skylake -O3 -ffast-math -fno-associative-math
set2: -Wall -mavx2 -mfma -march=skylake -O3 -ffast-math
The only difference is -fno-associative-math in set1, which forbids
reassociating the floating-point reductions.
The kernel in question is an example of a complex dot product in the
so-called "hybrid AoS" layout, a.k.a. AoSoA:
https://en.wikipedia.org/wiki/AoS_and_SoA#Array_of_Structures_of_Arrays
In my experience, in dense complex linear algebra and similar
computational fields, it is quite rare for AoSoA *not* to be the optimal
internal form. So, practically, I consider these kernels more important
than the AoS kernel presented in bug 96854.
More specifically, the layout can be described as
struct { double re[4], im[4]; };
and each kernel computes the conjugate dot product, i.e.
res = sum over c,k of a*conj(b). But for the sake of simplicity I
omitted the type definition from the code examples and coded it directly
over flat arrays of doubles.
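For reference, here is a minimal sketch of the omitted type (the name is
mine, for illustration only) and how the flat indexing below maps onto
it:

typedef struct { double re[4], im[4]; } cplx4;  /* one AoSoA block */

/* a[c*8+k+0] is ((const cplx4*)a)[c].re[k],
   a[c*8+k+4] is ((const cplx4*)a)[c].im[k] */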
Part 1.
void cdot(double* res, const double* a, const double* b, int N)
{
  double acc_re = 0;
  double acc_im = 0;
  for (int c = 0; c < N; ++c) {
    for (int k = 0; k < 4; ++k) {
      acc_re = acc_re + a[c*8+k+0]*b[c*8+k+0] + a[c*8+k+4]*b[c*8+k+4];
      acc_im = acc_im - a[c*8+k+0]*b[c*8+k+4] + a[c*8+k+4]*b[c*8+k+0];
    }
  }
  res[0] = acc_re;
  res[4] = acc_im;
}
That's how we would want to code it in an ideal world, letting the
compiler take care of the dirty details.
In the less ideal world we live in, gcc is not the only compiler that
can't cope with it. MSVC (-W4 -O2 -fp:fast -arch:AVX2) also can't
vectorize it. Even mighty icc generates code that is not quite bad, but
somewhat suboptimal.
So, let's let it pass. I don't want to blame gcc for not being smart
enough. That's just normal.
Except that with set2 the code generated by gcc becomes not just
non-smart, but quite crazy. I am ignoring it in the hope that it will be
magically fixed by the change made by Richard Biener on 2020-08-31.
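For comparison, here is a hand-vectorized sketch of roughly what I'd
want to see (my own approximation under -ffast-math assumptions, not any
compiler's actual output; the function name is mine):

#include <immintrin.h>

void cdot_by_hand(double* res, const double* a, const double* b, int N)
{
  __m256d acc_re = _mm256_setzero_pd();
  __m256d acc_im = _mm256_setzero_pd();
  for (int c = 0; c < N; ++c) {
    __m256d a_re = _mm256_loadu_pd(&a[c*8+0]);  /* re[0..3] */
    __m256d a_im = _mm256_loadu_pd(&a[c*8+4]);  /* im[0..3] */
    __m256d b_re = _mm256_loadu_pd(&b[c*8+0]);
    __m256d b_im = _mm256_loadu_pd(&b[c*8+4]);
    acc_re = _mm256_fmadd_pd (a_re, b_re, acc_re);  /* += re*re */
    acc_re = _mm256_fmadd_pd (a_im, b_im, acc_re);  /* += im*im */
    acc_im = _mm256_fmadd_pd (a_im, b_re, acc_im);  /* += im*re */
    acc_im = _mm256_fnmadd_pd(a_re, b_im, acc_im);  /* -= re*im */
  }
  /* horizontal sums, done once after the loop */
  __m128d re2 = _mm_add_pd(_mm256_castpd256_pd128(acc_re),
                           _mm256_extractf128_pd(acc_re, 1));
  __m128d im2 = _mm_add_pd(_mm256_castpd256_pd128(acc_im),
                           _mm256_extractf128_pd(acc_im, 1));
  res[0] = _mm_cvtsd_f64(_mm_add_sd(re2, _mm_unpackhi_pd(re2, re2)));
  res[4] = _mm_cvtsd_f64(_mm_add_sd(im2, _mm_unpackhi_pd(im2, im2)));
}

Note that the loop body needs no shuffles at all: the k lanes map
straight onto SIMD lanes.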
Part 2.
void cdot(double* res, const double* a, const double* b, int N)
{
  double acc_rere = 0;
  double acc_imim = 0;
  double acc_reim = 0;
  double acc_imre = 0;
  for (int c = 0; c < N; ++c) {
    for (int k = 0; k < 4; ++k) {
      acc_rere += a[c*8+k+0]*b[c*8+k+0];
      acc_imim += a[c*8+k+4]*b[c*8+k+4];
      acc_reim += a[c*8+k+0]*b[c*8+k+4];
      acc_imre += a[c*8+k+4]*b[c*8+k+0];
    }
  }
  res[0] = acc_rere+acc_imim;
  res[4] = acc_imre-acc_reim;
}
Here we are explaining it to the compiler slowly.
For icc and MSVC that's enough; they understood.
icc generates near-perfect code. I could write it more nicely, but I
would not expect my variant to be any faster.
MSVC generates a near-perfect inner loop and an epilogue that is not
great, but not really much slower.
gcc still doesn't get it. It still implements the 4 accumulators
literally, as if -ffast-math were not there.
But, sad as it is, that's still a case of not being smart enough. So, I
am not complaining.
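For reference, the shape I would expect here, written out by hand (again
a sketch of mine under the same -ffast-math assumption, not icc's
literal output):

#include <immintrin.h>

void cdot2_by_hand(double* res, const double* a, const double* b, int N)
{
  __m256d rere = _mm256_setzero_pd(), imim = _mm256_setzero_pd();
  __m256d reim = _mm256_setzero_pd(), imre = _mm256_setzero_pd();
  for (int c = 0; c < N; ++c) {
    __m256d a_re = _mm256_loadu_pd(&a[c*8+0]);
    __m256d a_im = _mm256_loadu_pd(&a[c*8+4]);
    __m256d b_re = _mm256_loadu_pd(&b[c*8+0]);
    __m256d b_im = _mm256_loadu_pd(&b[c*8+4]);
    rere = _mm256_fmadd_pd(a_re, b_re, rere);
    imim = _mm256_fmadd_pd(a_im, b_im, imim);
    reim = _mm256_fmadd_pd(a_re, b_im, reim);
    imre = _mm256_fmadd_pd(a_im, b_re, imre);
  }
  /* combine once, then reduce each accumulator to a scalar */
  double re4[4], im4[4];
  _mm256_storeu_pd(re4, _mm256_add_pd(rere, imim));
  _mm256_storeu_pd(im4, _mm256_sub_pd(imre, reim));
  res[0] = re4[0] + re4[1] + re4[2] + re4[3];
  res[4] = im4[0] + im4[1] + im4[2] + im4[3];
}

The four independent accumulators also help hide FMA latency, which is
part of why this formulation is friendlier to a vectorizer.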
Part 3.
static inline double sum4(double x[]) {
  return x[0]+x[1]+x[2]+x[3];
}
void cdot(double* res, const double* a, const double* b, int N)
{
  double acc_re[4] = {0};
  double acc_im[4] = {0};
  for (int c = 0; c < N; ++c) {
    for (int k = 0; k < 4; ++k) {
      acc_re[k] = acc_re[k] + a[c*8+k+0]*b[c*8+k+0] + a[c*8+k+4]*b[c*8+k+4];
      acc_im[k] = acc_im[k] - a[c*8+k+0]*b[c*8+k+4] + a[c*8+k+4]*b[c*8+k+0];
    }
  }
  res[0] = sum4(acc_re);
  res[4] = sum4(acc_im);
}
An attempt to feed the compiler by teaspoon. That's not the way I want
to write code in an HLL.
icc copes, producing about the same code as in Part 1.
MSVC doesn't understand this Kunststück (I am sympathetic) and generates
literal scalar code with local arrays on the stack.
gcc with set1 is a little better than MSVC: the code is fully scalar,
but at least the accumulators are kept in registers.
gcc with set2 is the most interesting. It vectorizes, but how? Here is
the inner loop:
.L3:
vpermpd $27, (%r8,%rax), %ymm2
vpermpd $27, 32(%rdx,%rax), %ymm3
vpermpd $27, (%rdx,%rax), %ymm1
vpermpd $27, 32(%r8,%rax), %ymm0
vmulpd %ymm2, %ymm1, %ymm6
vmulpd %ymm2, %ymm3, %ymm2
addq $64, %rax
vfnmadd132pd %ymm0, %ymm2, %ymm1
vfmadd132pd %ymm3, %ymm6, %ymm0
vaddpd %ymm1, %ymm5, %ymm5
vaddpd %ymm0, %ymm4, %ymm4
cmpq %rcx, %rax
jne .L3
What is all this vpermpd business about? Shuffling SIMD lanes around
just because it's funny? ($27 = 0b00011011 reverses the four lanes, so
every input vector is loaded back-to-front; since all four inputs are
reversed consistently, the final result is unchanged and the shuffles
are pure overhead.)
That's the first thing I want to complain about. Not "not smart enough",
but too smart for its own good.
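For what it's worth, the lanes already line up with memory. The same
kernel written with GCC's generic vector extensions (my sketch, not
something from the original report) needs no lane shuffling at all:

typedef double v4df __attribute__((vector_size(32)));

void cdot_v4df(double* res, const double* a, const double* b, int N)
{
  v4df acc_re = {0}, acc_im = {0};
  for (int c = 0; c < N; ++c) {
    v4df a_re, a_im, b_re, b_im;
    /* unaligned vector loads of re[0..3] and im[0..3] */
    __builtin_memcpy(&a_re, &a[c*8+0], sizeof a_re);
    __builtin_memcpy(&a_im, &a[c*8+4], sizeof a_im);
    __builtin_memcpy(&b_re, &b[c*8+0], sizeof b_re);
    __builtin_memcpy(&b_im, &b[c*8+4], sizeof b_im);
    acc_re += a_re*b_re + a_im*b_im;
    acc_im += a_im*b_re - a_re*b_im;
  }
  res[0] = acc_re[0] + acc_re[1] + acc_re[2] + acc_re[3];
  res[4] = acc_im[0] + acc_im[1] + acc_im[2] + acc_im[3];
}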
And finally
Part 4.
static inline double sum4(double x[]) {
  return x[0]+x[1]+x[2]+x[3];
}
void cdot(double* res, const double* a, const double* b, int N)
{
  double acc_rere[4] = {0};
  double acc_imim[4] = {0};
  double acc_reim[4] = {0};
  double acc_imre[4] = {0};
  for (int c = 0; c < N; ++c) {
    for (int k = 0; k < 4; ++k) {
      acc_rere[k] += a[c*8+k+0]*b[c*8+k+0];
      acc_imim[k] += a[c*8+k+4]*b[c*8+k+4];
      acc_reim[k] += a[c*8+k+0]*b[c*8+k+4];
      acc_imre[k] += a[c*8+k+4]*b[c*8+k+0];
    }
  }
  double acc_re[4];
  double acc_im[4];
  for (int k = 0; k < 4; ++k) {
    acc_re[k] = acc_rere[k]+acc_imim[k];
    acc_im[k] = acc_imre[k]-acc_reim[k];
  }
  res[0] = sum4(acc_re);
  res[4] = sum4(acc_im);
}
Not just fed by teaspoon, but with the compiler's mouth held open
manually, so to speak.
icc, of course, understands and generates pretty much the same good code
as in Part 2.
MSVC, of course, does not understand and generates arrays on the stack.
gcc with set2, of course, continues to enjoy the juggling, doubling or
tripling it vs last time: now it's a dozen shuffle instructions per
iteration instead of four.
Inner loop:
.L3:
vmovupd (%rdx,%rax), %ymm1
vmovupd 32(%rdx,%rax), %ymm0
vmovupd 32(%r8,%rax), %ymm5
vperm2f128 $49, %ymm1, %ymm0, %ymm3
vinsertf128 $1, %xmm1, %ymm0, %ymm0
vpermpd $221, %ymm0, %ymm10
vpermpd $136, %ymm0, %ymm1
vmovupd (%r8,%rax), %ymm0
vpermpd $136, %ymm3, %ymm9
vperm2f128 $49, %ymm5, %ymm0, %ymm2
vinsertf128 $1, %xmm5, %ymm0, %ymm0
vpermpd $40, %ymm2, %ymm11
vpermpd $125, %ymm0, %ymm5
vpermpd $221, %ymm3, %ymm3
vpermpd $40, %ymm0, %ymm0
vpermpd $125, %ymm2, %ymm2
addq $64, %rax
vfmadd231pd %ymm2, %ymm3, %ymm8
vfmadd231pd %ymm11, %ymm9, %ymm6
vfmadd231pd %ymm5, %ymm10, %ymm7
vfmadd231pd %ymm0, %ymm1, %ymm4
cmpq %rax, %rcx
jne .L3
But this time gcc with set1 was the real star of the show. My only
reaction is "What?"
.L4:
vmovupd 0(%r13), %ymm5
vmovupd 64(%r13), %ymm7
vmovupd 192(%r13), %ymm4
vmovupd 128(%r13), %ymm6
vunpcklpd 32(%r13), %ymm5, %ymm13
vunpckhpd 32(%r13), %ymm5, %ymm12
vunpckhpd 96(%r13), %ymm7, %ymm1
vunpcklpd 96(%r13), %ymm7, %ymm5
vmovupd 128(%r13), %ymm7
vunpcklpd 224(%r13), %ymm4, %ymm2
vunpcklpd 160(%r13), %ymm6, %ymm6
vunpckhpd 160(%r13), %ymm7, %ymm11
vunpckhpd 224(%r13), %ymm4, %ymm0
vpermpd $216, %ymm13, %ymm13
vpermpd $216, %ymm6, %ymm6
vpermpd $216, %ymm2, %ymm2
vpermpd $216, %ymm5, %ymm5
vunpcklpd %ymm2, %ymm6, %ymm4
vpermpd $216, %ymm1, %ymm1
vpermpd $216, %ymm11, %ymm11
vunpcklpd %ymm5, %ymm13, %ymm9
vpermpd $216, %ymm12, %ymm12
vpermpd $216, %ymm0, %ymm0
vpermpd $216, %ymm4, %ymm3
vpermpd $216, %ymm9, %ymm9
vunpckhpd %ymm2, %ymm6, %ymm4
vunpckhpd %ymm5, %ymm13, %ymm5
vunpcklpd %ymm1, %ymm12, %ymm6
vunpcklpd %ymm0, %ymm11, %ymm2
vunpckhpd %ymm1, %ymm12, %ymm12
vunpckhpd %ymm0, %ymm11, %ymm0
vpermpd $216, %ymm12, %ymm1
vunpcklpd %ymm3, %ymm9, %ymm11
vpermpd $216, %ymm5, %ymm5
vpermpd $216, %ymm4, %ymm4
vpermpd $216, %ymm0, %ymm0
vmovupd 64(%r12), %ymm15
vpermpd $216, %ymm6, %ymm6
vpermpd $216, %ymm11, %ymm8
vpermpd $216, %ymm2, %ymm2
vunpcklpd %ymm4, %ymm5, %ymm11
vunpckhpd %ymm3, %ymm9, %ymm9
vunpckhpd %ymm4, %ymm5, %ymm4
vunpcklpd %ymm0, %ymm1, %ymm5
vpermpd $216, %ymm9, %ymm3
vunpcklpd %ymm2, %ymm6, %ymm9
vunpckhpd %ymm2, %ymm6, %ymm2
vpermpd $216, %ymm5, %ymm6
vunpcklpd 96(%r12), %ymm15, %ymm12
vunpckhpd %ymm0, %ymm1, %ymm0
vmovupd %ymm6, 64(%rsp)
vunpckhpd 96(%r12), %ymm15, %ymm6
vmovupd 128(%r12), %ymm15
vpermpd $216, %ymm0, %ymm5
vpermpd $216, %ymm9, %ymm7
vmovupd (%r12), %ymm0
vunpckhpd 160(%r12), %ymm15, %ymm9
vmovupd %ymm5, 96(%rsp)
vunpcklpd 160(%r12), %ymm15, %ymm5
vmovupd 192(%r12), %ymm15
vunpcklpd 32(%r12), %ymm0, %ymm1
vpermpd $216, %ymm9, %ymm14
vunpcklpd 224(%r12), %ymm15, %ymm9
vunpckhpd 224(%r12), %ymm15, %ymm13
vunpckhpd 32(%r12), %ymm0, %ymm0
vpermpd $216, %ymm12, %ymm12
vpermpd $216, %ymm9, %ymm9
vpermpd $216, %ymm1, %ymm1
vpermpd $216, %ymm5, %ymm5
vpermpd $216, %ymm6, %ymm6
vunpcklpd %ymm12, %ymm1, %ymm10
vpermpd $216, %ymm0, %ymm0
vpermpd $216, %ymm13, %ymm13
vunpckhpd %ymm12, %ymm1, %ymm1
vunpcklpd %ymm9, %ymm5, %ymm12
vpermpd $216, %ymm12, %ymm12
vpermpd $216, %ymm10, %ymm10
vunpckhpd %ymm9, %ymm5, %ymm5
vunpcklpd %ymm6, %ymm0, %ymm9
vunpckhpd %ymm6, %ymm0, %ymm0
vunpcklpd %ymm13, %ymm14, %ymm6
vunpckhpd %ymm13, %ymm14, %ymm13
vpermpd $216, %ymm13, %ymm14
vunpcklpd %ymm12, %ymm10, %ymm13
vpermpd $216, %ymm13, %ymm13
vmulpd %ymm13, %ymm8, %ymm15
vpermpd $216, %ymm5, %ymm5
vpermpd $216, %ymm6, %ymm6
vpermpd $216, %ymm1, %ymm1
vpermpd $216, %ymm9, %ymm9
vpermpd $216, %ymm0, %ymm0
vunpckhpd %ymm12, %ymm10, %ymm10
vunpcklpd %ymm6, %ymm9, %ymm12
vunpckhpd %ymm6, %ymm9, %ymm9
vunpcklpd %ymm5, %ymm1, %ymm6
vunpckhpd %ymm5, %ymm1, %ymm1
vunpcklpd %ymm14, %ymm0, %ymm5
vunpckhpd %ymm14, %ymm0, %ymm0
vpermpd $216, %ymm0, %ymm0
vmovupd %ymm0, 160(%rsp)
vmovq %r9, %xmm0
vaddsd %xmm15, %xmm0, %xmm0
vunpckhpd %xmm15, %xmm15, %xmm14
vpermpd $216, %ymm10, %ymm10
vaddsd %xmm14, %xmm0, %xmm0
vextractf128 $0x1, %ymm15, %xmm14
vmulpd %ymm10, %ymm8, %ymm8
vaddsd %xmm14, %xmm0, %xmm15
vunpckhpd %xmm14, %xmm14, %xmm14
vpermpd $216, %ymm12, %ymm12
vaddsd %xmm14, %xmm15, %xmm0
vmulpd %ymm10, %ymm3, %ymm15
vunpckhpd %xmm8, %xmm8, %xmm10
vmovq %xmm0, %r9
vmovq %rcx, %xmm0
vmulpd %ymm13, %ymm3, %ymm3
vaddsd %xmm15, %xmm0, %xmm0
vunpckhpd %xmm15, %xmm15, %xmm14
vextractf128 $0x1, %ymm15, %xmm15
vaddsd %xmm14, %xmm0, %xmm14
vpermpd $216, %ymm1, %ymm1
vmovupd %ymm1, 128(%rsp)
vaddsd %xmm15, %xmm14, %xmm14
vunpckhpd %xmm15, %xmm15, %xmm15
vpermpd $216, %ymm2, %ymm2
vaddsd %xmm15, %xmm14, %xmm0
vmovsd 56(%rsp), %xmm14
vpermpd $216, %ymm9, %ymm9
vaddsd %xmm8, %xmm14, %xmm14
vextractf128 $0x1, %ymm8, %xmm8
vmovq %xmm0, %rcx
vaddsd %xmm10, %xmm14, %xmm10
vpermpd $216, %ymm6, %ymm6
vpermpd $216, %ymm11, %ymm11
vaddsd %xmm8, %xmm10, %xmm10
vunpckhpd %xmm8, %xmm8, %xmm8
vpermpd $216, %ymm4, %ymm4
vaddsd %xmm8, %xmm10, %xmm0
vmovsd 48(%rsp), %xmm10
vunpckhpd %xmm3, %xmm3, %xmm8
vaddsd %xmm3, %xmm10, %xmm10
vextractf128 $0x1, %ymm3, %xmm3
vmovsd %xmm0, 56(%rsp)
vaddsd %xmm8, %xmm10, %xmm8
vmulpd %ymm12, %ymm7, %ymm10
vmulpd %ymm9, %ymm7, %ymm7
vaddsd %xmm3, %xmm8, %xmm8
vunpckhpd %xmm3, %xmm3, %xmm3
vpermpd $216, %ymm5, %ymm5
vaddsd %xmm3, %xmm8, %xmm0
vunpckhpd %xmm10, %xmm10, %xmm3
addq $256, %r12
vmovsd %xmm0, 48(%rsp)
vmovq %rdi, %xmm0
vaddsd %xmm10, %xmm0, %xmm8
vextractf128 $0x1, %ymm10, %xmm10
vmovq %rbx, %xmm0
vaddsd %xmm3, %xmm8, %xmm3
vmulpd %ymm9, %ymm2, %ymm8
vmulpd %ymm12, %ymm2, %ymm2
vaddsd %xmm10, %xmm3, %xmm3
vunpckhpd %xmm10, %xmm10, %xmm10
addq $256, %r13
vaddsd %xmm10, %xmm3, %xmm1
vaddsd %xmm8, %xmm0, %xmm10
vunpckhpd %xmm8, %xmm8, %xmm3
vextractf128 $0x1, %ymm8, %xmm8
vaddsd %xmm3, %xmm10, %xmm3
vmovq %xmm1, %rdi
vmovq %r11, %xmm1
vaddsd %xmm8, %xmm3, %xmm3
vunpckhpd %xmm8, %xmm8, %xmm8
vmovq %r10, %xmm0
vaddsd %xmm8, %xmm3, %xmm3
vmovsd 40(%rsp), %xmm8
vaddsd %xmm7, %xmm8, %xmm8
vmovq %xmm3, %rbx
vunpckhpd %xmm7, %xmm7, %xmm3
vaddsd %xmm3, %xmm8, %xmm3
vextractf128 $0x1, %ymm7, %xmm7
vaddsd %xmm7, %xmm3, %xmm3
vunpckhpd %xmm7, %xmm7, %xmm7
vaddsd %xmm7, %xmm3, %xmm3
vmovsd 32(%rsp), %xmm7
vaddsd %xmm2, %xmm7, %xmm7
vmovsd %xmm3, 40(%rsp)
vunpckhpd %xmm2, %xmm2, %xmm3
vaddsd %xmm3, %xmm7, %xmm3
vextractf128 $0x1, %ymm2, %xmm2
vmulpd %ymm6, %ymm11, %ymm7
vaddsd %xmm2, %xmm3, %xmm3
vunpckhpd %xmm2, %xmm2, %xmm2
vaddsd %xmm2, %xmm3, %xmm2
vaddsd %xmm7, %xmm1, %xmm3
vmovupd 128(%rsp), %ymm1
vmovsd %xmm2, 32(%rsp)
vunpckhpd %xmm7, %xmm7, %xmm2
vaddsd %xmm2, %xmm3, %xmm2
vextractf128 $0x1, %ymm7, %xmm7
vmulpd %ymm1, %ymm4, %ymm3
vaddsd %xmm7, %xmm2, %xmm2
vunpckhpd %xmm7, %xmm7, %xmm7
vmulpd %ymm1, %ymm11, %ymm1
vaddsd %xmm7, %xmm2, %xmm2
vaddsd %xmm3, %xmm0, %xmm7
vmulpd %ymm6, %ymm4, %ymm4
vmovq %xmm2, %r11
vunpckhpd %xmm3, %xmm3, %xmm2
vaddsd %xmm2, %xmm7, %xmm2
vextractf128 $0x1, %ymm3, %xmm3
vmovupd 64(%rsp), %ymm6
vaddsd %xmm3, %xmm2, %xmm2
vunpckhpd %xmm3, %xmm3, %xmm3
vmovupd 96(%rsp), %ymm7
vaddsd %xmm3, %xmm2, %xmm2
vmovsd 24(%rsp), %xmm3
vmovupd 160(%rsp), %ymm0
vaddsd %xmm1, %xmm3, %xmm3
vmovq %xmm2, %r10
vunpckhpd %xmm1, %xmm1, %xmm2
vaddsd %xmm2, %xmm3, %xmm2
vextractf128 $0x1, %ymm1, %xmm1
vmovq %rbp, %xmm3
vaddsd %xmm1, %xmm2, %xmm2
vunpckhpd %xmm1, %xmm1, %xmm1
vaddsd %xmm1, %xmm2, %xmm2
vunpckhpd %xmm4, %xmm4, %xmm1
vmovsd %xmm2, 24(%rsp)
vmovsd 16(%rsp), %xmm2
vaddsd %xmm4, %xmm2, %xmm2
vextractf128 $0x1, %ymm4, %xmm4
vaddsd %xmm1, %xmm2, %xmm1
vaddsd %xmm4, %xmm1, %xmm1
vunpckhpd %xmm4, %xmm4, %xmm4
vaddsd %xmm4, %xmm1, %xmm4
vmovsd %xmm4, 16(%rsp)
vmulpd %ymm6, %ymm5, %ymm4
vmulpd %ymm7, %ymm5, %ymm5
vaddsd %xmm4, %xmm3, %xmm1
vunpckhpd %xmm4, %xmm4, %xmm2
vmovq %rsi, %xmm3
vaddsd %xmm2, %xmm1, %xmm2
vextractf128 $0x1, %ymm4, %xmm1
vaddsd %xmm1, %xmm2, %xmm2
vunpckhpd %xmm1, %xmm1, %xmm1
vaddsd %xmm1, %xmm2, %xmm4
vmovq %xmm4, %rbp
vmulpd %ymm0, %ymm7, %ymm4
vmulpd %ymm0, %ymm6, %ymm0
vaddsd %xmm4, %xmm3, %xmm1
vunpckhpd %xmm4, %xmm4, %xmm2
vaddsd %xmm2, %xmm1, %xmm2
vextractf128 $0x1, %ymm4, %xmm1
vaddsd %xmm1, %xmm2, %xmm2
vunpckhpd %xmm1, %xmm1, %xmm1
vaddsd %xmm1, %xmm2, %xmm4
vmovsd 8(%rsp), %xmm2
vunpckhpd %xmm0, %xmm0, %xmm1
vaddsd %xmm0, %xmm2, %xmm2
vextractf128 $0x1, %ymm0, %xmm0
vmovq %xmm4, %rsi
vaddsd %xmm1, %xmm2, %xmm1
vaddsd %xmm0, %xmm1, %xmm1
vunpckhpd %xmm0, %xmm0, %xmm0
vaddsd %xmm0, %xmm1, %xmm6
vmovsd (%rsp), %xmm1
vunpckhpd %xmm5, %xmm5, %xmm0
vaddsd %xmm5, %xmm1, %xmm1
vextractf128 $0x1, %ymm5, %xmm5
vmovsd %xmm6, 8(%rsp)
vaddsd %xmm0, %xmm1, %xmm0
vaddsd %xmm5, %xmm0, %xmm0
vunpckhpd %xmm5, %xmm5, %xmm5
vaddsd %xmm5, %xmm0, %xmm5
vmovsd %xmm5, (%rsp)
cmpq %rax, %r12
jne .L4
movl %r15d, %r12d
andl $-4, %r12d
movl %r12d, %edx
cmpl %r12d, %r15d
je .L5
.L3:
movl %r15d, %eax
subl %r12d, %eax
cmpl $1, %eax
je .L6
salq $6, %r12
leaq (%r14,%r12), %r13
vmovupd 16(%r13), %xmm3
vmovupd 48(%r13), %xmm0
vmovupd 64(%r13), %xmm8
vmovupd 112(%r13), %xmm10
vmovupd 0(%r13), %xmm4
vmovupd 32(%r13), %xmm2
vmovupd 80(%r13), %xmm6
vmovupd 96(%r13), %xmm1
vunpcklpd %xmm3, %xmm4, %xmm5
vunpckhpd %xmm3, %xmm4, %xmm4
vunpcklpd %xmm0, %xmm2, %xmm3
vunpckhpd %xmm0, %xmm2, %xmm2
vunpcklpd %xmm6, %xmm8, %xmm0
vunpckhpd %xmm6, %xmm8, %xmm6
vunpcklpd %xmm10, %xmm1, %xmm8
vunpckhpd %xmm10, %xmm1, %xmm1
vunpcklpd %xmm3, %xmm5, %xmm11
vunpcklpd %xmm2, %xmm4, %xmm10
vunpckhpd %xmm3, %xmm5, %xmm3
vunpckhpd %xmm2, %xmm4, %xmm2
vunpcklpd %xmm8, %xmm0, %xmm5
vunpcklpd %xmm1, %xmm6, %xmm4
vunpckhpd %xmm8, %xmm0, %xmm0
vunpckhpd %xmm1, %xmm6, %xmm1
addq %r8, %r12
vunpcklpd %xmm5, %xmm11, %xmm8
vunpckhpd %xmm0, %xmm3, %xmm7
vunpckhpd %xmm5, %xmm11, %xmm11
vunpckhpd %xmm1, %xmm2, %xmm5
vmovupd 64(%r12), %xmm12
vunpcklpd %xmm1, %xmm2, %xmm6
vmovupd 80(%r12), %xmm9
vmovupd 48(%r12), %xmm1
vmovupd 96(%r12), %xmm2
vunpcklpd %xmm4, %xmm10, %xmm14
vunpcklpd %xmm0, %xmm3, %xmm13
vunpckhpd %xmm4, %xmm10, %xmm10
vmovupd 32(%r12), %xmm3
vmovupd 16(%r12), %xmm4
vmovapd %xmm7, 64(%rsp)
vmovapd %xmm5, 96(%rsp)
vmovupd 112(%r12), %xmm7
vmovupd (%r12), %xmm5
movl %eax, %r12d
vunpcklpd %xmm4, %xmm5, %xmm15
vunpckhpd %xmm4, %xmm5, %xmm5
vunpcklpd %xmm1, %xmm3, %xmm4
vunpckhpd %xmm1, %xmm3, %xmm3
vunpcklpd %xmm9, %xmm12, %xmm1
vunpckhpd %xmm9, %xmm12, %xmm9
vunpcklpd %xmm7, %xmm2, %xmm12
vunpckhpd %xmm7, %xmm2, %xmm2
vunpcklpd %xmm4, %xmm15, %xmm7
vunpckhpd %xmm4, %xmm15, %xmm15
vunpcklpd %xmm12, %xmm1, %xmm4
vunpckhpd %xmm12, %xmm1, %xmm1
vunpcklpd %xmm3, %xmm5, %xmm12
vunpckhpd %xmm3, %xmm5, %xmm5
vunpcklpd %xmm2, %xmm9, %xmm3
vunpckhpd %xmm2, %xmm9, %xmm2
vunpcklpd %xmm4, %xmm7, %xmm9
vunpckhpd %xmm1, %xmm15, %xmm0
vunpckhpd %xmm4, %xmm7, %xmm4
vunpcklpd %xmm3, %xmm12, %xmm7
vunpckhpd %xmm3, %xmm12, %xmm3
vunpcklpd %xmm1, %xmm15, %xmm12
vunpcklpd %xmm2, %xmm5, %xmm15
vunpckhpd %xmm2, %xmm5, %xmm2
vmulpd %xmm9, %xmm8, %xmm5
vmovapd %xmm0, 128(%rsp)
vmovq %r9, %xmm0
andl $-2, %r12d
addl %r12d, %edx
vaddsd %xmm5, %xmm0, %xmm0
vunpckhpd %xmm5, %xmm5, %xmm5
vaddsd %xmm5, %xmm0, %xmm1
vmulpd %xmm4, %xmm11, %xmm5
vmulpd %xmm4, %xmm8, %xmm4
vmovq %xmm1, %r9
vmovq %rcx, %xmm1
vmulpd %xmm9, %xmm11, %xmm11
vaddsd %xmm5, %xmm1, %xmm1
vunpckhpd %xmm5, %xmm5, %xmm5
vmulpd %xmm7, %xmm14, %xmm9
vaddsd %xmm5, %xmm1, %xmm1
vmovsd 56(%rsp), %xmm5
vmulpd %xmm3, %xmm10, %xmm8
vaddsd %xmm4, %xmm5, %xmm5
vunpckhpd %xmm4, %xmm4, %xmm4
vmovq %xmm1, %rcx
vaddsd %xmm4, %xmm5, %xmm4
vmovq %rdi, %xmm1
vmulpd %xmm3, %xmm14, %xmm14
vmovsd %xmm4, 56(%rsp)
vmovsd 48(%rsp), %xmm4
vmovq %rbx, %xmm0
vaddsd %xmm11, %xmm4, %xmm4
vunpckhpd %xmm11, %xmm11, %xmm11
vmovsd 40(%rsp), %xmm3
vaddsd %xmm11, %xmm4, %xmm4
vmulpd %xmm7, %xmm10, %xmm10
vaddsd %xmm14, %xmm3, %xmm3
vmovsd %xmm4, 48(%rsp)
vaddsd %xmm9, %xmm1, %xmm4
vunpckhpd %xmm9, %xmm9, %xmm9
vunpckhpd %xmm14, %xmm14, %xmm14
vaddsd %xmm9, %xmm4, %xmm4
vmovapd 128(%rsp), %xmm5
vmovapd 64(%rsp), %xmm11
vmovq %xmm4, %rdi
vaddsd %xmm8, %xmm0, %xmm4
vunpckhpd %xmm8, %xmm8, %xmm8
vmovsd 24(%rsp), %xmm1
vaddsd %xmm8, %xmm4, %xmm4
vmovsd 16(%rsp), %xmm0
vmovq %xmm4, %rbx
vaddsd %xmm14, %xmm3, %xmm4
vmovsd 32(%rsp), %xmm3
vaddsd %xmm10, %xmm3, %xmm3
vunpckhpd %xmm10, %xmm10, %xmm10
vmovsd %xmm4, 40(%rsp)
vaddsd %xmm10, %xmm3, %xmm7
vmulpd %xmm12, %xmm13, %xmm3
vmulpd %xmm5, %xmm13, %xmm13
vmovsd %xmm7, 32(%rsp)
vmovq %r11, %xmm7
vmulpd %xmm11, %xmm12, %xmm12
vaddsd %xmm3, %xmm7, %xmm4
vunpckhpd %xmm3, %xmm3, %xmm3
vaddsd %xmm13, %xmm1, %xmm1
vaddsd %xmm3, %xmm4, %xmm7
vmulpd %xmm5, %xmm11, %xmm3
vunpckhpd %xmm13, %xmm13, %xmm13
vmovq %xmm7, %r11
vmovq %r10, %xmm7
vaddsd %xmm12, %xmm0, %xmm0
vaddsd %xmm3, %xmm7, %xmm4
vunpckhpd %xmm3, %xmm3, %xmm3
vunpckhpd %xmm12, %xmm12, %xmm12
vaddsd %xmm3, %xmm4, %xmm7
vaddsd %xmm13, %xmm1, %xmm4
vmovq %xmm7, %r10
vmovsd %xmm4, 24(%rsp)
vaddsd %xmm12, %xmm0, %xmm4
vmulpd %xmm15, %xmm6, %xmm0
vmovq %rbp, %xmm7
vmovsd %xmm4, 16(%rsp)
vmovapd 96(%rsp), %xmm5
vaddsd %xmm0, %xmm7, %xmm1
vunpckhpd %xmm0, %xmm0, %xmm0
vmovq %rsi, %xmm7
vaddsd %xmm0, %xmm1, %xmm4
vmulpd %xmm5, %xmm2, %xmm0
vmulpd %xmm2, %xmm6, %xmm2
vmovq %xmm4, %rbp
vmulpd %xmm5, %xmm15, %xmm15
vaddsd %xmm0, %xmm7, %xmm1
vunpckhpd %xmm0, %xmm0, %xmm0
vaddsd %xmm0, %xmm1, %xmm4
vmovsd 8(%rsp), %xmm0
vaddsd %xmm2, %xmm0, %xmm0
vunpckhpd %xmm2, %xmm2, %xmm2
vmovq %xmm4, %rsi
vaddsd %xmm2, %xmm0, %xmm6
vmovsd (%rsp), %xmm0
vaddsd %xmm15, %xmm0, %xmm0
vunpckhpd %xmm15, %xmm15, %xmm15
vmovsd %xmm6, 8(%rsp)
vaddsd %xmm15, %xmm0, %xmm5
vmovsd %xmm5, (%rsp)
cmpl %r12d, %eax
je .L5