http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57988

            Bug ID: 57988
           Summary: missed optimization vxorpd before vcvtsi2sdq
           Product: gcc
           Version: 4.9.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: dushistov at mail dot ru

I tested this simple function on an i7-3740QM CPU @ 2.70GHz with gcc 4.8.1 and
gcc 4.9.0 20130725:

double pi(unsigned int count)
{
        unsigned int i;
        double p = 0;
        double z = 1;
        for (i = 1; i < count; i+=2) {
                p += z * 4 / i;
                z *= -1;
        }

        return p;
}

With -Ofast -march=native -std=c99, gcc compiles the loop to the following code:
...
30:
mov    %eax,%edx
vmulsd %xmm5,%xmm1,%xmm3
add    $0x2,%eax
vcvtsi2sd %rdx,%xmm2,%xmm2
cmp    %eax,%edi
vxorpd %xmm4,%xmm1,%xmm1
vdivsd %xmm2,%xmm3,%xmm2
vaddsd %xmm2,%xmm0,%xmm0
ja     30


The average time is 0.03 sec for a call such as pi(10000000).

If the line "vcvtsi2sd %rdx,%xmm2,%xmm2" is replaced with these two lines:
    vxorpd %xmm2,%xmm2,%xmm2
    vcvtsi2sd %rdx,%xmm2,%xmm2

then the average time drops to 0.011-0.013 sec, nearly 3 times faster.
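
For anyone who wants to reproduce the timings, here is a minimal sketch of a
harness. It is not the exact program used for the numbers in this report:
average_seconds(), the run count, and the PI_BENCH_MAIN guard are hypothetical
names chosen for illustration, and clock() measures CPU time rather than wall
time.

```c
#include <stdio.h>
#include <time.h>

/* The function from this report, unchanged. */
double pi(unsigned int count)
{
        unsigned int i;
        double p = 0;
        double z = 1;
        for (i = 1; i < count; i += 2) {
                p += z * 4 / i;
                z *= -1;
        }

        return p;
}

/* Hypothetical helper: average CPU time in seconds of one pi(count) call. */
double average_seconds(unsigned int count, int runs, double *result)
{
        /* Reading the argument through a volatile keeps the compiler from
           hoisting the pi() computation out of the timing loop. */
        volatile unsigned int n = count;
        double total = 0.0;
        int r;

        for (r = 0; r < runs; r++) {
                clock_t start = clock();
                *result = pi(n);
                total += (double)(clock() - start) / CLOCKS_PER_SEC;
        }

        return total / runs;
}

#ifdef PI_BENCH_MAIN
int main(void)
{
        double value = 0.0;
        double avg = average_seconds(10000000, 10, &value);

        printf("pi ~= %.9f, average %.4f sec per call\n", value, avg);
        return 0;
}
#endif
```

Building with -DPI_BENCH_MAIN in addition to the flags above gives a standalone
program; the absolute times will of course depend on the machine.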

PS: icc generates the following loop:
22:
vxorpd %xmm5,%xmm5,%xmm5
vcvtsi2sd %rax,%xmm5,%xmm5
vmulsd %xmm2,%xmm1,%xmm4
vsubsd %xmm2,%xmm3,%xmm2
vdivsd %xmm5,%xmm4,%xmm6
add    $0x2,%eax
vaddsd %xmm6,%xmm0,%xmm0
cmp    %edi,%eax
jb     22

with an average time of 0.013 sec.
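
The likely explanation for the difference is that vcvtsi2sd writes only the low
64 bits of its destination and takes the upper bits from the source register,
so without the vxorpd each conversion waits for the previous iteration's
vdivsd result in %xmm2. As a possible source-level workaround, the conversion
can be written with SSE2 intrinsics so that it starts from an explicitly zeroed
register. This is only a sketch: u32_to_double() and pi_zeroed() are
hypothetical names, and the compiler remains free to choose a different
instruction sequence, so the disassembly should be checked.

```c
#if defined(__x86_64__) || defined(_M_X64)
#include <emmintrin.h>

/* Hypothetical helper: convert via a zeroed vector so the conversion
   does not depend on the old contents of the destination register. */
static inline double u32_to_double(unsigned int i)
{
        return _mm_cvtsd_f64(_mm_cvtsi64_sd(_mm_setzero_pd(), (long long)i));
}
#else
/* Fallback for other targets: plain conversion. */
static inline double u32_to_double(unsigned int i)
{
        return (double)i;
}
#endif

/* Same loop as pi() in this report, with the conversion made explicit. */
double pi_zeroed(unsigned int count)
{
        unsigned int i;
        double p = 0;
        double z = 1;
        for (i = 1; i < count; i += 2) {
                p += z * 4 / u32_to_double(i);
                z *= -1;
        }

        return p;
}
```

With this form one would expect gcc to emit a vxorpd (or an equivalent zeroing
idiom) ahead of vcvtsi2sd, similar to the icc loop above, but I have not
verified that for every gcc version.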
