https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89101
            Bug ID: 89101
           Summary: [Aarch64] vfmaq_laneq_f32 generates unnecessary dup
                    instructions
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: gael.guennebaud at gmail dot com
  Target Milestone: ---

vfmaq_laneq_f32 is currently implemented as:

__extension__ static __inline float32x4_t __attribute__ ((__always_inline__))
vfmaq_laneq_f32 (float32x4_t __a, float32x4_t __b, float32x4_t __c,
                 const int __lane)
{
  return __builtin_aarch64_fmav4sf (__b,
                                    __aarch64_vdupq_laneq_f32 (__c, __lane),
                                    __a);
}

thus leading to unoptimized code such as:

        ldr     q1, [x2, 16]
        dup     v28.4s, v1.s[0]
        dup     v27.4s, v1.s[1]
        dup     v26.4s, v1.s[2]
        dup     v1.4s, v1.s[3]
        fmla    v22.4s, v25.4s, v28.4s
        fmla    v3.4s, v25.4s, v27.4s
        fmla    v6.4s, v25.4s, v26.4s
        fmla    v17.4s, v25.4s, v1.4s

instead of:

        ldr     q1, [x2, 16]
        fmla    v22.4s, v25.4s, v1.s[0]
        fmla    v3.4s, v25.4s, v1.s[1]
        fmla    v6.4s, v25.4s, v1.s[2]
        fmla    v17.4s, v25.4s, v1.s[3]

I guess several other *lane* intrinsics exhibit the same shortcoming.

For the record, I managed to partly work around this issue by writing my own
version as:

if (LaneID == 0)
  asm("fmla %0.4s, %1.4s, %2.s[0]\n" : "+w" (c) : "w" (a), "w" (b) : );
else if (LaneID == 1)
  asm("fmla %0.4s, %1.4s, %2.s[1]\n" : "+w" (c) : "w" (a), "w" (b) : );
else if (LaneID == 2)
  asm("fmla %0.4s, %1.4s, %2.s[2]\n" : "+w" (c) : "w" (a), "w" (b) : );
else if (LaneID == 3)
  asm("fmla %0.4s, %1.4s, %2.s[3]\n" : "+w" (c) : "w" (a), "w" (b) : );

but that is of course not ideal. This change yields a 32% speed-up in Eigen's
matrix product: http://eigen.tuxfamily.org/bz/show_bug.cgi?id=1633
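
For reference, a minimal sketch of a reproducer (the function and variable
names below are mine, not taken from the report) that should show the same
dup/fmla pattern as the assembly above when built with -O2 for aarch64:

#include <arm_neon.h>

/* Hypothetical kernel: four lane-indexed FMAs against the lanes of one
   vector b, the same shape as the assembly quoted in the report.  With
   the affected GCC each vfmaq_laneq_f32 is expected to expand to a dup
   followed by a plain fmla, rather than an fmla by element.  */
float32x4_t
lane_fma_kernel (float32x4_t acc0, float32x4_t acc1,
                 float32x4_t acc2, float32x4_t acc3,
                 float32x4_t a, float32x4_t b)
{
  acc0 = vfmaq_laneq_f32 (acc0, a, b, 0);
  acc1 = vfmaq_laneq_f32 (acc1, a, b, 1);
  acc2 = vfmaq_laneq_f32 (acc2, a, b, 2);
  acc3 = vfmaq_laneq_f32 (acc3, a, b, 3);
  return vaddq_f32 (vaddq_f32 (acc0, acc1), vaddq_f32 (acc2, acc3));
}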