https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89101
            Bug ID: 89101
           Summary: [Aarch64] vfmaq_laneq_f32 generates unnecessary dup
                    instructions
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: gael.guennebaud at gmail dot com
  Target Milestone: ---

vfmaq_laneq_f32 is currently implemented as:

__extension__ static __inline float32x4_t __attribute__ ((__always_inline__))
vfmaq_laneq_f32 (float32x4_t __a, float32x4_t __b, float32x4_t __c,
                 const int __lane)
{
  return __builtin_aarch64_fmav4sf (__b,
                                    __aarch64_vdupq_laneq_f32 (__c, __lane),
                                    __a);
}

thus leading to unoptimized code such as:

        ldr     q1, [x2, 16]
        dup     v28.4s, v1.s[0]
        dup     v27.4s, v1.s[1]
        dup     v26.4s, v1.s[2]
        dup     v1.4s, v1.s[3]
        fmla    v22.4s, v25.4s, v28.4s
        fmla    v3.4s, v25.4s, v27.4s
        fmla    v6.4s, v25.4s, v26.4s
        fmla    v17.4s, v25.4s, v1.4s

instead of:

        ldr     q1, [x2, 16]
        fmla    v22.4s, v25.4s, v1.s[0]
        fmla    v3.4s, v25.4s, v1.s[1]
        fmla    v6.4s, v25.4s, v1.s[2]
        fmla    v17.4s, v25.4s, v1.s[3]

I guess several other *lane* intrinsics exhibit the same shortcoming.

For the record, I managed to partly work around this issue by writing my own
version as:

if (LaneID == 0)
  asm("fmla %0.4s, %1.4s, %2.s[0]\n" : "+w" (c) : "w" (a), "w" (b) : );
else if (LaneID == 1)
  asm("fmla %0.4s, %1.4s, %2.s[1]\n" : "+w" (c) : "w" (a), "w" (b) : );
else if (LaneID == 2)
  asm("fmla %0.4s, %1.4s, %2.s[2]\n" : "+w" (c) : "w" (a), "w" (b) : );
else if (LaneID == 3)
  asm("fmla %0.4s, %1.4s, %2.s[3]\n" : "+w" (c) : "w" (a), "w" (b) : );

but that is of course not ideal. This change yields a 32% speed-up in Eigen's
matrix product: http://eigen.tuxfamily.org/bz/show_bug.cgi?id=1633
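
For reference, a minimal sketch of a reproducer (the function and variable
names below are mine, not taken from the report) that should show the same
dup/fmla pattern as the assembly above when built with -O2 for aarch64:

#include <arm_neon.h>

/* Hypothetical kernel: four lane-indexed FMAs against the lanes of one
   vector b, the same shape as the assembly quoted in the report.  With
   the affected GCC each vfmaq_laneq_f32 is expected to expand to a dup
   followed by a plain fmla, rather than an fmla by element.  */
float32x4_t
lane_fma_kernel (float32x4_t acc0, float32x4_t acc1,
                 float32x4_t acc2, float32x4_t acc3,
                 float32x4_t a, float32x4_t b)
{
  acc0 = vfmaq_laneq_f32 (acc0, a, b, 0);
  acc1 = vfmaq_laneq_f32 (acc1, a, b, 1);
  acc2 = vfmaq_laneq_f32 (acc2, a, b, 2);
  acc3 = vfmaq_laneq_f32 (acc3, a, b, 3);
  return vaddq_f32 (vaddq_f32 (acc0, acc1), vaddq_f32 (acc2, acc3));
}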