SjoerdMeijer added inline comments.

================
Comment at: include/clang/Basic/arm_neon.td:1504
+  // Scalar floating point multiply extended (scalar, by element)
+  def SCALAR_FMULX_LANEH : IOpInst<"vmulx_lane", "ssdi", "Sh", OP_SCALAR_MUL_LN>;
+  def SCALAR_FMULX_LANEQH : IOpInst<"vmulx_laneq", "ssji", "Sh", OP_SCALAR_MUL_LN>;
----------------
az wrote:
> SjoerdMeijer wrote:
> > I found that unfortunately it's not that straightforward. This leads to
> > wrong code generation: it generates an fmul instead of an fmulx. I suspect
> > this instruction description should be using OP_SCALAR_MULX_LN, and that
> > the type decls are also wrong. I need to dig a bit further here.
> Sorry for the confusion: the commented-out code was never intended to be
> used; it is a copy of the code for the vmulh_lane() intrinsic. It was done
> that way to point out that the vmulh_lane() and vmulxh_lane() intrinsics
> should be implemented in a similar way. The only useful thing in the
> commented-out code is the note that we need the scalar intrinsic
> vmulxh_f16(), which was implemented later in the scalar intrinsic patch.
>  
> If we look at how vmulh_lane (a, b, lane) is implemented:
> x = extract (b, lane);
> res = a * x;
> return res; 
> 
> Similarly, I thought at the time that vmulxh_lane (a, b, lane) could be
> implemented as:
> x = extract (b, lane);
> res = vmulxh_f16 (a, x);  // no native LLVM mulx instruction, so we use the fp16 scalar intrinsic
> return res;
> 
> I am not sure now that we can easily use the scalar intrinsic while
> generating the arm_neon.h file. If we cannot, I am thinking that the
> frontend should generate a new builtin for the vmulxh_lane() intrinsic that
> the backend recognizes and generates the right code for, namely
> fmulx h0, h0, v1.h[lane]. If you have made or will be making progress on
> this, that is great. Otherwise, I can look at a frontend solution for it.
Hi Abderrazek,
Thanks for the clarifications!  And I agree with your observations.
This simple change looked like it did the right thing because, as you also
said, this vmulx is just an extract and a multiply, but it was then incorrectly
generating an fmul where it should generate an fmulx. I briefly looked at
fixing this, but I also didn't see how I could use the scalar intrinsic here.
Passing a builtin through indeed looks like the best approach, also because
fmulx is instruction-selected based on an intrinsic:

  defm FMULX    : SIMDThreeSameVectorFP<0,0,0b011,"fmulx", int_aarch64_neon_fmulx>;
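
For reference, here is a minimal C sketch of the extract-plus-scalar-fmulx
expansion discussed above. It assumes an ARMv8.2-A FP16 target and the ACLE
intrinsics vget_lane_f16() and vmulxh_f16(); the wrapper name and the fixed
lane index are made up for illustration and are not part of the patch:

  #include <arm_neon.h>   // float16x4_t, vget_lane_f16
  #include <arm_fp16.h>   // vmulxh_f16 (scalar FP16 intrinsics)

  // Hypothetical expansion of vmulxh_lane(a, b, lane) for a fixed lane:
  // extract the lane, then multiply with the scalar fmulx intrinsic, since
  // there is no native LLVM mulx operation to multiply with directly.
  static inline float16_t vmulxh_lane_sketch(float16_t a, float16x4_t b) {
    float16_t x = vget_lane_f16(b, 1);  // extract(b, lane); lane must be a constant
    return vmulxh_f16(a, x);            // scalar floating-point multiply extended
  }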

If you have the bandwidth to pick this up, that would be great; I started 
looking into the other failing AArch64 vector intrinsics.

Cheers,
Sjoerd.


https://reviews.llvm.org/D44222


