Thanks for working on this, Christophe, and sorry I missed the PR. You got further with the fixes than I did, though :). A couple of comments:
> For the vec_set<mode>_internal and neon_vld1_dup<mode> patterns, I
> switched to an existing iterator which already had the needed
> V4HF/V8HF (so I switched to VD_LANE and VQ2).

It's a separate issue, and I hadn't done this either, but looking again - I don't see any reason why we shouldn't apply VD->VD_LANE to the vec_extract standard name pattern too. (At present it looks like we have vec_extractv8hf but no vec_extractv4hf?)

> For neon_vdupn, I chose to implement neon_vdup_nv4hf and
> neon_vdup_nv8hf instead of updating the VX iterator because I thought
> it was not desirable to impact neon_vrev32<mode>.

Well, the same instruction will suffice for vrev32'ing vectors of HF just as well as vectors of HI, so I think I'd argue that's harmless enough. To gain the benefit, we'd need to update arm_evpc_neon_vrev with a few new cases, though.

> @@ -5252,12 +5252,22 @@ vget_lane_s32 (int32x2_t __a, const int __b)
>    were marked always-inline so there were no call sites, the declaration
>    would nonetheless raise an error.  Hence, we must use a macro instead.  */
>
> +  /* For big-endian, GCC's vector indices are the opposite way around
> +     to the architectural lane indices used by Neon intrinsics.  */

Not quite the opposite way around, as you take into account yourself! 'Reversed within each 64 bits', perhaps?

> +#ifdef __ARM_BIG_ENDIAN
> +  /* Here, 3 is (4-1) where 4 is the number of lanes.  This is also the
> +     right value for vectors with 8 lanes.  */
> +#define __arm_lane(__vec, __idx) (__idx ^ 3)
> +#else
> +#define __arm_lane(__vec, __idx) __idx
> +#endif
> +

Looks right, but my concern here is that I'm hoping at some point we will move the *other* vget/set_lane intrinsics to use GCC vector extensions too. At that time (unlike __aarch64_lane, which can be used everywhere) this will be the wrong formula. Can we name (and/or comment) it to avoid misleading anyone? The key characteristic seems to be that it is for vectors of 16-bit elements only.
> @@ -5334,7 +5344,7 @@ vgetq_lane_s32 (int32x4_t __a, const int __b)
>    ({ \
>      float16x8_t __vec = (__v); \
>      __builtin_arm_lane_check (8, __idx); \
> -    float16_t __res = __vec[__idx]; \
> +    float16_t __res = __vec[__arm_lane(__vec, __idx)]; \

In passing - the function name in the @@ header is of course misleading; this is #define vgetq_lane_f16 (and similarly for the later hunks).

Thanks, Alan