Thanks for working on this, Christophe, and sorry I missed the PR. You got
further and fixed more things than I did, though :). A couple of comments:

> For the vec_set<mode>_internal and neon_vld1_dup<mode> patterns, I
> switched to an existing iterator which already had the needed
> V4HF/V8HF (so I switched to VD_LANE and VQ2).

It's a separate issue, and I hadn't done this either, but looking again - I
don't see any reason why we shouldn't apply VD->VD_LANE to the vec_extract
standard name pattern too. (At present it looks like we have vec_extractv8hf
but no vec_extractv4hf?)

> For neon_vdupn, I chose to implement neon_vdup_nv4hf and
> neon_vdup_nv8hf instead of updating the VX iterator because I thought
> it was not desirable to impact neon_vrev32<mode>.

Well, the same instruction will suffice for vrev32'ing vectors of HF just as
well as vectors of HI, so I think I'd argue that's harmless enough. To gain the
benefit, we'd need to update arm_evpc_neon_vrev with a few new cases, though.

> @@ -5252,12 +5252,22 @@ vget_lane_s32 (int32x2_t __a, const int __b)
>     were marked always-inline so there were no call sites, the declaration
>     would nonetheless raise an error.  Hence, we must use a macro instead.  */
>
> +  /* For big-endian, GCC's vector indices are the opposite way around
> +     to the architectural lane indices used by Neon intrinsics.  */

Not quite the opposite way around, as you take into account yourself! 'Reversed
within each 64 bits', perhaps?

> +#ifdef __ARM_BIG_ENDIAN
> +  /* Here, 3 is (4-1) where 4 is the number of lanes. This is also the
> +     right value for vectors with 8 lanes.  */
> +#define __arm_lane(__vec, __idx) (__idx ^ 3)
> +#else
> +#define __arm_lane(__vec, __idx) __idx
> +#endif
> +

Looks right, but my concern here is that I'm hoping at some point we will
move the *other* vget/set_lane intrinsics to use GCC vector extensions too.
At which time (unlike __aarch64_lane, which can be used everywhere) this
will be the wrong formula. Can we name (and/or comment) it to avoid misleading
anyone? The key characteristic seems to be that it is for vectors of 16-bit
elements only.

> @@ -5334,7 +5344,7 @@ vgetq_lane_s32 (int32x4_t __a, const int __b)
>      ({                                               \
>        float16x8_t __vec = (__v);             \
>        __builtin_arm_lane_check (8, __idx);   \
> -      float16_t __res = __vec[__idx];                \
> +      float16_t __res = __vec[__arm_lane(__vec, __idx)];     \

In passing - the function name in the @@ header is of course misleading; this
hunk is actually in #define vgetq_lane_f16 (and similarly for the later hunks).

Thanks, Alan
