On 02/09/14 16:28, Richard Henderson wrote:
Is it intentional or not that AArch64 does not define __ARM_NEON__?

Yes I remember so, __ARM_NEON__ is not ACLE compatible so we haven't defined it for AArch64 - on AArch32 and AArch64 we now have __ARM_NEON defined so that's the macro to be used.


Otherwise, here's a better way to fold the test bits.  AArch64 of
course does not have dN+1 overlap the high part of the qM register,
like AArch32, so the current

       l = vpadd_u8 (vget_low_u8 (t), vget_high_u8 (t));

implies extra register moves.  But on the good side, the armv8 ADDV
instruction allows two instructions to be removed from this fast path.

Cool.


When built for 32-bit, the new form results in the same instruction
count; we simply keep using "q" registers instead of "d" registers
for two more insns.  Given that there are currently ifdefs involved,
it would certainly be possible to keep the 32-bit path unchanged, if
that's thought to be valuable.

The ADDV instruction isn't available on the AArch32 side IIRC. Given that situation there is no intrinsic for ADDV on the AArch32 side which is why this doesn't exist in the AArch32 version of arm_neon.h :(

I'll need to take a look at the new code generated for AArch32 and will probably be able to get back tomorrow as I'll disappear shortly.


I did wonder if the armv8 stuff was supposed to be included in the
AArch32 arm_neon.h?  Is it just an oversight that it's missing?

The ARMv8 stuff is included for arm_neon.h - I believe we've implemented everything that's ARMv8 specific in arm_neon.h for AArch32 . Anything missing would be an oversight.


regards
Ramana





r~


        * lex.c (search_line_fast) [__ARM_NEON]: Use __FOO not __FOO__
        to detect neon support.  Fold the comparison using ADDV when
        available.


diff --git a/libcpp/lex.c b/libcpp/lex.c
index 5366dad..6d1823e 100644
--- a/libcpp/lex.c
+++ b/libcpp/lex.c
@@ -638,7 +638,7 @@ search_line_fast (const uchar *s, const uchar *end 
ATTRIBUTE_UNUSED)
    }
  }

-#elif defined (__ARM_NEON__)
+#elif defined (__ARM_NEON)
  #include "arm_neon.h"

  static const uchar *
@@ -649,6 +649,7 @@ search_line_fast (const uchar *s, const uchar *end 
ATTRIBUTE_UNUSED)
    const uint8x16_t repl_bs = vdupq_n_u8 ('\\');
    const uint8x16_t repl_qm = vdupq_n_u8 ('?');
    const uint8x16_t xmask = (uint8x16_t) vdupq_n_u64 (0x8040201008040201ULL);
+  const int16x8_t shift = { 0, 0, 0, 0, 8, 8, 8, 8 };

    unsigned int misalign, found, mask;
    const uint8_t *p;
@@ -670,10 +671,8 @@ search_line_fast (const uchar *s, const uchar *end 
ATTRIBUTE_UNUSED)

    do
      {
-      uint8x8_t l;
-      uint16x4_t m;
-      uint32x2_t n;
        uint8x16_t t, u, v, w;
+      uint16x8_t l;

        p += 16;
        data = vld1q_u8 (p);
@@ -685,12 +684,24 @@ search_line_fast (const uchar *s, const uchar *end 
ATTRIBUTE_UNUSED)
        v = vorrq_u8 (t, vceqq_u8 (data, repl_bs));
        w = vorrq_u8 (u, vceqq_u8 (data, repl_qm));
        t = vandq_u8 (vorrq_u8 (v, w), xmask);
-      l = vpadd_u8 (vget_low_u8 (t), vget_high_u8 (t));
-      m = vpaddl_u8 (l);
-      n = vpaddl_u16 (m);
-
-      found = vget_lane_u32 ((uint32x2_t) vorr_u64 ((uint64x1_t) n,
-             vshr_n_u64 ((uint64x1_t) n, 24)), 0);
+
+      l = vpaddlq_u8 (t);
+      l = vshlq_u16 (l, shift);
+
+      /* ??? Ideally, this would be if (__ARM_ARCH >= 8) since the ADDV insn
+        reduces the instruction count by two.  But vaddvq is not present in
+        the arm32 arm_neon.h, nor does AArch64 define __ARM_ARCH.  */
+#ifdef __aarch64__
+      found = vaddvq_u16 (l);
+#else
+      {
+       uint32x4_t m = vpaddlq_u16 (l);
+       uint64x2_t n = vpaddlq_u32 (m);
+       uint64x1_t o = vget_low_u64 (n) + vget_high_u64 (n);
+       found = vget_lane_u32 ((uint32x2_t)o, 0);
+      }
+#endif
+
        found &= mask;
      }
    while (!found);

Reply via email to