I wrote:
> In the end, I want to add a length check so
> that inputs smaller than 80 bytes go straight to the scalar path.
> Above 80, after alignment adjustments in the preamble, that still
> guarantees at least one loop iteration in the vector path.
Attached is a patch showing how that would look. The idea is that small
inputs will encounter fewer branches. It would be tricky to demonstrate a
difference with a benchmark, and I see this as just making the small-input
path more similar to PG 18, as a risk-avoidance measure.
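
For anyone who wants to check the arithmetic behind the 80-byte threshold,
here is a minimal standalone sketch (not part of the patch). It mirrors only
the address/length bookkeeping of the alignment preamble, with no CRC
instructions, so it builds on any platform. VEC_SIZE is hard-coded to 16 on
the assumption that it matches sizeof(uint64x2_t), and the helper names are
made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: sizeof(uint64x2_t) == 16, hard-coded so this builds anywhere. */
#define VEC_SIZE		16
#define MIN_VECTOR_LEN	(5 * VEC_SIZE)	/* the 80-byte threshold in the patch */
#define LOOP_BLOCK		64				/* bytes needed for one loop iteration */

/*
 * Hypothetical helper: replay the preamble's pointer and length updates for
 * a buffer starting at address "addr", and return the bytes left over.
 * The preamble consumes at most 15 bytes, so the caller must pass len >= 15.
 */
static size_t
remaining_after_preamble(uintptr_t addr, size_t len)
{
	assert(len >= 15);

	/* byte-at-a-time steps until 8-byte aligned: at most 7 bytes */
	for (; addr & 7; --len)
		addr++;

	/* one 8-byte step if not yet 16-byte aligned */
	if (addr & 8)
	{
		addr += 8;
		len -= 8;
	}

	assert((addr & 15) == 0);
	return len;
}

/*
 * Hypothetical helper: try every starting offset within a 16-byte window
 * and return the smallest remainder the preamble can leave.
 */
static size_t
worst_case_remaining(size_t len)
{
	size_t		worst = len;

	for (uintptr_t off = 0; off < 16; off++)
	{
		size_t		rem = remaining_after_preamble(off, len);

		if (rem < worst)
			worst = rem;
	}
	return worst;
}
```

With len = 80, the worst case is a buffer that starts one byte past a
16-byte boundary: the preamble consumes 7 + 8 = 15 bytes, leaving 65, which
is still enough for one 64-byte iteration. That is the condition the new
Assert(len >= 64) relies on.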
--
John Naylor
Amazon Web Services
diff --git a/src/port/pg_crc32c_armv8.c b/src/port/pg_crc32c_armv8.c
index 5fa57fb4927..682e8a1ca5d 100644
--- a/src/port/pg_crc32c_armv8.c
+++ b/src/port/pg_crc32c_armv8.c
@@ -127,19 +127,28 @@ pg_comp_crc32c_pmull(pg_crc32c crc, const void *data, size_t len)
pg_crc32c crc0 = crc;
const char *buf = data;
+ /*
+ * Immediately fall back to scalar path if the vector path is not
+ * guaranteed to perform at least one iteration after the alignment
+ * preamble.
+ */
+ if (len < 5 * sizeof(uint64x2_t))
+ return pg_comp_crc32c_armv8(crc, data, len);
+
/* align to 16 bytes */
- for (; len && ((uintptr_t) buf & 7); --len)
+ for (; (uintptr_t) buf & 7; --len)
{
crc0 = __crc32cb(crc0, *buf++);
}
- if (((uintptr_t) buf & 8) && len >= 8)
+ if ((uintptr_t) buf & 8)
{
crc0 = __crc32cd(crc0, *(const uint64_t *) buf);
buf += 8;
len -= 8;
}
- if (len >= 64)
+ Assert(len >= 64);
+
{
const char *end = buf + len;
const char *limit = buf + len - 64;