On Tue, Mar 19, 2024 at 04:53:04PM +0700, John Naylor wrote:
> On Tue, Mar 19, 2024 at 10:16 AM Nathan Bossart
> <nathandboss...@gmail.com> wrote:
>> 0002 does the opposite of this.  That is, after we've completed as many
>> blocks as possible, we move the iterator variable back to "end -
>> block_size" and do one final iteration to cover all the remaining elements.
> 
> Sounds similar in principle, but it looks really complicated. I don't
> think the additional loops and branches are a good way to go, either
> for readability or for branch prediction. My sketch has one branch for
> which loop to do, and then performs only one loop. Let's do the
> simplest thing that could work. (I think we might need a helper
> function to do the block, but the rest should be easy)

I tried to trim some of the branches, and came up with the attached patch.
I don't think this is exactly what you were suggesting, but I think it's
relatively close.  My testing showed decent benefits from using 2 vectors
when there aren't enough elements for 4, so I've tried to keep that part
intact.  This changes pg_lfind32() to something like:

        if not many elements
                process one by one

        while enough elements for 4 registers remain
                process with 4 registers

        if no elements remain
                return false

        if more than 2-registers-worth of elements remain
                do one iteration with 2 registers

        do another iteration on last 2-registers-worth of elements

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
diff --git a/src/include/port/pg_lfind.h b/src/include/port/pg_lfind.h
index b8dfa66eef..d154b61555 100644
--- a/src/include/port/pg_lfind.h
+++ b/src/include/port/pg_lfind.h
@@ -80,6 +80,34 @@ pg_lfind8_le(uint8 key, uint8 *base, uint32 nelem)
 	return false;
 }
 
+static inline bool
+lfind32_2reg_helper(const Vector32 keys, uint32 *base)
+{
+	const uint32 nelem_per_vector = sizeof(Vector32) / sizeof(uint32);
+	Vector32	vals1,
+				vals2,
+				result1,
+				result2,
+				result;
+
+	/* load the next block into 2 registers */
+	vector32_load(&vals1, base);
+	vector32_load(&vals2, &base[nelem_per_vector]);
+
+	/* compare each value to the key */
+	result1 = vector32_eq(keys, vals1);
+	result2 = vector32_eq(keys, vals2);
+
+	/* combine the results into a single variable */
+	result = vector32_or(result1, result2);
+
+	/* see if there was a match */
+	if (vector32_is_highbit_set(result))
+		return true;
+
+	return false;
+}
+
 /*
  * pg_lfind32
  *
@@ -100,7 +128,8 @@ pg_lfind32(uint32 key, uint32 *base, uint32 nelem)
 	 */
 	const Vector32 keys = vector32_broadcast(key);	/* load copies of key */
 	const uint32 nelem_per_vector = sizeof(Vector32) / sizeof(uint32);
-	const uint32 nelem_per_iteration = 4 * nelem_per_vector;
+	uint32 nelem_per_iteration = 4 * nelem_per_vector;
+	uint32 remaining;
 
 	/* round down to multiple of elements per iteration */
 	const uint32 tail_idx = nelem & ~(nelem_per_iteration - 1);
@@ -119,6 +148,9 @@ pg_lfind32(uint32 key, uint32 *base, uint32 nelem)
 	}
 #endif
 
+	if (nelem < nelem_per_vector * 2)
+		goto slow_path;
+
 	for (i = 0; i < tail_idx; i += nelem_per_iteration)
 	{
 		Vector32	vals1,
@@ -157,8 +189,21 @@ pg_lfind32(uint32 key, uint32 *base, uint32 nelem)
 			return true;
 		}
 	}
+
+	nelem_per_iteration = 2 * nelem_per_vector;
+	remaining = nelem - i;
+
+	if (remaining == 0)
+		return false;
+
+	if (remaining > nelem_per_iteration &&
+		lfind32_2reg_helper(keys, &base[i]))
+		return true;
+
+	return lfind32_2reg_helper(keys, &base[nelem - nelem_per_iteration]);
 #endif							/* ! USE_NO_SIMD */
 
+slow_path:
 	/* Process the remaining elements one at a time. */
 	for (; i < nelem; i++)
 	{

Reply via email to