Re: Change GUC hashtable to use simplehash?

John Naylor Sun, 21 Jan 2024 18:04:10 -0800

I wrote:
>   fasthash_init(&hs, sizeof(Datum), kind);
>   fasthash_accum(&hs, (char *) &value, sizeof(Datum));
>   return fasthash_final32(&hs, 0);

It occurred to me that it's strange to have two places that length can
be passed. That was a side effect of the original, which used length
to both know how many bytes to read, and to modify the internal seed.
With the incremental API, it doesn't make sense to pass the length (or
a dummy macro) up front -- with a compile-time fixed length, it can't
possibly break a tie, so it's just noise.

0001 removes the length from initialization in the incremental
interface. The standalone functions use length directly the same as
before, but after initialization. Thoughts?

Also, the fasthash_accum call is a bit verbose, because it's often
used in a loop with varlen input. For register-sized values, I think
it's simpler to say this, as done in the search path cache, so maybe a
comment to that effect would be helpful:

hs.accum = value;
fasthash_combine(&hs);

I noticed that we already have a more recent, stronger 64-bit mixer
than murmur64: splitmix64, in pg_prng.c. We could put that, as well as
a better 4-byte mixer [1] in hashfn_unstable.h, for in-memory use.
Maybe with names like "hash_4bytes" etc. so it's not tied to a
specific implementation. I see one simplehash case that can use it,
even if the resowner hash table gets rid of it.

0002 and 0003 use fasthash for dynahash and GUC hash, respectively.
These cannot use the existing cstring hashing directly because of
truncation and case-folding, respectively. (Some simplehash uses can,
but that can come later)

On Sun, Jan 21, 2024 at 8:06 AM Jeff Davis <pg...@j-davis.com> wrote:
>
> After having stepped away from this work for a couple weeks and
> returning to it, I think the comments and/or naming could be more
> clear. We first use the result of haszero64() as a boolean to break out
> of the loop, but then later use it in a more interesting way to count
> the number of remaining bytes.
>
> Perhaps you can take the comment out of the loop and just describe the
> algorithm we're using, and make a note that we have to byteswap first.
> "Indeterminate" could be explained briefly as well.

v15-0004 is a stab at that. As an idea, it also renames zero_bytes_le
to zero_byte_low to reflect the effect better. There might be some
other comment edits needed to explain usage, so I plan to hold on to
this for later. Let me know what you think.

[1] Examples of both in
https://www.boost.org/doc/libs/1_84_0/boost/container_hash/detail/hash_mix.hpp

From ad25c43c264c5857bf41cbf056ac7d4ab0995b40 Mon Sep 17 00:00:00 2001
From: John Naylor <john.nay...@postgresql.org>
Date: Sun, 21 Jan 2024 17:49:22 +0700
Subject: [PATCH v15 3/4] Use fasthash for guc_name_hash

---
 src/backend/utils/misc/guc.c | 31 ++++++++++++++++++++++---------
 1 file changed, 22 insertions(+), 9 deletions(-)

diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 8f65ef3d89..e76ab52618 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -33,6 +33,7 @@
 #include "catalog/objectaccess.h"
 #include "catalog/pg_authid.h"
 #include "catalog/pg_parameter_acl.h"
+#include "common/hashfn_unstable.h"
 #include "guc_internal.h"
 #include "libpq/pqformat.h"
 #include "parser/scansup.h"
@@ -1324,22 +1325,34 @@ guc_name_compare(const char *namea, const char *nameb)
 static uint32
 guc_name_hash(const void *key, Size keysize)
 {
-	uint32		result = 0;
 	const char *name = *(const char *const *) key;
+	const char *const start = name;
+	fasthash_state hs;
+
+	fasthash_init(&hs, 0);
 
 	while (*name)
 	{
-		char		ch = *name++;
+		int			chunk_len = 0;
 
-		/* Case-fold in the same way as guc_name_compare */
-		if (ch >= 'A' && ch <= 'Z')
-			ch += 'a' - 'A';
+		while (chunk_len < FH_SIZEOF_ACCUM && name[chunk_len] != '\0')
+		{
+			chunk_len++;
+			hs.accum <<= BITS_PER_BYTE;
+			hs.accum |= name[chunk_len];
+		}
 
-		/* Merge into hash ... not very bright, but it needn't be */
-		result = pg_rotate_left32(result, 5);
-		result ^= (uint32) ch;
+		/* Quick ASCII-only downcasing */
+		hs.accum |= 0x2020202020202020;
+
+		/* merge into hash and reset for next iteration */
+		fasthash_combine(&hs);
+		hs.accum = 0;
+
+		name += chunk_len;
 	}
-	return result;
+
+	return fasthash_final32(&hs, name - start);
 }
 
 /*
-- 
2.43.0

From e33633ba036ff521482fb24e8984b5865c8515c8 Mon Sep 17 00:00:00 2001
From: John Naylor <john.nay...@postgresql.org>
Date: Sun, 21 Jan 2024 15:33:22 +0700
Subject: [PATCH v15 4/4] WIP: comment edits

Clarify detection of zero bytes when hashing aligned C strings

Discussion: https://postgr.es/m/48e8f8bbe0be9c789f98776c7438244ab7a7cc63.camel%40j-davis.com
---
 src/include/common/hashfn_unstable.h | 38 ++++++++++++++++------------
 1 file changed, 22 insertions(+), 16 deletions(-)

diff --git a/src/include/common/hashfn_unstable.h b/src/include/common/hashfn_unstable.h
index 8e829297fd..8c42e876be 100644
--- a/src/include/common/hashfn_unstable.h
+++ b/src/include/common/hashfn_unstable.h
@@ -209,26 +209,33 @@ fasthash_accum_cstring_aligned(fasthash_state *hs, const char *str)
 {
 	const char *const start = str;
 	int			remainder;
-	uint64		zero_bytes_le;
+	uint64		zero_byte_low;
 
 	Assert(PointerIsAligned(start, uint64));
+
+	/*
+	 * For every chunk of input, check for zero bytes before mixing into the
+	 * hash. The chunk with zeros must contain the NUL terminator. We arrange
+	 * so that zero_byte_low tells us not only that a zero exists, but also
+	 * where it is, so we can hash the remainder of the string.
+	 *
+	 * The haszero64 calculation will set bits corresponding to the lowest
+	 * byte where a zero exists, so that suffices for little-endian machines.
+	 * For big-endian machines, we would need bits set for the highest zero
+	 * byte in the chunk, since the trailing junk past the terminator could
+	 * contain additional zeros. haszero64 does not give us that, so we
+	 * byteswap the chunk first.
+	 */
 	for (;;)
 	{
 		uint64		chunk = *(uint64 *) str;
 
-		/*
-		 * With little-endian representation, we can use this calculation,
-		 * which sets bits in the first byte in the result word that
-		 * corresponds to a zero byte in the original word. The rest of the
-		 * bytes are indeterminate, so cannot be used on big-endian machines
-		 * without either swapping or a bytewise check.
-		 */
 #ifdef WORDS_BIGENDIAN
-		zero_bytes_le = haszero64(pg_bswap64(chunk));
+		zero_byte_low = haszero64(pg_bswap64(chunk));
 #else
-		zero_bytes_le = haszero64(chunk);
+		zero_byte_low = haszero64(chunk);
 #endif
-		if (zero_bytes_le)
+		if (zero_byte_low)
 			break;
 
 		hs->accum = chunk;
@@ -237,12 +244,11 @@ fasthash_accum_cstring_aligned(fasthash_state *hs, const char *str)
 	}
 
 	/*
-	 * For the last word, only use bytes up to the NUL for the hash. Bytes
-	 * with set bits will be 0x80, so calculate the first occurrence of a zero
-	 * byte within the input word by counting the number of trailing (because
-	 * little-endian) zeros and dividing the result by 8.
+	 * Bytes with set bits will be 0x80, so the number of trailing zeros will
+	 * be in the range 7, 15, ..., 63. We turn this into the byte position by
+	 * dividing by 8.
 	 */
-	remainder = pg_rightmost_one_pos64(zero_bytes_le) / BITS_PER_BYTE;
+	remainder = pg_rightmost_one_pos64(zero_byte_low) / BITS_PER_BYTE;
 	fasthash_accum(hs, str, remainder);
 	str += remainder;
 
-- 
2.43.0

From 46e84c4782e2a1291430be4a7d4651de7f387608 Mon Sep 17 00:00:00 2001
From: John Naylor <john.nay...@postgresql.org>
Date: Sun, 21 Jan 2024 19:19:14 +0700
Subject: [PATCH v15 1/4] Initialization of incremental hashing no longer uses
 length

When the incremental interface was written, care was taken to make
sure the answer always matched upstream when the length was known
ahead of time.

When that is not known ahead of time, callers that create a hash
function using the incremental interface were already advised to
incorporate length into the finalizer once it is known e.g. after
hashing a C string. Experimentation has shown that this also works
well when the length is known ahead of time, so there's no advantage
to having two places to pass the length.

Further, if the length is a compile-time constant in this case,
it can't possibly be needed as a tiebreaker for this caller, so
there's not much point in using it to affect the internal seed upon
initialization.

It's worthwhile that the standalone functions fasthash{32,64} can still
give the same answer as the original namesakes, but it's trivial for
them to reset the seed after initialization. Hence, do that and remove
"len" from fasthash_init, as well as the macro for unknown length.

TODO: comment updates
---
 src/backend/catalog/namespace.c      | 2 +-
 src/include/common/hashfn_unstable.h | 8 ++++----
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/src/backend/catalog/namespace.c b/src/backend/catalog/namespace.c
index b610aa6242..8df30b2440 100644
--- a/src/backend/catalog/namespace.c
+++ b/src/backend/catalog/namespace.c
@@ -256,7 +256,7 @@ spcachekey_hash(SearchPathCacheKey key)
 	fasthash_state hs;
 	int			sp_len;

-	fasthash_init(&hs, FH_UNKNOWN_LENGTH, 0);
+	fasthash_init(&hs, 0);

 	hs.accum = key.roleid;
 	fasthash_combine(&hs);
diff --git a/src/include/common/hashfn_unstable.h b/src/include/common/hashfn_unstable.h
index 3d927e1fb1..8e829297fd 100644
--- a/src/include/common/hashfn_unstable.h
+++ b/src/include/common/hashfn_unstable.h
@@ -89,7 +89,6 @@ typedef struct fasthash_state

 #define FH_SIZEOF_ACCUM sizeof(uint64)

-#define FH_UNKNOWN_LENGTH 1

 /*
  * Initialize the hash state.
@@ -99,10 +98,10 @@ typedef struct fasthash_state
  * 'seed' can be zero.
  */
 static inline void
-fasthash_init(fasthash_state *hs, int len, uint64 seed)
+fasthash_init(fasthash_state *hs, uint64 seed)
 {
 	memset(hs, 0, sizeof(fasthash_state));
-	hs->hash = seed ^ (len * 0x880355f21e6d1965);
+	hs->hash = seed ^ 0x880355f21e6d1965;
 }

 /* both the finalizer and part of the combining step */
@@ -328,7 +327,8 @@ fasthash64(const char *k, int len, uint64 seed)
 {
 	fasthash_state hs;

-	fasthash_init(&hs, len, seed);
+	fasthash_init(&hs, 0);
+	hs.hash = seed ^ (len * 0x880355f21e6d1965);

 	while (len >= FH_SIZEOF_ACCUM)
 	{
-- 
2.43.0

From 54ef850a0bb909b242ec553b3ea84853611ec233 Mon Sep 17 00:00:00 2001
From: John Naylor <john.nay...@postgresql.org>
Date: Sun, 21 Jan 2024 16:04:16 +0700
Subject: [PATCH v15 2/4] Use fasthash for dynahash's default string hash

This avoids strlen calls. string_hash is kept around in case
extensions are using it.
---
 src/backend/utils/hash/dynahash.c | 52 +++++++++++++++++++++++++++----
 src/common/hashfn.c               |  3 +-
 2 files changed, 48 insertions(+), 7 deletions(-)

diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index a4152080b5..92c7989575 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -98,6 +98,7 @@
 
 #include "access/xact.h"
 #include "common/hashfn.h"
+#include "common/hashfn_unstable.h"
 #include "port/pg_bitutils.h"
 #include "storage/shmem.h"
 #include "storage/spin.h"
@@ -307,6 +308,45 @@ string_compare(const char *key1, const char *key2, Size keysize)
 	return strncmp(key1, key2, keysize - 1);
 }
 
+/*
+ * default_string_hash: hash function for keys that are NUL-terminated strings.
+ *
+ * NOTE: this is the default hash function if none is specified.
+ */
+static uint32
+default_string_hash(const void *key, Size keysize)
+{
+	const char *k = (const char *) key;
+	Size		s_len = 0;
+	fasthash_state hs;
+
+	/*
+	 * If the string exceeds keysize-1 bytes, we want to hash only that many,
+	 * because when it is copied into the hash table it will be truncated at
+	 * that length.
+	 */
+
+	fasthash_init(&hs, 0);
+
+	while (*k && s_len < keysize - 1)
+	{
+		int			chunk_len = 0;
+
+		while (k[chunk_len] != '\0' &&
+			   s_len < keysize - 1 &&
+			   chunk_len < FH_SIZEOF_ACCUM)
+		{
+			chunk_len++;
+			s_len++;
+		}
+
+		fasthash_accum(&hs, k, chunk_len);
+		k += chunk_len;
+	}
+
+	return fasthash_final32(&hs, s_len);
+}
+
 
 /************************** CREATE ROUTINES **********************/
 
@@ -418,8 +458,8 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 	else
 	{
 		/*
-		 * string_hash used to be considered the default hash method, and in a
-		 * non-assert build it effectively still is.  But we now consider it
+		 * string_hash used to be considered the default hash method, and
+		 * it effectively still was until version 17.  Since version 14 we consider it
 		 * an assertion error to not say HASH_STRINGS explicitly.  To help
 		 * catch mistaken usage of HASH_STRINGS, we also insist on a
 		 * reasonably long string length: if the keysize is only 4 or 8 bytes,
@@ -428,12 +468,12 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 		Assert(flags & HASH_STRINGS);
 		Assert(info->keysize > 8);
 
-		hashp->hash = string_hash;
+		hashp->hash = default_string_hash;
 	}
 
 	/*
 	 * If you don't specify a match function, it defaults to string_compare if
-	 * you used string_hash, and to memcmp otherwise.
+	 * you used default_string_hash, and to memcmp otherwise.
 	 *
 	 * Note: explicitly specifying string_hash is deprecated, because this
 	 * might not work for callers in loadable modules on some platforms due to
@@ -442,7 +482,7 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 	 */
 	if (flags & HASH_COMPARE)
 		hashp->match = info->match;
-	else if (hashp->hash == string_hash)
+	else if (hashp->hash == default_string_hash)
 		hashp->match = (HashCompareFunc) string_compare;
 	else
 		hashp->match = memcmp;
@@ -452,7 +492,7 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 	 */
 	if (flags & HASH_KEYCOPY)
 		hashp->keycopy = info->keycopy;
-	else if (hashp->hash == string_hash)
+	else if (hashp->hash == default_string_hash)
 	{
 		/*
 		 * The signature of keycopy is meant for memcpy(), which returns
diff --git a/src/common/hashfn.c b/src/common/hashfn.c
index 4db468cf85..3090b3cbd9 100644
--- a/src/common/hashfn.c
+++ b/src/common/hashfn.c
@@ -654,7 +654,8 @@ hash_bytes_uint32_extended(uint32 k, uint64 seed)
 /*
  * string_hash: hash function for keys that are NUL-terminated strings.
  *
- * NOTE: this is the default hash function if none is specified.
+ * NOTE: this was the default string hash for dynahash until vesion 17,
+ * and is now here only for backward compatibility.
  */
 uint32
 string_hash(const void *key, Size keysize)
-- 
2.43.0

Re: Change GUC hashtable to use simplehash?

Reply via email to