Re: tuple radix sort

John Naylor Sun, 29 Mar 2026 22:58:20 -0700

I wrote:
> Next step : Test whether it's worth it for the common prefix detection
> to use the normalized datum.

This turned out to be a loser, but in the course of trying it a better
idea occurred to me. v8's prefix detection was really a special-case
optimization where the sort key is all non-negative integers (or all
negative, but that's not common). It's wasted work when the input is
mixed in sign, and for abbreviated keys. It's not much of a waste, but
we can do better.

v9 computes the common prefix during every recursion at the same time
we populate the SortTuple's current byte. That should be practically
free given a modest amount of instruction-level parallelism.

As a demo, I slightly modified the "p5" test (an adversarial case for
radix sort) from

https://www.postgresql.org/message-id/CANWCAZb0STJRAnn8fcVccQmektizNkgqBD_cg%2B0cs1%3Db4WWajQ%40mail.gmail.com

...so that all bytes differ only in the least significant byte or in
the sign bit.

select setseed(0.935349);
with r (rand, i) as (select random(-100, 100), i
from generate_series(1,1_000_000,1) i )
insert into tsort
select case
when r.rand < -95 or r.rand > 95
then r.rand
else 0 end
from r;

vacuum freeze tsort ;

cat bench.sql
select * from tsort order by i offset 100000000;

pgbench -n -t 500 -f bench.sql | grep latency
(on pre-warmed buffers with work_mem 64MB)

master:
latency average = 99.663 ms
v8:
latency average = 102.040 ms
v9:
latency average = 84.856 ms

Here we can see that common prefix detection is useless in v8. It adds
work, but it's not worse than noise in this benchmark, so could be
disregarded. In v9 we proceed directly to performing a radix sort pass
on the most significant byte, which effectively partitions into signed
and unsigned. The next recursion step detects that all bytes are the
same and simultaneously calculates how far to skip so that the next
recursion is productive.

For the case of all non-negative integers:

select setseed(0.935349);
insert into tsort
select random(0, 100)
from generate_series(1,1_000_000,1) i;
vacuum freeze tsort ;

...in v9 the first pass computes the current byte from the normalized
datum and stores it, but this pass is unproductive. There was a risk
that this is more expensive than v8's up-front common prefix detection
because of memory writes. I only see noise-level differences:

master:
latency average = 98.725 ms
v8:
latency average = 75.572 ms
v9:
latency average = 76.635 ms

I intend to commit v9-0001 this week.

I've decided not to commit v8-0003 (message changes) from my last
message since it doesn't really add any value IMO. That can always be
revisited. I'm on the fence about v8-0004 (assert the correct order
also for qsort), but it seems good for consistency. I'd like to at
least continue asserting this for radix sort since it's new this
cycle. To save CI cycles etc, maybe we can eventually hide the check
behind a debug symbol.

--
John Naylor
Amazon Web Services

From 6e94b0a260bc81248375c7d45568a52f17274bf9 Mon Sep 17 00:00:00 2001
From: John Naylor <[email protected]>
Date: Mon, 30 Mar 2026 12:26:58 +0700
Subject: [PATCH v9] Detect common prefixes during radix sort

During the counting step, keep track of the bits that are the same
for the entire input.  If we counted only a single distinct byte,
the next recursion will start at the next byte position that has
more than one distinct byte in the input. This allows us to skip over
multiple passes where the byte is the same for the entire input.

This provides a significant speedup for integers that have some upper
bytes with all-zeros or all-ones, which is common.

Reviewed-by: Chengpeng Yan <[email protected]>
Discussion: https://postgr.es/m/canwcazypgmdsswaa18foxjgxapzvdypswpokfcx32dwh3qz...@mail.gmail.com
---
 src/backend/utils/sort/tuplesort.c | 44 +++++++++++++++++++++++++++---
 1 file changed, 40 insertions(+), 4 deletions(-)

diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 1fc440ea6ca..72c2c2995d8 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -104,6 +104,7 @@
 #include "commands/tablespace.h"
 #include "miscadmin.h"
 #include "pg_trace.h"
+#include "port/pg_bitutils.h"
 #include "storage/shmem.h"
 #include "utils/guc.h"
 #include "utils/memutils.h"
@@ -2659,17 +2660,25 @@ radix_sort_recursive(SortTuple *begin, size_t n_elems, int level, Tuplesortstate
 	int			num_partitions = 0;
 	int			num_remaining;
 	SortSupport ssup = &state->base.sortKeys[0];
+	Datum		ref_datum;
+	Datum		common_upper_bits = 0;
 	size_t		start_offset = 0;
 	SortTuple  *partition_begin = begin;
+	int			next_level;
 
 	/* count number of occurrences of each byte */
+	ref_datum = normalize_datum(begin[0].datum1, ssup);
 	for (SortTuple *st = begin; st < begin + n_elems; st++)
 	{
+		Datum		this_datum;
 		uint8		this_partition;
 
+		this_datum = normalize_datum(st->datum1, ssup);
+		/* accumulate bits different from the reference datum */
+		common_upper_bits |= ref_datum ^ this_datum;
+
 		/* extract the byte for this level from the normalized datum */
-		this_partition = current_byte(normalize_datum(st->datum1, ssup),
-									  level);
+		this_partition = current_byte(this_datum, level);
 
 		/* save it for the permutation step */
 		st->curbyte = this_partition;
@@ -2747,6 +2756,33 @@ radix_sort_recursive(SortTuple *begin, size_t n_elems, int level, Tuplesortstate
 	}
 
 	/* recurse */
+
+	if (num_partitions == 1)
+	{
+		/*
+		 * There is only one distinct byte at the current level. It can happen
+		 * that some subsequent bytes are also the same for all input values,
+		 * such as the upper bytes of small integers. To skip unproductive
+		 * passes for that case, we compute the level where the input has more
+		 * than one distinct byte, so that the next recursion can start there.
+		 */
+		if (common_upper_bits == 0)
+			next_level = sizeof(Datum);
+		else
+		{
+			int			diffpos;
+
+			/*
+			 * The upper bits of common_upper_bits are zero where all datums
+			 * have the same bits.
+			 */
+			diffpos = pg_leftmost_one_pos64(DatumGetUInt64(common_upper_bits));
+			next_level = sizeof(Datum) - 1 - (diffpos / BITS_PER_BYTE);
+		}
+	}
+	else
+		next_level = level + 1;
+
 	for (uint8 *rp = remaining_partitions;
 		 rp < remaining_partitions + num_partitions;
 		 rp++)
@@ -2757,7 +2793,7 @@ radix_sort_recursive(SortTuple *begin, size_t n_elems, int level, Tuplesortstate
 
 		if (num_elements > 1)
 		{
-			if (level < sizeof(Datum) - 1)
+			if (next_level < sizeof(Datum))
 			{
 				if (num_elements < QSORT_THRESHOLD)
 				{
@@ -2770,7 +2806,7 @@ radix_sort_recursive(SortTuple *begin, size_t n_elems, int level, Tuplesortstate
 				{
 					radix_sort_recursive(partition_begin,
 										 num_elements,
-										 level + 1,
+										 next_level,
 										 state);
 				}
 			}
-- 
2.53.0

Re: tuple radix sort

Reply via email to