I wrote: > Next step : Test whether it's worth it for the common prefix detection > to use the normalized datum.
This turned out to be a loser, but in the course of trying it a better idea occurred to me. v8's prefix detection was really a special-case optimization where the sort key is all non-negative integers (or all negative, but that's not common). It's wasted work when the input is mixed in sign, and for abbreviated keys. It's not much of a waste, but we can do better. v9 computes the common prefix during every recursion at the same time we populate the SortTuple's current byte. That should be practically free given a modest amount of instruction-level parallelism. As a demo, I slightly modified the "p5" test (an adversarial case for radix sort) from https://www.postgresql.org/message-id/CANWCAZb0STJRAnn8fcVccQmektizNkgqBD_cg%2B0cs1%3Db4WWajQ%40mail.gmail.com ...so that all bytes differ only in the least significant byte or in the sign bit. select setseed(0.935349); with r (rand, i) as (select random(-100, 100), i from generate_series(1,1_000_000,1) i ) insert into tsort select case when r.rand < -95 or r.rand > 95 then r.rand else 0 end from r; vacuum freeze tsort ; cat bench.sql select * from tsort order by i offset 100000000; pgbench -n -t 500 -f bench.sql | grep latency (on pre-warmed buffers with work_mem 64MB) master: latency average = 99.663 ms v8: latency average = 102.040 ms v9: latency average = 84.856 ms Here we can see that common prefix detection is useless in v8. It adds work, but it's not worse than noise in this benchmark, so could be disregarded. In v9 we proceed directly to performing a radix sort pass on the most significant byte, which effectively partitions into signed and unsigned. The next recursion step detects that all bytes are the same and simultaneously calculates how far to skip so that the next recursion is productive. For the case of all non-negative integers: select setseed(0.935349); insert into tsort select random(0, 100) from generate_series(1,1_000_000,1) i; vacuum freeze tsort ; ...in v9 the first pass computes the current byte from the normalized datum and stores it, but this pass is unproductive. There was a risk that this is more expensive than v8's up-front common prefix detection because of memory writes. I only see noise-level differences: master: latency average = 98.725 ms v8: latency average = 75.572 ms v9: latency average = 76.635 ms I intend to commit v9-0001 this week. I've decided not to commit v8-0003 (message changes) from my last message since it doesn't really add any value IMO. That can always be revisited. I'm on the fence about v8-0004 (assert the correct order also for qsort), but it seems good for consistency. I'd like to at least continue asserting this for radix sort since it's new this cycle. To save CI cycles etc, maybe we can eventually hide the check behind a debug symbol. -- John Naylor Amazon Web Services
From 6e94b0a260bc81248375c7d45568a52f17274bf9 Mon Sep 17 00:00:00 2001 From: John Naylor <[email protected]> Date: Mon, 30 Mar 2026 12:26:58 +0700 Subject: [PATCH v9] Detect common prefixes during radix sort During the counting step, keep track of the bits that are the same for the entire input. If we counted only a single distinct byte, the next recursion will start at the next byte position that has more than one distinct byte in the input. This allows us to skip over multiple passes where the byte is the same for the entire input. This provides a significant speedup for integers that have some upper bytes with all-zeros or all-ones, which is common. Reviewed-by: Chengpeng Yan <[email protected]> Discussion: https://postgr.es/m/canwcazypgmdsswaa18foxjgxapzvdypswpokfcx32dwh3qz...@mail.gmail.com --- src/backend/utils/sort/tuplesort.c | 44 +++++++++++++++++++++++++++--- 1 file changed, 40 insertions(+), 4 deletions(-) diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c index 1fc440ea6ca..72c2c2995d8 100644 --- a/src/backend/utils/sort/tuplesort.c +++ b/src/backend/utils/sort/tuplesort.c @@ -104,6 +104,7 @@ #include "commands/tablespace.h" #include "miscadmin.h" #include "pg_trace.h" +#include "port/pg_bitutils.h" #include "storage/shmem.h" #include "utils/guc.h" #include "utils/memutils.h" @@ -2659,17 +2660,25 @@ radix_sort_recursive(SortTuple *begin, size_t n_elems, int level, Tuplesortstate int num_partitions = 0; int num_remaining; SortSupport ssup = &state->base.sortKeys[0]; + Datum ref_datum; + Datum common_upper_bits = 0; size_t start_offset = 0; SortTuple *partition_begin = begin; + int next_level; /* count number of occurrences of each byte */ + ref_datum = normalize_datum(begin[0].datum1, ssup); for (SortTuple *st = begin; st < begin + n_elems; st++) { + Datum this_datum; uint8 this_partition; + this_datum = normalize_datum(st->datum1, ssup); + /* accumulate bits different from the reference datum */ + common_upper_bits |= ref_datum ^ this_datum; + /* extract the byte for this level from the normalized datum */ - this_partition = current_byte(normalize_datum(st->datum1, ssup), - level); + this_partition = current_byte(this_datum, level); /* save it for the permutation step */ st->curbyte = this_partition; @@ -2747,6 +2756,33 @@ radix_sort_recursive(SortTuple *begin, size_t n_elems, int level, Tuplesortstate } /* recurse */ + + if (num_partitions == 1) + { + /* + * There is only one distinct byte at the current level. It can happen + * that some subsequent bytes are also the same for all input values, + * such as the upper bytes of small integers. To skip unproductive + * passes for that case, we compute the level where the input has more + * than one distinct byte, so that the next recursion can start there. + */ + if (common_upper_bits == 0) + next_level = sizeof(Datum); + else + { + int diffpos; + + /* + * The upper bits of common_upper_bits are zero where all datums + * have the same bits. + */ + diffpos = pg_leftmost_one_pos64(DatumGetUInt64(common_upper_bits)); + next_level = sizeof(Datum) - 1 - (diffpos / BITS_PER_BYTE); + } + } + else + next_level = level + 1; + for (uint8 *rp = remaining_partitions; rp < remaining_partitions + num_partitions; rp++) @@ -2757,7 +2793,7 @@ radix_sort_recursive(SortTuple *begin, size_t n_elems, int level, Tuplesortstate if (num_elements > 1) { - if (level < sizeof(Datum) - 1) + if (next_level < sizeof(Datum)) { if (num_elements < QSORT_THRESHOLD) { @@ -2770,7 +2806,7 @@ radix_sort_recursive(SortTuple *begin, size_t n_elems, int level, Tuplesortstate { radix_sort_recursive(partition_begin, num_elements, - level + 1, + next_level, state); } } -- 2.53.0
