> On Nov 12, 2025, at 21:57, John Naylor <[email protected]> wrote:
>
> I decided I wasn't quite comfortable with the full normalized datum
> sharing space in SortTuple with isnull1. There's too much of a
> cognitive burden involved in deciding when we do or don't need to
> reset isnull1, and there's a non-zero risk of difficult-to-detect
> bugs. For v4 I've instead used one byte of padding space in SortTuple
> to store only the byte used for the current pass. That means we must
> compute the normalized datum on every pass. That's actually better
> than it sounds, since that one byte can now be used directly during
> the "deal" step, rather than having to extract the byte from the
> normalized datum by shifting and masking. That extraction step might
> add significant cycles in cases where a pass requires multiple
> iterations through the "deal" loop. It doesn't seem to make much
> difference in practice, performance-wise, even with the following
> pessimization:
>
> I had to scrap the qsort specialization on the normalized datum for
> small sorts, since it's no longer stored. It could still be worth it
> to compute the "next byte of the normalized datum" and perform a qsort
> on that (falling back to the comparator function in the usual way),
> but I haven't felt the need to resort to that yet. For v4, I just
> divert to qsort_tuple in non-assert builds, with a threshold of 40.
> […]
>
> I made an attempt at clean-up, but it's still under-commented. The
> common prefix detection has moved to a separate patch (v4-0004).
>
> I've been forcing all eligible sorts to use radix sort in assert
> builds, even when small enough that qsort would be faster. Since both
> qsort and in-place radix sort are unstable, it's expected that some
> regression tests need adjustment (v4-0002). One thing surprised me,
> however: The pg_upgrade TAP test that runs regression tests on the old
> cluster showed additional failures that I can't explain. I haven't
> seen this before, but it's possible I never ran TAP tests when testing
> new sort algorithms previously. This doesn't happen if you change the
> current insertion sort threshold, so I haven't been able to reproduce
> it aside from this patch. For that reason I can't completely rule out
> an actual bug, although I actually have more confidence in the
> verification of correct sort order in v4, since isnull1 now never
> changes, just as in master. I found that changing some tests to have
> additional sort keys seems to fix it (v4-0003). I did this in a rather
> quick and haphazard fashion. There's probably a longer conversation to
> be had about making test output more deterministic while still
> covering the intended executor paths.
>
> Aside from that, this seems like a good place to settle down, so I'm
> going to create a CF entry for this. I'll start more rigorous
> performance testing in the near future.
>
Hi John,
I have reviewed the v4 patch set. I applied the patches and ran
"make check" on macOS 14.2.1 (M1), and all regression tests passed.
Overall, the implementation looks solid and the feature is a great
addition. I have a few suggestions regarding code optimization and one
discussion point regarding the data structure design.
Minor Comments / Optimizations:
1. Optimization in radix_sort_tuple (v4-0001)
In radix_sort_tuple, there is a check:
```
if (part.offset == part.next_offset)
```
Since "part" is a local copy of the struct, this check might not
reflect the latest state updated inside the loop. It might be slightly
more efficient to check the array directly:
```
if (partitions[idx].offset == partitions[idx].next_offset)
```
This allows us to detect if a partition has been fully consumed/sorted
within the current pass, potentially saving an iteration of the
"while (num_remaining > 1)" loop.
2. Branchless calculation in Common Prefix (v4-0004)
In sort_byvalue_datum, when calculating the common bits:
```
if (this_common_bits > common_upper_bits)
common_upper_bits = this_common_bits;
```
Since we are looking for the leftmost bit difference, we could
accumulate the differences using bitwise OR. This avoids a conditional
branch inside the loop:
```
common_upper_bits |= this_common_bits;
```
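To spell out why the two forms agree (keys[], nkeys, and the XOR against the
first key are placeholders here, not the actual v4-0004 variables):
```
/*
 * OR-ing the per-key differences can only add lower-order bits; it
 * never raises the leftmost set bit beyond what some individual
 * difference already had.  So pg_leftmost_one_pos64() sees the same
 * top bit as with the max-based accumulation, and the loop body stays
 * branch-free.
 */
uint64      common_upper_bits = 0;

for (i = 1; i < nkeys; i++)
    common_upper_bits |= keys[i] ^ keys[0];

/* fed to pg_leftmost_one_pos64(common_upper_bits | 1) as before */
```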
3. Short-circuit for identical keys (v4-0004)
When calculating common_prefix, if common_upper_bits is 0, it implies
that all non-null keys are identical (for the bits we care about). In
this case, we might be able to skip the radix sort entirely or handle
it as a single partition. Currently, the code handles it by passing
"common_upper_bits | 1" to pg_leftmost_one_pos64, which is safe but
perhaps not the fastest path for identical keys.
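A possible shape for that short-circuit (the return convention is hypothetical;
I haven't checked how the v4-0004 caller is structured):
```
/*
 * If no non-null key differs in datum1, every radix pass would deal
 * all tuples into a single bucket and make no progress.  Bail out and
 * let the caller use the comparator-based path, which also resolves
 * any additional sort keys.
 */
if (common_upper_bits == 0)
    return false;   /* hypothetical: caller falls back to qsort_tuple() */
```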
-----
Discussion: SortTuple Layout and normalize_datum
I have a concern regarding the addition of "uint8 current_byte" to the
SortTuple struct.
1. Struct Size and Abstraction:
SortTuple is the generic atom for the entire sorting module. Adding a
field specific to integer Radix Sort feels like a bit of a leaky
abstraction. While it might fit into padding on some 64-bit platforms
(keeping the size at 32 bytes), relying on padding is fragile. If the
struct size increases, it reduces the number of tuples work_mem can
hold, affecting all sort operations, not just integers.
2. Cache Invalidation (Read vs. Write):
Currently, radix_sort_tuple writes to tup->current_byte in every pass.
This turns what could be a read-only access to the tuple array into a
read-write access, potentially dirtying cache lines and causing
unnecessary memory traffic.
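On point 1: if the padding route is kept, a compile-time guard would at least
make the size assumption explicit and fail loudly if a future field pushes the
struct over the line (the 32-byte bound here is just the figure mentioned
above; adjust as needed):
```
/* Catch any field addition that silently grows SortTuple. */
StaticAssertDecl(sizeof(SortTuple) <= 32,
                 "SortTuple must not grow past 32 bytes");
```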
Proposal: On-the-fly Normalization
We can avoid storing current_byte by calculating the relevant byte on
the fly. While this means re-extracting the byte on every pass, we can
avoid full normalize_datum calls by specializing the extraction per byte.
The normalization logic (mapping signed integers to unsigned space)
essentially flips the sign bit. Instead of doing a full 64-bit
normalization for every byte extraction, we can apply this
transformation only when extracting the most significant byte (MSB).
The logic could look like this:
```
/*
 * Extract the byte at 'level' from 'key'.
 * Level 0 denotes the most significant byte.
 */
static inline uint8
extract_raw_byte(Datum key, int level)
{
    return (key >> (((SIZEOF_DATUM - 1) - level) * 8)) & 0xFF;
}

/*
 * Extract the "logically normalized" byte at 'level'.
 *
 * This effectively replaces the need to call normalize_datum() on the
 * full key:
 * - for non-MSB levels, return the raw byte;
 * - for the MSB level, flip the sign bit if the comparator requires it;
 * - if sorting DESC, invert the result.
 */
static inline uint8
radix_extract_byte(Datum orig, int level, SortSupport ssup)
{
    uint8       byte;

    /* 1. Extract the raw byte first. */
    byte = extract_raw_byte(orig, level);

    /*
     * 2. Apply normalization only to the most significant byte.
     *
     * Mathematically, normalizing a signed integer for unsigned
     * comparison is equivalent to flipping the sign bit (adding
     * 2^(width - 1)).  In the byte domain, that means XORing the MSB
     * with 0x80.
     */
    if (level == 0)
    {
        if (ssup->comparator == ssup_datum_signed_cmp ||
            ssup->comparator == ssup_datum_int32_cmp)
            byte ^= 0x80;
        else
            Assert(ssup->comparator == ssup_datum_unsigned_cmp);
    }

    /*
     * 3. Handle reverse sorts.  Instead of inverting the whole Datum,
     * invert just the current byte.
     */
    if (ssup->ssup_reverse)
        byte = ~byte;

    return byte;
}
```
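For context, the counting and deal steps would then call this helper instead of
reading a stored byte, roughly like the following (loop shape and variable
names are illustrative only):
```
/* counting phase: the tuple array is only read, never written */
for (i = 0; i < ntuples; i++)
{
    uint8       b = radix_extract_byte(tuples[i].datum1, level, ssup);

    counts[b]++;
}
```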
Benefits:
1. Keeps SortTuple minimal: No risk of struct bloat.
2. Read-only passes: The tuple array remains clean in cache during the
counting phase.
3. Reduced instruction count: We avoid the overhead of full
normalize_datum calls for lower bytes.
I'm happy to help verify this approach if you think it's worth
pursuing.
Best regards,
Chengpeng Yan