On Sun, Nov 23, 2025 at 12:33 PM Chengpeng Yan
<[email protected]> wrote:
>
> > On Nov 12, 2025, at 21:57, John Naylor <[email protected]> wrote:
> I have reviewed the v4 patch set. I applied the patches and ran
> "make check" on macOS 14.2.1 (M1), and all regression tests passed.
>
> Overall, the implementation looks solid and the feature is a great
> addition. I have a few suggestions regarding code optimization and one
> discussion point regarding the data structure design.

Thanks for taking a look!

> Minor Comments / Optimizations:
>
> 1. Optimization in radix_sort_tuple (v4-0001)
>
> In radix_sort_tuple, there is a check:
>
> ```
> if (part.offset == part.next_offset)
> ```
>
> Since "part" is a local copy of the struct, this check might not
> reflect the latest state updated inside the loop. It might be slightly
> more efficient to check the array directly:
>
> ```
> if (partitions[idx].offset == partitions[idx].next_offset)
> ```
>
> This allows us to detect if a partition has been fully consumed/sorted
> within the current pass, potentially saving an iteration of the
> "while (num_remaining > 1)" loop.

I wondered about that and then forgot about it. Thanks for bringing
that up; I'll look into it.
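
For the archives, the difference boils down to this (illustration only;
the struct and field names below are my stand-ins, not the patch's
actual definitions):

```
#include <stdbool.h>

/* stand-in for the patch's partition bookkeeping */
typedef struct RadixPartition
{
	int			offset;			/* next slot to fill in this partition */
	int			next_offset;	/* first slot of the following partition */
} RadixPartition;

static bool
partition_done_stale(RadixPartition *partitions, int idx)
{
	RadixPartition part = partitions[idx];	/* value copy: a snapshot */

	/* ... deal loop advances partitions[idx].offset, not "part" ... */

	/* compares the pre-loop snapshot, so a finished partition can be missed */
	return part.offset == part.next_offset;
}

static bool
partition_done_fresh(RadixPartition *partitions, int idx)
{
	/* reads the array element, so updates made during this pass are seen */
	return partitions[idx].offset == partitions[idx].next_offset;
}
```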

> 2. Branchless calculation in Common Prefix (v4-0004)
>
> In sort_byvalue_datum, when calculating the common bits:
>
> ```
> if (this_common_bits > common_upper_bits)
> common_upper_bits = this_common_bits;
> ```
>
> Since we are looking for the leftmost bit difference, we could
> accumulate the differences using bitwise OR. This avoids a conditional
> branch inside the loop:
>
> ```
> common_upper_bits |= this_common_bits;
> ```

Good idea.

> 3. Short-circuit for identical keys (v4-0004)
>
> When calculating common_prefix, if common_upper_bits is 0, it implies
> that all non-null keys are identical (for the bits we care about). In
> this case, we might be able to skip the radix sort entirely or handle
> it as a single partition. Currently, the code handles it by passing
> "common_upper_bits | 1" to pg_leftmost_one_pos64, which is safe but
> perhaps not the most optimal path for identical keys.

Right. If all values of the first sort key are the same, v4 still
wastes a bit of time counting the lowest byte. This shouldn't happen
for abbreviated keys since low cardinality will cause abbreviation to
abort, but it's still possible to hit that case if the first key is an
integer.
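
Roughly, combining the branchless OR with an identical-keys short
circuit could look like this (untested sketch; "norm_keys", "nitems",
and deriving the differences by XOR against the first key are my
guesses at the surrounding code, not necessarily what v4 does):

```
/*
 * Return the number of leading bytes shared by all normalized keys,
 * or -1 if every key is identical and the radix sort can be skipped.
 */
static int
common_prefix_bytes(const uint64 *norm_keys, int nitems)
{
	uint64		common_upper_bits = 0;

	for (int i = 1; i < nitems; i++)
	{
		/* bits in which this key differs from the first key */
		uint64		this_common_bits = norm_keys[i] ^ norm_keys[0];

		/* accumulate every differing bit; no branch inside the loop */
		common_upper_bits |= this_common_bits;
	}

	/* all non-null keys identical: caller can skip sorting this key */
	if (common_upper_bits == 0)
		return -1;

	/* the highest differing bit determines how many leading bytes to skip */
	return 7 - pg_leftmost_one_pos64(common_upper_bits) / 8;
}
```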

> Discussion: SortTuple Layout and normalize_datum
>
> I have a concern regarding the addition of "uint8 current_byte" to the
> SortTuple struct.
>
> 1. Struct Size and Abstraction:
>
> SortTuple is the generic atom for the entire sorting module. Adding a
> field specific to integer Radix Sort feels like a bit of a leaky
> abstraction.

I would say the only generic part of SortTuple is the "void* tuple"
pointer. Most of the rest of the fields are a cached copy of something
extracted from the actual tuple. The exception is "srctape", which is
only used for bookkeeping during external merges. I'm not sure I would
call srctape a leaky abstraction (I would call this a "fat struct"),
but either way we have some precedent for both cached info and for
info specific to one mode of sorting.

> While it might fit into padding on some 64-bit platforms
> (keeping the size at 32 bytes), relying on padding is fragile.

I don't see how any platform of any pointer size can fail to have a
padding byte since "isnull1" is a boolean. A long time ago we used to
allow platforms to have 32-bit booleans, but no longer, so I think
it's actually not fragile. (FYI: the struct is currently 24 bytes.)
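
For reference, the layout is roughly this (paraphrasing tuplesort.h
from memory), with a note on where the proposed byte would land:

```
typedef struct
{
	void	   *tuple;			/* the tuple itself */
	Datum		datum1;			/* value of first key column */
	bool		isnull1;		/* is first key column NULL? */
	/* three padding bytes here on 64-bit builds; current_byte fits in them */
	int			srctape;		/* source tape number, only used while merging */
} SortTuple;					/* 24 bytes with or without current_byte */
```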

> 2. Cache Invalidation (Read vs. Write):
>
> Currently, radix_sort_tuple writes to tup->current_byte in every pass.
> This turns what could be a read-only access to the tuple array into a
> read-write access, potentially dirtying cache lines and causing
> unnecessary memory traffic.
>
> Proposal: On-the-fly Normalization
>
> We can avoid storing current_byte by calculating the relevant byte on
> the fly. While this means re-calculating the byte extraction, we can
> optimize normalize_datum.

That's an interesting idea to consider.

There's one possible disadvantage I can think of: It makes it
difficult to add a fallback qsort specialization that includes the
next "current_byte". How would you teach the qsort comparator to look
at the right "level"?
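
To illustrate (sketch only; I'm pretending normalize_datum takes just
the datum, and counting "level" from the most significant byte is made
up for the example):

```
/* with a cached byte, a fallback comparator needs nothing extra */
static inline int
cmp_cached_byte(const SortTuple *a, const SortTuple *b)
{
	return (int) a->current_byte - (int) b->current_byte;
}

/*
 * With on-the-fly extraction, the comparator must know which byte the
 * current pass is on.  The usual (a, b, state) comparator signature has
 * no slot for that, so the level would have to be stashed somewhere
 * like Tuplesortstate.
 */
static inline int
cmp_on_the_fly(const SortTuple *a, const SortTuple *b, int level)
{
	uint8		byte_a = (uint8)
		(normalize_datum(a->datum1) >> (8 * (7 - level)));
	uint8		byte_b = (uint8)
		(normalize_datum(b->datum1) >> (8 * (7 - level)));

	return (int) byte_a - (int) byte_b;
}
```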

(Aside: Long term, a cached current byte (or more than one byte) could
also come from something other than the pass-by-value datum. I'm
thinking specifically of a blob of bytes made from normalizing a set
of keys. In that case we won't want to fetch the byte every time. The
pass-by-value datum case can remain independent of that, of course.)

This made me think of something tangential: v4's common prefix
skipping doesn't take into account that, for a 4-byte integer key, the
upper 4 bytes of "datum1" can't matter. With a mix of positive and
negative integers, I think it will do the radix sort on all 8 bytes of
"datum1".
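
For example, assuming int32 keys are sign-extended into the 64-bit
datum1 (which is how I read the pass-by-value path):

```
uint64		a = (uint64) (int64) (int32) 5;		/* 0x0000000000000005 */
uint64		b = (uint64) (int64) (int32) -5;	/* 0xFFFFFFFFFFFFFFFB */

/*
 * a ^ b = 0xFFFFFFFFFFFFFFFE: the differing bits span all eight bytes,
 * so the common prefix logic finds nothing to skip, even though the
 * upper four bytes are pure sign extension and carry no information
 * beyond bit 31.
 */
```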

> Benefits:
>
> 1. Keeps SortTuple minimal: No risk of struct bloat.

As I mentioned, I don't believe there is any risk at all.

> 2. Read-only passes: The tuple array remains clean in cache during the
> counting phase.

I think the "deal" step stresses the cache and TLB a lot more than
sequential writes would, but reducing overall write traffic is a good
design choice. We'd need to weigh that against the other things I
mentioned.

> 3. Reduced instruction count: We avoid the overhead of full
> normalize_datum calls for lower bytes.

I'm not quite sure that's a net reduction, since every iteration of
the deal step has to at least re-extract the raw byte. An earlier
version stored the normalized datum and extracted the byte as needed,
but that also diverted to a qsort that compared the normalized datum.
There wasn't a big difference in my limited testing, but it could
matter for some inputs.

> I'm happy to help verify this approach if you think it's worth
> pursuing.

I suspect it won't make a big difference, but I could be wrong, so feel
free. First let me update the patch with some of your other review
items.

--
John Naylor
Amazon Web Services

