On Sun, Nov 23, 2025 at 12:33 PM Chengpeng Yan <[email protected]> wrote:
>
> > On Nov 12, 2025, at 21:57, John Naylor <[email protected]> wrote:
>
> I have reviewed the v4 patch set. I applied the patches and ran
> "make check" on macOS 14.2.1 (M1), and all regression tests passed.
>
> Overall, the implementation looks solid and the feature is a great
> addition. I have a few suggestions regarding code optimization and one
> discussion point regarding the data structure design.
Thanks for taking a look!

> Minor Comments / Optimizations:
>
> 1. Optimization in radix_sort_tuple (v4-0001)
>
> In radix_sort_tuple, there is a check:
>
> ```
> if (part.offset == part.next_offset)
> ```
>
> Since "part" is a local copy of the struct, this check might not
> reflect the latest state updated inside the loop. It might be slightly
> more efficient to check the array directly:
>
> ```
> if (partitions[idx].offset == partitions[idx].next_offset)
> ```
>
> This allows us to detect if a partition has been fully consumed/sorted
> within the current pass, potentially saving an iteration of the
> "while (num_remaining > 1)" loop.

I wondered about that and then forgot about it. Thanks for bringing
that up, I'll look into it.

> 2. Branchless calculation in Common Prefix (v4-0004)
>
> In sort_byvalue_datum, when calculating the common bits:
>
> ```
> if (this_common_bits > common_upper_bits)
>     common_upper_bits = this_common_bits;
> ```
>
> Since we are looking for the leftmost bit difference, we could
> accumulate the differences using bitwise OR. This avoids a conditional
> branch inside the loop:
>
> ```
> common_upper_bits |= this_common_bits;
> ```

Good idea.

> 3. Short-circuit for identical keys (v4-0004)
>
> When calculating common_prefix, if common_upper_bits is 0, it implies
> that all non-null keys are identical (for the bits we care about). In
> this case, we might be able to skip the radix sort entirely or handle
> it as a single partition. Currently, the code handles it by passing
> "common_upper_bits | 1" to pg_leftmost_one_pos64, which is safe but
> perhaps not the most optimal path for identical keys.

Right. If all values of the first sort key are the same, v4 still
wastes a bit of time counting the lowest byte. This shouldn't happen
for abbreviated keys, since low cardinality will cause abbreviation to
abort, but it's still possible to hit that case if the first key is an
integer.
> Discussion: SortTuple Layout and normalize_datum
>
> I have a concern regarding the addition of "uint8 current_byte" to the
> SortTuple struct.
>
> 1. Struct Size and Abstraction:
>
> SortTuple is the generic atom for the entire sorting module. Adding a
> field specific to integer Radix Sort feels like a bit of a leaky
> abstraction.

I would say the only generic part of SortTuple is the "void *tuple"
pointer. Most of the rest of the fields are a cached copy of something
extracted from the actual tuple. The exception is "srctape", which is
only used for bookkeeping during external merges. I'm not sure I would
call srctape a leaky abstraction (I would call this a "fat struct"),
but either way we have some precedent both for cached info and for
info specific to one mode of sorting.

> While it might fit into padding on some 64-bit platforms
> (keeping the size at 32 bytes), relying on padding is fragile.

I don't see how any platform of any pointer size could fail to have a
padding byte, since "isnull1" is a boolean. A long time ago we used to
allow platforms to have 32-bit booleans, but no longer, so I think
it's actually not fragile. (FYI: 24 bytes.)

> 2. Cache Invalidation (Read vs. Write):
>
> Currently, radix_sort_tuple writes to tup->current_byte in every pass.
> This turns what could be a read-only access to the tuple array into a
> read-write access, potentially dirtying cache lines and causing
> unnecessary memory traffic.
>
> Proposal: On-the-fly Normalization
>
> We can avoid storing current_byte by calculating the relevant byte on
> the fly. While this means re-calculating the byte extraction, we can
> optimize normalize_datum.

That's an interesting idea to consider. There's one possible
disadvantage I can think of: it makes it difficult to add a fallback
qsort specialization that includes the next "current_byte". How would
you teach the qsort comparator to look at the right "level"?
(Aside: Long term, a cached current byte (or more than one byte) could
also come from something other than the pass-by-value datum. I'm
thinking specifically of a blob of bytes made from normalizing a set
of keys. In that case we won't want to fetch the byte every time. The
pass-by-value datum case can remain independent of that, of course.)

This made me think of something tangential: v4's common prefix
skipping doesn't take into account that the upper 4 bytes can't matter
for 4-byte integer keys. With a mix of positive and negative integers,
I think it will do the radix sort on all 8 bytes of "datum1".

> Benefits:
>
> 1. Keeps SortTuple minimal: No risk of struct bloat.

As I mentioned, I don't believe there is any risk at all.

> 2. Read-only passes: The tuple array remains clean in cache during the
> counting phase.

I think the "deal" step stresses the cache and TLB a lot more than
sequential writes would, but reducing overall write traffic is a good
design choice. We'd need to weigh that against the other things I
mentioned.

> 3. Reduced instruction count: We avoid the overhead of full
> normalize_datum calls for lower bytes.

I'm not quite sure that's a net reduction, since every iteration of
the deal step has to at least re-extract the raw byte. An earlier
version stored the normalized datum and extracted the byte as needed,
but that also diverted to a qsort that compared the normalized datum.
There wasn't a big difference in my limited testing, but it could
matter for some inputs.

> I'm happy to help verify this approach if you think it's worth
> pursuing.

I suspect it won't make a big difference, but I could be wrong, so
feel free. First let me update the patch with some of your other
review items.

--
John Naylor
Amazon Web Services
