On 24/11/2025 11:55, Alexander Borisov wrote:
> Hey guys, any news?

I finally got around to looking at this, thanks for your patience!

v6-0001-Moving-Perl-functions-Sparse-Array-to-a-common-mo.patch

+1, this makes the existing code more readable, even without the rest of the patches. Thanks for the added comments. I'll review this in more detail, but it seems pretty close to being ready for committing.

Thinking of GenerateSparseArray's API, I think what the caller really wants is to generate a function like:

/*
 * Look up the value of 'x'.
 */
static uint16
lookup(uint16 cp)
{
   ...
}

That's how the PerfectHash module works. The tables should be an implementation detail that the caller doesn't need to know about.


v6-0002-Improve-the-performance-of-Unicode-Normalization-.patch
v6-0003-Refactoring-Unicode-Normalization-Forms-performan.patch

These basically look OK to me too. Some minor issues and ideas for further work:

The 'make update-unicode' and 'ninja update-unicode' targets are broken; they need to be updated for the removal of 'unicode_norm_hashfunc.h'.

GenerateSparseArray adds comments like "/* U+1234 */" to each element. That's nice, but it implies that the elements are Unicode code points. GenerateSparseArray could be used for many other things. Let's use "/* 0x1234 */" instead, or make it somehow configurable.

The generated file is very large, over 1 MB. I guess it doesn't matter all that much, but perhaps we should generate a little more compact code. Maybe we don't need the "/* U+1234 */" comment on every line, but only one comment for each range, for example.

typedef struct
{
	uint8		comb_class;		/* combining class of character */
	uint8		dec_size_flags;	/* size and flags of decomposition code list */
	uint16		dec_index;		/* index into UnicodeDecomp_codepoints, or the
								 * decomposition itself if DECOMP_INLINE */
} pg_unicode_decomposition;

The 'UnicodeDecomp_codepoints' array mentioned in the comment doesn't exist anymore.


Finally, some ideas for packing the arrays more tightly. I'm not sure if these make any difference in practice or are worth the effort, but here we go:

There are only 56 distinct comb_classes, and 18 distinct dec_size_flags, so if we added one more small lookup table for them, they could be packed into a single byte. That would shrink pg_unicode_decomposition from 4 to 3 bytes.

static const uint8 UnicodeDecompSizes[4931] =

The max value stored here is 7, so you could get away with just 3 bits per element.

static const uint16 decomp_map[33752] =

This array consists of multiple ranges, and each range is accessed in a separate piece of code. We could use multiple arrays, one for each range, instead of one big array. Some of the ranges only store a small range of values, so we could use uint8 for them.

- Heikki

