On 24/11/2025 11:55, Alexander Borisov wrote:
> Hey guys, any news?

I finally got around to looking at this, thanks for your patience!

v6-0001-Moving-Perl-functions-Sparse-Array-to-a-common-mo.patch

+1, this makes the existing code more readable, even without the rest of the patches. Thanks for the added comments. I'll review this in more detail, but it seems pretty close to being ready for committing.

Thinking of GenerateSparseArray's API, I think what the caller really wants is to generate a function like:

/*
 * Look up the value of 'x'.
 */
static uint16
lookup(uint16 cp)
{
   ...
}

That's how the PerfectHash module works. The tables should be an implementation detail that the caller doesn't need to know about.


v6-0002-Improve-the-performance-of-Unicode-Normalization-.patch
v6-0003-Refactoring-Unicode-Normalization-Forms-performan.patch

These basically look OK to me too. Some minor issues and ideas for further work:

The 'make update-unicode' and 'ninja update-unicode' targets are broken; they need to be updated for the removal of 'unicode_norm_hashfunc.h'.

GenerateSparseArray adds comments like "/* U+1234 */" to each element. That's nice, but it implies that the elements are Unicode code points. GenerateSparseArray could be used for many other things. Let's use "/* 0x1234 */" instead, or make it somehow configurable.

The generated file is very large, over 1 MB. I guess it doesn't matter all that much, but perhaps we should generate a little more compact code. Maybe we don't need the "/* U+1234 */" comment on every line, but only one comment for each range, for example.

typedef struct
{
	uint8		comb_class;		/* combining class of character */
	uint8		dec_size_flags;	/* size and flags of decomposition code list */
	uint16		dec_index;		/* index into UnicodeDecomp_codepoints, or the
								 * decomposition itself if DECOMP_INLINE */
} pg_unicode_decomposition;

The 'UnicodeDecomp_codepoints' array mentioned in the comment doesn't exist anymore.


Finally, some ideas for packing the arrays more tightly. I'm not sure if these make any difference in practice or are worth the effort, but here we go:

There are only 56 distinct comb_classes, and 18 distinct dec_size_flags, so if we added one more small lookup table for them, they could be packed into a single byte. That would shrink pg_unicode_decomposition from 4 to 3 bytes.

static const uint8 UnicodeDecompSizes[4931] =

The max value stored here is 7, so you could get away with just 3 bits per element.

static const uint16 decomp_map[33752] =

This array consists of multiple ranges, and each range is accessed in a separate piece of code. We could use multiple arrays, one for each range, instead of one big array. Some of the ranges only store a small range of values, so we could use uint8 for them.

- Heikki

