On 30/01/2025 15:39, Alexander Borisov wrote:
The code is fixed; the patch now passes all tests.

Changes from the original patch (v1):
Reduced the main table from 3003 to 1677 records (duplicates removed).
Added records from 0x00 to 0x80 for the fast path.
Renamed the get_case() function to pg_unicode_case_index().

Benchmark numbers have changed a bit.

Nice results!

Because of the modified approach to record lookup in the table, we
managed to:
1. Stop storing Unicode codepoints (unsigned int) in all tables.
2. Reduce the main table from 3003 to 1575 records (duplicates removed).
3. Replace the pointer (essentially uint64_t) with uint8_t in the main table.
4. Reduce the time to find a record in the table.
5. Reduce the size of the final object file.

The approach is generally as follows:
Group Unicode codepoints into ranges in which the difference between
neighboring elements does not exceed the specified limit.
For example, if there are numbers 1, 2, 3, 5, 6 and limit = 1, then
there is a difference of 2 between 3 and 5, which is greater than 1,
so there will be ranges 1-3 and 5-6.
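
For illustration, a minimal sketch of that grouping step (names are
invented; in practice this would live in the table generator, not in the
backend):

#include <stdint.h>

/*
 * Given a sorted list of codepoints, emit ranges in which the gap
 * between neighboring codepoints never exceeds "limit".
 */
typedef struct
{
    uint32_t    start;
    uint32_t    end;
} cp_range;

static int
group_codepoints(const uint32_t *cps, int ncps, uint32_t limit,
                 cp_range *ranges)
{
    int         nranges = 0;

    if (ncps == 0)
        return 0;

    ranges[0].start = ranges[0].end = cps[0];

    for (int i = 1; i < ncps; i++)
    {
        if (cps[i] - ranges[nranges].end > limit)
        {
            /* gap exceeds the limit: close this range, start a new one */
            nranges++;
            ranges[nranges].start = cps[i];
        }
        ranges[nranges].end = cps[i];
    }

    return nranges + 1;
}

With cps = {1, 2, 3, 5, 6} and limit = 1 this produces exactly the two
ranges 1-3 and 5-6 from the example above.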

Then we form a table (let's call it an index table) by combining the
obtained ranges. The table contains uint16_t indexes into the main table.

Then, from the previously obtained ranges, we generate a function
(get_case()) that returns the index into the main table. The function,
in fact, contains only IF/ELSE IF constructs imitating a binary search.
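
For illustration, the generated function might look roughly like this
(range boundaries, offsets, and names are invented, not taken from the
patch):

#include <stdint.h>

/* generated, deduplicated: one uint16_t index per covered codepoint */
extern const uint16_t case_index[];

static inline uint16_t
get_case(uint32_t cp)
{
    /* the generated IF/ELSE IF tree splits the ranges like a binary search */
    if (cp < 0x250)
    {
        if (cp < 0x80)
            return case_index[cp];                  /* fast path: ASCII */
        if (cp >= 0xC0)
            return case_index[cp - 0xC0 + 0x80];    /* 0xC0..0x24F */
    }
    else if (cp >= 0x370 && cp <= 0x3FF)
        return case_index[cp - 0x370 + 0x210];      /* Greek and Coptic */

    /* ... more generated ranges ... */
    return 0;                                       /* index 0 = "no mapping" */
}

The main table is then read as, for example, case_map[get_case(cp)], so
its rows need neither a codepoint field nor duplicate entries.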

Because we are not accessing the main data table directly, we can
exclude duplicates from it, and they make up almost half of the
records. Also, because get_case() encodes all the information about
the Unicode ranges, we don't need to store Unicode codepoints in the
main table. This approach also allowed some checks to be removed,
which improved performance even on the fast path (codepoints < 0x80).

Did you consider using a radix tree? We use that method in src/backend/utils/mb/Unicode/convutils.pm. I'm not sure if that's better or worse than what's proposed here, but it would seem like a more standard technique at least. Or if this is clearly better, then maybe we should switch to this technique in convutils.pm too. A perfect hash would be another alternative; we use that in src/common/unicode_norm.c.

Did you check that these optimizations still win with Unicode version 16 (https://www.postgresql.org/message-id/146349e4-4687-4321-91af-f23557249...@eisentraut.org)? We haven't updated to that yet, but sooner or later we will.

The way you're defining 'pg_unicode_case_index' as a function in the header file won't work. It needs to be a static inline function if it's in the header. Or put it in a .c file.
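
That is, roughly (sketch):

#include <stdint.h>

/*
 * In the header, "static inline" gives every including .c file its own
 * private copy, so there are no duplicate-symbol errors at link time.
 */
static inline uint16_t
pg_unicode_case_index(uint32_t cp)
{
    /* ... generated range checks, as sketched above ... */
    return 0;
}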

Some ideas on how to squeeze this further:

- Instead of having one table that contains Lower/Title/Upper/Fold for every character, it might be better to have four separate tables. I think that would be more cache-friendly: you typically run one of the functions for many different characters in a loop, rather than all of the functions for the same character. You could deduplicate between the tables too: for many ranges of characters, Title=Upper and Lower=Fold. (A rough sketch follows after this list.)

- The characters are stored as 4-byte integers, but the high byte is always 0. Could squeeze those out. Not sure if that would be a win if it makes the accesses unaligned, but you could benchmark that. Alternatively, use that empty byte to store the 'special_case' index, instead of having a separate field for it.

- Many characters that have a special case only need the special case for some of the functions, not all. If you stored the special_case separately for each function (as the high byte in the 'simplemap' field perhaps, as I suggested in the previous point), you could avoid having those "dummy" special cases. (See the second sketch below.)
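
On the first point, roughly what I have in mind (hypothetical names;
entry 0 is assumed to be a zero "no mapping" entry):

#include <stdint.h>

/* declaration just for this sketch; in the patch this would be the
 * (static inline) index function */
extern uint16_t pg_unicode_case_index(uint32_t cp);

/* one simple-mapping array per case kind; ranges where Title == Upper
 * (or Lower == Fold) could point at the same generated array */
extern const uint32_t case_map_lower[];
extern const uint32_t case_map_title[];
extern const uint32_t case_map_upper[];
extern const uint32_t case_map_fold[];

static inline uint32_t
unicode_lowercase_simple(uint32_t cp)
{
    uint32_t    mapped = case_map_lower[pg_unicode_case_index(cp)];

    /* entry 0 is assumed to be zero: no mapping, keep the codepoint */
    return mapped ? mapped : cp;
}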
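
And on the last two points, a sketch of the packed layout (macro, field,
and type names invented; a Unicode codepoint needs at most 21 bits, so
24 bits is plenty):

#include <stdint.h>

/* low 24 bits: simple mapping; high byte: index into the special-case
 * table, 0 = no special case */
#define CASEMAP_CODEPOINT(v)    ((v) & 0x00FFFFFF)
#define CASEMAP_SPECIAL(v)      ((uint8_t) ((v) >> 24))

typedef struct
{
    /*
     * One packed value per case kind (lower/title/upper/fold): each kind
     * carries its own special-case index, so "dummy" special cases for
     * the kinds that don't need one simply become 0.
     */
    uint32_t    simplemap[4];
} pg_case_map_entry;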

--
Heikki Linnakangas
Neon (https://neon.tech)


