On 30/01/2025 15:39, Alexander Borisov wrote:
The code is fixed; the patch now passes all tests.

Changes from the original patch (v1):
Reduced the main table from 3003 to 1677 records (duplicates removed).
Added records from 0x00 to 0x80 for the fast path.
Renamed the get_case() function to pg_unicode_case_index().

Benchmark numbers have changed a bit.

Nice results!

Because of the modified approach to record lookup in the table, we
managed to:
1. Stop storing Unicode codepoints (unsigned int) in all tables.
2. Reduce the main table from 3003 to 1575 records (duplicates removed).
3. Replace the pointer (essentially uint64_t) with uint8_t in the main table.
4. Reduce the time to find a record in the table.
5. Reduce the size of the final object file.

The approach is generally as follows:
Group Unicode codepoints into ranges in which the difference between
neighboring elements does not exceed the specified limit.
For example, if there are numbers 1, 2, 3, 5, 6 and limit = 1, then
there is a difference of 2 between 3 and 5, which is greater than 1,
so there will be ranges 1-3 and 5-6.
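
For illustration, a minimal sketch of that grouping step (names are
invented; in practice this would live in the table generator, not in the
backend):

#include <stdint.h>

/*
 * Given a sorted list of codepoints, emit ranges in which the gap
 * between neighboring codepoints never exceeds "limit".
 */
typedef struct
{
    uint32_t    start;
    uint32_t    end;
} cp_range;

static int
group_codepoints(const uint32_t *cps, int ncps, uint32_t limit,
                 cp_range *ranges)
{
    int         nranges = 0;

    if (ncps == 0)
        return 0;

    ranges[0].start = ranges[0].end = cps[0];

    for (int i = 1; i < ncps; i++)
    {
        if (cps[i] - ranges[nranges].end > limit)
        {
            /* gap exceeds the limit: close this range, start a new one */
            nranges++;
            ranges[nranges].start = cps[i];
        }
        ranges[nranges].end = cps[i];
    }

    return nranges + 1;
}

With cps = {1, 2, 3, 5, 6} and limit = 1 this produces exactly the two
ranges 1-3 and 5-6 from the example above.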

Then we form a table (let's call it an index table) by combining the
obtained ranges. The table contains uint16_t indexes into the main table.

Then, from the previously obtained ranges, we generate a function
(get_case()) that returns the index into the main table. The function,
in fact, contains only IF/ELSE IF constructs imitating a binary search.
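
For illustration, the generated function might look roughly like this
(range boundaries, offsets, and names are invented, not taken from the
patch):

#include <stdint.h>

/* generated, deduplicated: one uint16_t index per covered codepoint */
extern const uint16_t case_index[];

static inline uint16_t
get_case(uint32_t cp)
{
    /* the generated IF/ELSE IF tree splits the ranges like a binary search */
    if (cp < 0x250)
    {
        if (cp < 0x80)
            return case_index[cp];                  /* fast path: ASCII */
        if (cp >= 0xC0)
            return case_index[cp - 0xC0 + 0x80];    /* 0xC0..0x24F */
    }
    else if (cp >= 0x370 && cp <= 0x3FF)
        return case_index[cp - 0x370 + 0x210];      /* Greek and Coptic */

    /* ... more generated ranges ... */
    return 0;                                       /* index 0 = "no mapping" */
}

The main table is then read as, for example, case_map[get_case(cp)], so
its rows need neither a codepoint field nor duplicate entries.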

Because we are not accessing the main data table directly, we can
exclude duplicates from it, and they make up almost half of the
records. Also, because get_case() encodes all the information about
the Unicode ranges, we don't need to store Unicode codepoints in the
main table. This approach also allowed some checks to be removed,
which improved performance even on the fast path (codepoints < 0x80).

Did you consider using a radix tree? We use that method in src/backend/utils/mb/Unicode/convutils.pm. I'm not sure if that's better or worse than what's proposed here, but it would seem like a more standard technique at least. Or if this is clearly better, then maybe we should switch to this technique in convutils.pm too. A perfect hash would be another alternative; we use that in src/common/unicode_norm.c.

Did you check that these optimizations still win with Unicode version 16 (https://www.postgresql.org/message-id/146349e4-4687-4321-91af-f23557249...@eisentraut.org)? We haven't updated to that yet, but sooner or later we will.

The way you're defining 'pg_unicode_case_index' as a function in the header file won't work. It needs to be a static inline function if it's in the header. Or put it in a .c file.
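
That is, roughly (sketch):

#include <stdint.h>

/*
 * In the header, "static inline" gives every including .c file its own
 * private copy, so there are no duplicate-symbol errors at link time.
 */
static inline uint16_t
pg_unicode_case_index(uint32_t cp)
{
    /* ... generated range checks, as sketched above ... */
    return 0;
}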

Some ideas on how to squeeze this further:

- Instead of having one table that contains Lower/Title/Upper/Fold for every character, it might be better to have four separate tables. I think that would be more cache-friendly: you typically run one of the functions for many different characters in a loop, rather than all of the functions for the same character. You could deduplicate between the tables too: for many ranges of characters, Title=Upper and Lower=Fold. (A rough sketch follows after this list.)

- The characters are stored as 4-byte integers, but the high byte is always 0. Could squeeze those out. Not sure if that would be a win if it makes the accesses unaligned, but you could benchmark that. Alternatively, use that empty byte to store the 'special_case' index, instead of having a separate field for it.

- Many characters that have a special case only need the special case for some of the functions, not all. If you stored the special_case separately for each function (as the high byte in the 'simplemap' field perhaps, as I suggested in the previous point), you could avoid having those "dummy" special cases. (See the second sketch below.)
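
On the first point, roughly what I have in mind (hypothetical names;
entry 0 is assumed to be a zero "no mapping" entry):

#include <stdint.h>

/* declaration just for this sketch; in the patch this would be the
 * (static inline) index function */
extern uint16_t pg_unicode_case_index(uint32_t cp);

/* one simple-mapping array per case kind; ranges where Title == Upper
 * (or Lower == Fold) could point at the same generated array */
extern const uint32_t case_map_lower[];
extern const uint32_t case_map_title[];
extern const uint32_t case_map_upper[];
extern const uint32_t case_map_fold[];

static inline uint32_t
unicode_lowercase_simple(uint32_t cp)
{
    uint32_t    mapped = case_map_lower[pg_unicode_case_index(cp)];

    /* entry 0 is assumed to be zero: no mapping, keep the codepoint */
    return mapped ? mapped : cp;
}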
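
And on the last two points, a sketch of the packed layout (macro, field,
and type names invented; a Unicode codepoint needs at most 21 bits, so
24 bits is plenty):

#include <stdint.h>

/* low 24 bits: simple mapping; high byte: index into the special-case
 * table, 0 = no special case */
#define CASEMAP_CODEPOINT(v)    ((v) & 0x00FFFFFF)
#define CASEMAP_SPECIAL(v)      ((uint8_t) ((v) >> 24))

typedef struct
{
    /*
     * One packed value per case kind (lower/title/upper/fold): each kind
     * carries its own special-case index, so "dummy" special cases for
     * the kinds that don't need one simply become 0.
     */
    uint32_t    simplemap[4];
} pg_case_map_entry;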

--
Heikki Linnakangas
Neon (https://neon.tech)


