Re: Improve the performance of Unicode Normalization Forms.

Alexander Borisov Mon, 22 Dec 2025 09:55:47 -0800

22.12.2025 19:24, Heikki Linnakangas wrote:

On 22/12/2025 17:44, Alexander Borisov wrote:

The branching kept bothering me. The algorithm consisted of a function
with branches (essentially a binary search with offsets) and a table of
indexes. I really wanted to get rid of the branches. And then it dawned
on me that instead of building branches with offsets, I could create
another table that would contain these offsets.
The "inspiration" came from my patch Optimization of the is_normalized()
function [0].


Actually, I changed the GenerateSparseArray algorithm, which now
generates a table with offsets instead of a function with branches.
In other words, we now have two tables:
1. A table with offsets, indicating what offset is needed to obtain the
    desired index from the index table.
2. A table with indexes. The data is divided into fixed ranges.

Sounds like you re-invented the radix tree :-). Nothing wrong with that;a radix tree sounds like a good data structure for this.


I looked at the implementation of radix trees for Encoding in
PostgreSQL. There are similarities, of course, but in my implementation,
the access speed is literally O(1).

For example:
return table_index[ table_offset[cp >> 6] + (cp & 63) ];

Access speed in Encoding (radix tree), it is more complicated. If Iunderstand correctly.


In general, there was an idea to store offset and index data in one
table. Not to separate them.
But I'm sure you'll understand everything when you see the code.

 > The GenerateSparseArray adds comments like "/* U+1234 */" to each
 > element. That's nice but it implies that the elements are unicode code
 > points. GenerateSparseArray could be used for many other things. Let's
 > use "/* 0x1234 */" instead, or make it somehow configurable.
 >
 > The generated file is very large, over 1 MB. I guess it doesn't matter
 > all that much, but perhaps we should generate a little more compact

> code. Maybe we don't need the "/* U+1234 */" comment on every line,but

 > only one comment for each range, for example.

I have many places where I need to generate C tables. To make my life
easier, and perhaps the lives of others, I added a small module for
generating "pretty" tables. ./src/tools/PrettyLine.pm
After applying it, the size of the generated header file was reduced
to ~300KB.

I think a fixed number of values per line would look nicer than tryingto pack as much possible on each line.

(This isn't too important of course, this is just about making thegenerated code look prettier.)


With a sense of inner beauty :), when the number of values in a line is
fixed, the lines start to "dance", which doesn't look very good.
On the other hand, we can convert all values to %06X or something like
that. But then it would defeat the whole purpose of the idea — to reduce
the file size and make it look nice.
I use this approach to create many tables, which allows me not to think
about how the numbers will fit in and not waste time on it.
But, it's more a matter of taste.

Thanks, I'll try to find time to take a closer look.


Thank you for your time.

--
Regards,
Alexander Borisov

Re: Improve the performance of Unicode Normalization Forms.

Reply via email to