On 22/12/2025 17:44, Alexander Borisov wrote:
The branching kept bothering me. The algorithm consisted of a function
with branches (essentially a binary search with offsets) and a table of
indexes. I really wanted to get rid of the branches. And then it dawned
on me that instead of building branches with offsets, I could create
another table that would contain these offsets.
The "inspiration" came from my patch Optimization of the is_normalized()
function [0].
Actually, I changed the GenerateSparseArray algorithm, which now
generates a table with offsets instead of a function with branches.
In other words, we now have two tables:
1. A table with offsets, indicating what offset is needed to obtain the
desired index from the index table.
2. A table with indexes. The data is divided into fixed ranges.
Sounds like you re-invented the radix tree :-). Nothing wrong with that;
a radix tree sounds like a good data structure for this.
> The GenerateSparseArray adds comments like "/* U+1234 */" to each
> element. That's nice but it implies that the elements are unicode code
> points. GenerateSparseArray could be used for many other things. Let's
> use "/* 0x1234 */" instead, or make it somehow configurable.
>
> The generated file is very large, over 1 MB. I guess it doesn't matter
> all that much, but perhaps we should generate a little more compact
> code. Maybe we don't need the "/* U+1234 */" comment on every line, but
> only one comment for each range, for example.
I have many places where I need to generate C tables. To make my life
easier, and perhaps the lives of others, I added a small module for
generating "pretty" tables. ./src/tools/PrettyLine.pm
After applying it, the size of the generated header file was reduced
to ~300KB.
I think a fixed number of values per line would look nicer than trying
to pack as much possible on each line.
(This isn't too important of course, this is just about making the
generated code look prettier.)
Thanks, I'll try to find time to take a closer look.
- Heikki