Re: Improve the performance of Unicode Normalization Forms.

Alexander Borisov Thu, 14 Aug 2025 07:41:58 -0700

14.08.2025 01:12, Jeff Davis wrote:

On Mon, 2025-08-11 at 17:21 +0300, Alexander Borisov wrote:


[..]


Comments on the patch itself:

The 0001 patch generalizes the two-step lookup process: first navigate
branches to find the index into a partially-compacted sparse array, and
then use that to get the index into the dense array. The branching
code, the sparse array, and the dense array are all generated code. The
reason for the two-step lookup is to keep the sparse array element size
small (uint16), so that missing elements consume less space (missing
elements still remain for small gaps). The full entry is kept in the
dense array.


Yes, that's exactly right.

GenerateSparseArray.pm would be more descriptive than "Ranges.pm" for
the new module. And we should probably include "sparse" in the name of
the sparse arrays.


I don't mind, that's good.

The new module is responsible for generating the branching code as well
as the sparse array; while the caller is reponsible for the dense
arrays. For case mapping, one sparse array is used for four parallel
arrays for the different case kinds (lower/title/upper/fold).


Yes, that's right.

The use of zero values for different purposes is getting confusing. It
means "doesn't exist", but we are also reserving the zeroth element in
the arrays. Would it be easier to just "#define EMPTY 0xFFFF" and then
have the caller check for it? That way we don't need to reserve the
zeroth array element, which should make it easier to avoid off-by-one
errors.


Initially, that was the case;
However, I subsequently concluded that creating magical values to define
an element that was not found was not an ideal solution.
I felt that 0 was easier to understand.

I'm not entirely sure now, but maybe this will even help get rid of
the comparison to 0. That is, we can get an element from the main table
by index 0. It will be zeroed, and the algorithm will work as it should.
However, it is not certain that this will have any effect on
performance.

In general, I don't have a firm position on this issue.

I think we can simplify the interface, as well. Why does the caller
need to separately generate the ranges, then generate the table, then
generate the branches? It's really all the same action and can be based
on an input hash with a certain structure, and then return both the
table and the branches, right?


Yes, it can be done in one go. But the separation was done
intentionally. It makes the code more understandable, and, more
importantly, it makes it easier for developers "from the street"
to understand it and, if necessary, correct or supplement the algorithm.
These are utilities, code for generating an algorithm that, let's say,
will not be used very often, and I would not "compress" it. Here, it is
more important for us to understand how it works.


--
Regards,
Alexander Borisov

Re: Improve the performance of Unicode Normalization Forms.

Reply via email to