09.08.2025 02:17, Jeff Davis wrote:
> On Tue, 2025-07-08 at 22:42 +0300, Alexander Borisov wrote:
>> Version 3 patches. In version 2 "make -s headerscheck" did not work.
> I ran my own performance tests. What I did was get some test data from
> ICU v76.1 by doing:
> [..]
> Results with perfect hashing (100 iterations):
> Normalization from NFC to NFD with PG: 010.009
> Normalization from NFC to NFD with ICU: 001.580
> Normalization from NFC to NFKD with PG: 009.376
> Normalization from NFC to NFKD with ICU: 000.857
> Normalization from NFD to NFC with PG: 016.026
> Normalization from NFD to NFC with ICU: 001.205
> Normalization from NFD to NFKC with PG: 015.903
> Normalization from NFD to NFKC with ICU: 000.654
> Results with your code (100 iterations):
> Normalization from NFC to NFD with PG: 004.626
> Normalization from NFC to NFD with ICU: 001.577
> Normalization from NFC to NFKD with PG: 004.024
> Normalization from NFC to NFKD with ICU: 000.861
> Normalization from NFD to NFC with PG: 006.846
> Normalization from NFD to NFC with ICU: 001.209
> Normalization from NFD to NFKC with PG: 006.655
> Normalization from NFD to NFKC with ICU: 000.651
> Your patches are a major improvement, but I'm trying to figure out why
> ICU still wins by so much. Thoughts? I didn't investigate much myself
> yet, so it's quite possible there's a bug in my test or something.
I experimented a bit and took a look at the ICU code.
1. The performance test is not entirely accurate.
The thing is that ICU's unorm_normalize() takes the output buffer and
its size from the caller. ICU never calculates how much output it will
produce; it only checks whether the result fits into the buffer we hand
it. PG's unicode_normalize(), on the other hand, accepts only the input
and first has to compute the size of the output buffer itself. So PG
does roughly double the work: it walks the input at least one extra
time. A fairer comparison would first ask ICU for the required buffer
size and only then call the normalization (a sketch of such a two-step
call follows after this list).
2. In PG there are unnecessary table lookups (get_code_entry()) that
could clearly be avoided.
3. In PG the algorithm itself is not optimized.
We could eliminate the decompose_code() function, which is called for
every code point, by removing its recursion and preparing the data in
advance; that is, the data expansion can be done in the Perl script
that generates the tables. With that, the C code becomes a straight
in-place copy with no recursion (a sketch of this also follows after
this list). And then there are all sorts of other minor details.
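
To illustrate point 1, here is a minimal sketch of what a "fairer" ICU
call could look like in the benchmark: a preflight call with a
zero-sized buffer so that ICU also has to report the output length,
followed by the real normalization. The helper name and the error
handling are mine, not the actual test harness.

#include <stdlib.h>
#include <unicode/unorm.h>

static UChar *
icu_normalize_with_preflight(const UChar *src, int32_t src_len,
                             UNormalizationMode mode, int32_t *dst_len)
{
    UErrorCode  status = U_ZERO_ERROR;
    int32_t     needed;
    UChar      *dst;

    /* First pass: ask ICU how large the result will be. */
    needed = unorm_normalize(src, src_len, mode, 0, NULL, 0, &status);
    if (status != U_BUFFER_OVERFLOW_ERROR && U_FAILURE(status))
        return NULL;

    /* Second pass: allocate and normalize for real. */
    status = U_ZERO_ERROR;
    dst = malloc(sizeof(UChar) * needed);
    if (dst == NULL)
        return NULL;
    *dst_len = unorm_normalize(src, src_len, mode, 0, dst, needed, &status);
    if (U_FAILURE(status))
    {
        free(dst);
        return NULL;
    }
    return dst;
}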
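
And a rough sketch of the idea in point 3, with made-up table and field
names: if the Perl script emitted every decomposition already fully
expanded (recursively resolved at table-generation time), the C side
would only need a flat copy loop, with no recursion and no per-element
re-lookup.

#include <stdint.h>

typedef struct
{
    uint32_t    codepoint;      /* code point being decomposed */
    uint8_t     dec_size;       /* length of the fully expanded sequence */
    uint16_t    dec_index;      /* offset into expanded_chars[] */
} expanded_decomposition;

/* both arrays would be emitted by the generator script at build time */
extern const expanded_decomposition expanded_decomp_table[];
extern const uint32_t expanded_chars[];

/* copy the pre-expanded sequence; returns the number of code points written */
static int
copy_expanded(const expanded_decomposition *entry, uint32_t *dst)
{
    const uint32_t *src = &expanded_chars[entry->dec_index];

    for (int i = 0; i < entry->dec_size; i++)
        dst[i] = src[i];
    return entry->dec_size;
}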
Of course, there are other approaches.
I ran this comparison (using your test, Jeff):
1. Without the patch.
Normalization from NFC to NFD with PG: 009.121
Normalization from NFC to NFD with ICU: 000.954
Normalization from NFC to NFKD with PG: 009.048
Normalization from NFC to NFKD with ICU: 000.965
Normalization from NFD to NFC with PG: 014.525
Normalization from NFD to NFC with ICU: 000.623
Normalization from NFD to NFKC with PG: 014.380
Normalization from NFD to NFKC with ICU: 000.666
2. With the patch.
Normalization from NFC to NFD with PG: 005.743
Normalization from NFC to NFD with ICU: 001.005
Normalization from NFC to NFKD with PG: 005.902
Normalization from NFC to NFKD with ICU: 000.963
Normalization from NFD to NFC with PG: 007.837
Normalization from NFD to NFC with ICU: 000.640
Normalization from NFD to NFKC with PG: 008.036
Normalization from NFD to NFKC with ICU: 000.667
3. With the patch, but with direct access to the tables: I built one
large index table (uint16_t) so there is no search at all, only direct
access to the data (see the sketch after these numbers).
Normalization from NFC to NFD with PG: 002.792
Normalization from NFC to NFD with ICU: 000.953
Normalization from NFC to NFKD with PG: 002.865
Normalization from NFC to NFKD with ICU: 000.958
Normalization from NFD to NFC with PG: 004.361
Normalization from NFD to NFC with ICU: 000.651
Normalization from NFD to NFKC with PG: 004.474
Normalization from NFD to NFKC with ICU: 000.668
Direct access clearly gives about another 2x over the patched version,
but adds roughly 1.5MB to the object file size.
This was only done to measure the difference in lookup time.
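
The direct-access experiment looked roughly like this (all names are
illustrative stand-ins, not the actual generated tables): one flat
uint16_t array indexed by code point, mapping straight to a slot in the
decomposition table, so a lookup is a single array access instead of a
binary search or a perfect-hash computation.

#include <stddef.h>
#include <stdint.h>

#define MAX_CODEPOINT   0x10FFFF
#define NO_DECOMP_ENTRY UINT16_MAX      /* sentinel: no decomposition entry */

/* stand-in for the real decomposition entry type */
typedef struct
{
    uint32_t    codepoint;
    /* combining class, flags, decomposition data, ... */
} decomp_entry;

/* both arrays would be generated at build time */
extern const uint16_t       decomp_index[MAX_CODEPOINT + 1];
extern const decomp_entry   decomp_table[];

static inline const decomp_entry *
get_code_entry_direct(uint32_t code)
{
    uint16_t    slot;

    if (code > MAX_CODEPOINT)
        return NULL;
    slot = decomp_index[code];
    return (slot == NO_DECOMP_ENTRY) ? NULL : &decomp_table[slot];
}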
So I would not dig into ICU itself at the moment, especially since we
are taking a different approach.
I am currently working on optimizing unicode_normalize().
I am trying to come up with an improved version of the algorithm in C
by the next commitfest (which will be in September).
P.S.: In general, my goal is to surpass ICU in speed.
I think we're making good progress; let's see how it goes.
--
Regards,
Alexander Borisov