09.08.2025 02:17, Jeff Davis wrote:
> On Tue, 2025-07-08 at 22:42 +0300, Alexander Borisov wrote:
>> Version 3 patches. In version 2 "make -s headerscheck" did not work.
> I ran my own performance tests. What I did was get some test data from
> ICU v76.1 by doing:
> [..]
> Results with perfect hashing (100 iterations):
> Normalization from NFC to NFD with PG: 010.009
> Normalization from NFC to NFD with ICU: 001.580
> Normalization from NFC to NFKD with PG: 009.376
> Normalization from NFC to NFKD with ICU: 000.857
> Normalization from NFD to NFC with PG: 016.026
> Normalization from NFD to NFC with ICU: 001.205
> Normalization from NFD to NFKC with PG: 015.903
> Normalization from NFD to NFKC with ICU: 000.654
> Results with your code (100 iterations):
> Normalization from NFC to NFD with PG: 004.626
> Normalization from NFC to NFD with ICU: 001.577
> Normalization from NFC to NFKD with PG: 004.024
> Normalization from NFC to NFKD with ICU: 000.861
> Normalization from NFD to NFC with PG: 006.846
> Normalization from NFD to NFC with ICU: 001.209
> Normalization from NFD to NFKC with PG: 006.655
> Normalization from NFD to NFKC with ICU: 000.651
> Your patches are a major improvement, but I'm trying to figure out why
> ICU still wins by so much. Thoughts? I didn't investigate much myself
> yet, so it's quite possible there's a bug in my test or something.
I experimented a bit and took a look at the ICU code.
1. The performance test is not entirely accurate.
The thing is that ICU's unorm_normalize() takes the output buffer and
its size from the caller. ICU never calculates how much output it will
produce; it only checks whether the result fits into the buffer we hand
it. PG's unicode_normalize(), on the other hand, accepts only the input
and first has to compute the size of the output buffer itself. So PG
does roughly double the work: it walks the input at least one extra
time. A fairer comparison would first ask ICU for the required buffer
size and only then call the normalization (a sketch of such a two-step
call follows after this list).
2. In PG there are unnecessary table lookups (get_code_entry()) that
could clearly be avoided.
3. In PG the algorithm itself is not optimized.
We could eliminate the decompose_code() function, which is called for
every code point, by removing its recursion and preparing the data in
advance; that is, the data expansion can be done in the Perl script
that generates the tables. With that, the C code becomes a straight
in-place copy with no recursion (a sketch of this also follows after
this list). And then there are all sorts of other minor details.
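
To illustrate point 1, here is a minimal sketch of what a "fairer" ICU
call could look like in the benchmark: a preflight call with a
zero-sized buffer so that ICU also has to report the output length,
followed by the real normalization. The helper name and the error
handling are mine, not the actual test harness.

#include <stdlib.h>
#include <unicode/unorm.h>

static UChar *
icu_normalize_with_preflight(const UChar *src, int32_t src_len,
                             UNormalizationMode mode, int32_t *dst_len)
{
    UErrorCode  status = U_ZERO_ERROR;
    int32_t     needed;
    UChar      *dst;

    /* First pass: ask ICU how large the result will be. */
    needed = unorm_normalize(src, src_len, mode, 0, NULL, 0, &status);
    if (status != U_BUFFER_OVERFLOW_ERROR && U_FAILURE(status))
        return NULL;

    /* Second pass: allocate and normalize for real. */
    status = U_ZERO_ERROR;
    dst = malloc(sizeof(UChar) * needed);
    if (dst == NULL)
        return NULL;
    *dst_len = unorm_normalize(src, src_len, mode, 0, dst, needed, &status);
    if (U_FAILURE(status))
    {
        free(dst);
        return NULL;
    }
    return dst;
}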
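
And a rough sketch of the idea in point 3, with made-up table and field
names: if the Perl script emitted every decomposition already fully
expanded (recursively resolved at table-generation time), the C side
would only need a flat copy loop, with no recursion and no per-element
re-lookup.

#include <stdint.h>

typedef struct
{
    uint32_t    codepoint;      /* code point being decomposed */
    uint8_t     dec_size;       /* length of the fully expanded sequence */
    uint16_t    dec_index;      /* offset into expanded_chars[] */
} expanded_decomposition;

/* both arrays would be emitted by the generator script at build time */
extern const expanded_decomposition expanded_decomp_table[];
extern const uint32_t expanded_chars[];

/* copy the pre-expanded sequence; returns the number of code points written */
static int
copy_expanded(const expanded_decomposition *entry, uint32_t *dst)
{
    const uint32_t *src = &expanded_chars[entry->dec_index];

    for (int i = 0; i < entry->dec_size; i++)
        dst[i] = src[i];
    return entry->dec_size;
}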
Of course, there are other approaches.
I ran this comparison (using your test, Jeff):
1. Without the patch.
Normalization from NFC to NFD with PG: 009.121
Normalization from NFC to NFD with ICU: 000.954
Normalization from NFC to NFKD with PG: 009.048
Normalization from NFC to NFKD with ICU: 000.965
Normalization from NFD to NFC with PG: 014.525
Normalization from NFD to NFC with ICU: 000.623
Normalization from NFD to NFKC with PG: 014.380
Normalization from NFD to NFKC with ICU: 000.666
2. With the patch.
Normalization from NFC to NFD with PG: 005.743
Normalization from NFC to NFD with ICU: 001.005
Normalization from NFC to NFKD with PG: 005.902
Normalization from NFC to NFKD with ICU: 000.963
Normalization from NFD to NFC with PG: 007.837
Normalization from NFD to NFC with ICU: 000.640
Normalization from NFD to NFKC with PG: 008.036
Normalization from NFD to NFKC with ICU: 000.667
3. With the patch, but with direct access to the tables: I built one
large index table (uint16_t) so there is no search at all, only direct
access to the data (see the sketch after these numbers).
Normalization from NFC to NFD with PG: 002.792
Normalization from NFC to NFD with ICU: 000.953
Normalization from NFC to NFKD with PG: 002.865
Normalization from NFC to NFKD with ICU: 000.958
Normalization from NFD to NFC with PG: 004.361
Normalization from NFD to NFC with ICU: 000.651
Normalization from NFD to NFKC with PG: 004.474
Normalization from NFD to NFKC with ICU: 000.668
Direct access clearly gives about another 2x over the patched version,
but adds roughly 1.5MB to the object file size.
This was only done to measure the difference in lookup time.
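
The direct-access experiment looked roughly like this (all names are
illustrative stand-ins, not the actual generated tables): one flat
uint16_t array indexed by code point, mapping straight to a slot in the
decomposition table, so a lookup is a single array access instead of a
binary search or a perfect-hash computation.

#include <stddef.h>
#include <stdint.h>

#define MAX_CODEPOINT   0x10FFFF
#define NO_DECOMP_ENTRY UINT16_MAX      /* sentinel: no decomposition entry */

/* stand-in for the real decomposition entry type */
typedef struct
{
    uint32_t    codepoint;
    /* combining class, flags, decomposition data, ... */
} decomp_entry;

/* both arrays would be generated at build time */
extern const uint16_t       decomp_index[MAX_CODEPOINT + 1];
extern const decomp_entry   decomp_table[];

static inline const decomp_entry *
get_code_entry_direct(uint32_t code)
{
    uint16_t    slot;

    if (code > MAX_CODEPOINT)
        return NULL;
    slot = decomp_index[code];
    return (slot == NO_DECOMP_ENTRY) ? NULL : &decomp_table[slot];
}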
So I would not dig into ICU itself at the moment, especially since we
are taking a different approach.
I am currently working on optimizing unicode_normalize().
I am trying to come up with an improved version of the algorithm in C
by the next commitfest (which will be in September).
P.S.: In general, my goal is to surpass ICU in speed.
I think we're making good progress; let's see how it goes.
--
Regards,
Alexander Borisov