Re: Unicode normalization SQL functions

Daniel Verite Mon, 17 Feb 2020 11:09:09 -0800

  Hi,

I've checked the v3 patch against the results of the normalization
done by ICU [1] on my test data again, and they're identical
(as they were with v1; v2 had the bug discussed upthread, now fixed).


Concerning execution speed, there's excessive CPU usage when
normalizing into NFC or NFKC. Looking at the pre-existing code, it looks
like recompose_code() in unicode_norm.c, which loops over the
UnicodeDecompMain array, might be very costly.
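
As a rough illustration of why a per-pair loop over a large table can
dominate the run time, here is a minimal Python sketch. It is not the
unicode_norm.c code: the five-entry PAIRS table and both function names are
hypothetical, and the binary search is only one possible alternative to a
full scan, not something the patch is known to do.

```python
from bisect import bisect_left

# Hypothetical miniature table of (first, second) -> composed code point.
# A real decomposition table has thousands of entries.
PAIRS = [
    ((0x0041, 0x0300), 0x00C0),  # A + combining grave -> U+00C0
    ((0x0041, 0x0301), 0x00C1),  # A + combining acute -> U+00C1
    ((0x0045, 0x0301), 0x00C9),  # E + combining acute -> U+00C9
    ((0x0065, 0x0301), 0x00E9),  # e + combining acute -> U+00E9
    ((0x006F, 0x0308), 0x00F6),  # o + combining diaeresis -> U+00F6
]


def recompose_linear(first, second):
    """Scan every entry until one matches: O(table size) per call."""
    for pair, composed in PAIRS:
        if pair == (first, second):
            return composed
    return None


# Same data, sorted by (first, second) so it can be binary-searched.
SORTED_PAIRS = sorted(PAIRS)
KEYS = [pair for pair, _ in SORTED_PAIRS]


def recompose_bsearch(first, second):
    """Binary search on the sorted keys: O(log table size) per call."""
    i = bisect_left(KEYS, (first, second))
    if i < len(KEYS) and KEYS[i] == (first, second):
        return SORTED_PAIRS[i][1]
    return None
```

Both functions return the same answers; only the number of entries visited
per call differs, and that cost is paid for every candidate pair of code
points in every input string.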

Another point is that the ICU-based implementation appears
to be significantly faster in all cases, which makes me wonder
why ICU builds should not just use ICU instead of the PG-core
implementation.
To illustrate this, here are the execution times reported by psql for
the queries below exercising the normalization code, both with the
functions provided by the patch and with the equivalent functions
implemented with ICU.
The dataset is ~10 million unique short strings extracted from real
data. Each figure is the median execution time in milliseconds over
10 successive runs with query parallelism off (stddev in parentheses).

 operation  |  core (ms)   | icu (ms)
------------+--------------+-----------
 nfc check  | 4398 (20)    | 3088 (27)
 nfc conv   | 771502 (414) | 5503 (19)
 nfd check  | 4510 (10)    | 2898 (8)
 nfd conv   | 9102 (1)     | 5569 (6)
 nfkc check | 4825 (51)    | 3273 (4)
 nfkc conv  | 772240 (340) | 5763 (8)
 nfkd check | 4794 (4)     | 3170 (39)
 nfkd conv  | 9229 (4)     | 5824 (9)
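
To make the gap easier to read, the core/icu ratio for each row can be
computed directly from the medians above. This is a small sketch; the
TIMINGS dict and the slowdown() helper are names made up for illustration.

```python
# Median timings in milliseconds, copied from the table: (core, icu).
TIMINGS = {
    "nfc check": (4398, 3088),
    "nfc conv": (771502, 5503),
    "nfd check": (4510, 2898),
    "nfd conv": (9102, 5569),
    "nfkc check": (4825, 3273),
    "nfkc conv": (772240, 5763),
    "nfkd check": (4794, 3170),
    "nfkd conv": (9229, 5824),
}


def slowdown(timings):
    """Return how many times slower core is than icu for each operation."""
    return {op: core / icu for op, (core, icu) in timings.items()}


for op, ratio in slowdown(TIMINGS).items():
    print(f"{op:<10} core/icu = {ratio:6.1f}x")
```

With these figures, conversion to NFC and NFKC comes out roughly 140x and
134x slower with the core implementation, while every other row stays
below 2x.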

The queries:

check w/core:
  select count(*) from words where w is $NORM normalized;

conversion w/core:
  select sum(length(normalize(w, $NORM))) from words;

check w/icu:
  select count(*) from words where icu_is_normalized(w, '$NORM');

conversion w/icu:
  select sum(length(icu_normalize(w, '$NORM'))) from words;
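
For readers without either implementation at hand, the two kinds of
queries can be mimicked with Python's standard unicodedata module. This is
only a stand-in to show what "check" and "conversion" compute; it is
neither the patch nor ICU, the helper names and the three sample words are
made up, and it assumes Python 3.8+ for unicodedata.is_normalized().

```python
import unicodedata


def count_normalized(words, form):
    """Analogue of: select count(*) from words where w is $NORM normalized"""
    return sum(1 for w in words if unicodedata.is_normalized(form, w))


def total_normalized_length(words, form):
    """Analogue of: select sum(length(normalize(w, $NORM))) from words"""
    # len() counts code points, assumed here to match length() on UTF-8 text.
    return sum(len(unicodedata.normalize(form, w)) for w in words)


# Hypothetical sample: mixed composed/decomposed accents, plain ASCII,
# and a compatibility ligature (U+FB01) that only NFKC/NFKD expand.
words = ["e\u0301t\u00e9", "abc", "\ufb01n"]

for form in ("NFC", "NFD", "NFKC", "NFKD"):
    print(form, count_normalized(words, form),
          total_normalized_length(words, form))
```

Summing the lengths, as the conversion queries do, forces every string to
be normalized while returning a single row, so the measured time is mostly
normalization work rather than result transfer.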


[1] https://github.com/dverite/icu_ext/blob/master/icu_normalize.c

Best regards,
-- 
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite
