Hi, I've checked the v3 patch against the results of the normalization done by ICU [1] on my test data again, and they're identical (as they were with v1; v2 had the bug discussed upthread, now fixed).
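
For reference, an equality check of this kind can be written directly in SQL. This is only a sketch, not necessarily the exact query that was run. It assumes the "words" table (column w) and the icu_normalize() function from icu_ext that appear in the queries further down, with NFC as the example form.

```sql
-- Count the rows where the core and ICU normalizations disagree.
-- A result of 0 means both implementations produce identical output.
-- Assumption: table words(w text) and icu_normalize() from icu_ext exist.
select count(*)
  from words
 where normalize(w, NFC) is distinct from icu_normalize(w, 'NFC');
```

The same query can be repeated with NFD, NFKC and NFKD to cover the other forms.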

Concerning execution speed, there is excessive CPU usage when normalizing into NFC or NFKC. Looking at the pre-existing code, it looks like recompose_code() in unicode_norm.c, which loops over the UnicodeDecompMain array, might be very costly.

Another point is that the ICU-based implementation appears to be significantly faster in all cases, which makes me wonder why ICU builds should not just use ICU instead of the PG-core implementation.

To illustrate this, here are the execution times reported by psql for the queries below, which exercise the normalization code both with the functions provided by the patch and with the equivalent functions implemented with ICU. The dataset is ~10 million unique short strings extracted from real data, and each number is the median execution time in milliseconds over 10 successive runs with query parallelism off (stddev in parentheses).

 operation  |     core     |    icu
------------+--------------+-----------
 nfc check  | 4398 (20)    | 3088 (27)
 nfc conv   | 771502 (414) | 5503 (19)
 nfd check  | 4510 (10)    | 2898 (8)
 nfd conv   | 9102 (1)     | 5569 (6)
 nfkc check | 4825 (51)    | 3273 (4)
 nfkc conv  | 772240 (340) | 5763 (8)
 nfkd check | 4794 (4)     | 3170 (39)
 nfkd conv  | 9229 (4)     | 5824 (9)

The queries:

check w/core:
  select count(*) from words where w is $NORM normalized;

conversion w/core:
  select sum(length(normalize(w, $NORM))) from words;

check w/icu:
  select count(*) from words where icu_is_normalized(w, '$NORM');

conversion w/icu:
  select sum(length(icu_normalize(w, '$NORM'))) from words;

[1] https://github.com/dverite/icu_ext/blob/master/icu_normalize.c

Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite