On Fri, 20 Dec 2024 at 10:50, Andreas Karlsson <[email protected]> wrote:
>
> Hi,
>
> Jeff pointed out to me that the case conversion functions in ICU have
> UTF-8 specific versions which means we can call those directly if the
> database encoding is UTF-8 and skip having to convert to and from UChar.
>
> Since most people today run their databases in UTF-8 I think this
> optimization is worth it and when measuring on short to medium length
> strings I got a 15-20% speed up. It is still slower than glibc in my
> benchmarks but the gap is smaller now.
>
> SELECT count(upper) FROM (SELECT upper(('Kålhuvud ' || i) COLLATE
> "sv-SE-x-icu") FROM generate_series(1, 1000000) i);
>
> master: ~540 ms
> Patched: ~460 ms
> glibc: ~410 ms
>
> I have also attached a clean up patch for the non-UTF-8 code paths. I
> thought about doing the same for the new UTF-8 code paths but it turned
> out to be a bit messy due to different function signatures for
> ucasemap_utf8ToUpper() and ucasemap_utf8ToLower() vs ucasemap_utf8ToTitle().

I noticed that Jeff's comments from [1] have not yet been addressed, so I
have changed the commitfest entry status to "Waiting on Author". Please
address them and then update the status to "Needs Review".

[1] - https://www.postgresql.org/message-id/[email protected]

Regards,
Vignesh
