On Fri, 20 Dec 2024 at 10:50, Andreas Karlsson <andr...@proxel.se> wrote:
>
> Hi,
>
> Jeff pointed out to me that the case conversion functions in ICU have
> UTF-8 specific versions which means we can call those directly if the
> database encoding is UTF-8 and skip having to convert to and from UChar.
>
> Since most people today run their databases in UTF-8 I think this
> optimization is worth it and when measuring on short to medium length
> strings I got a 15-20% speed up. It is still slower than glibc in my
> benchmarks but the gap is smaller now.
>
> SELECT count(upper) FROM (SELECT upper(('Kålhuvud ' || i) COLLATE
> "sv-SE-x-icu") FROM generate_series(1, 1000000) i);
>
> master: ~540 ms
> Patched: ~460 ms
> glibc: ~410 ms
>
> I have also attached a clean up patch for the non-UTF-8 code paths. I
> thought about doing the same for the new UTF-8 code paths but it turned
> out to be a bit messy due to different function signatures for
> ucasemap_utf8ToUpper() and ucasemap_utf8ToLower() vs ucasemap_utf8ToTitle().

I noticed that Jeff's comments from [1] have not yet been addressed, so I
have changed the commitfest entry status to "Waiting on Author". Please
address them and then set the status back to "Needs Review".

[1] - https://www.postgresql.org/message-id/72c7c2b5848da44caddfe0f20f6c7ebc7c0c6e60.ca...@j-davis.com

Regards,
Vignesh