On 12/28/23 6:57 PM, Jeff Davis wrote:
>
> Attached a more complete version that fixes a few bugs, stabilizes the
> tests, and improves the documentation. I optimized the performance, too
> -- now it's beating both libc's "C.utf8" and ICU "en-US-x-icu" for both
> collation and case mapping (numbers below).
>
> It's really nice to finally be able to have platform-independent tests
> that work on any UTF-8 database.

Thanks for all your work on this, Jeff.

I didn't know about the Unicode stability policy. Since it's formal policy, I agree this provides some assumptions we can safely build on.

I'm working my way through these patches but it's taking a little time for me. I hadn't tracked with the "builtin" thread last summer so I'm coming up to speed on that now too. I'm hopeful that something along these lines gets into pg17. The pg17 cycle is going to start heating up pretty soon.

I agree with merging the threads, even though it makes for a larger patch set. It would be great to get a unified "builtin" provider in place for the next major release.

I also still want to parse my way through your email reply about the two groups of callers, and what this means for user experience.

https://www.postgresql.org/message-id/7774b3a64f51b3375060c29871cf2b02b3e85dab.camel%40j-davis.com

> Let's separate it into groups.
> (1) Callers that use a collation OID or pg_locale_t:
> (2) A long tail of callers that depend on what LC_CTYPE/LC_COLLATE are
> set to, or use ad-hoc ASCII-only semantics:

In the first list it seems that some callers might be influenced by a COLLATE clause or table definition while others always take the database default? It still seems a bit odd to me if different providers can be used for different parts of a single SQL statement. But it might not be so bad - I haven't fully thought through it yet and I'm still kicking the tires on my test build over here.

Is there any reason we couldn't commit the minor cleanup (patch 0001) now? It's less than 200 lines and pretty straightforward.

I wonder whether, after a year of running the builtin provider in production, we might consider adding to the builtin provider a few locales with simple but more reasonable ordering for European and Asian languages? Maybe just grabbing a current version of DUCET with no intention of ever updating it?
I don't know how bad sorting is with plain DUCET for things like French or Spanish or German, but surely it's not as unusable as code point order? Anyone who needs truly accurate or updated or customized linguistic sorting can always go to ICU, and take responsibility for their ICU upgrades, but something basic and static might meet the needs of 99% of postgres users indefinitely.

By the way - my coworker Josh (who I don't think posts much on the hackers list here, but shares an unhealthy inability to look away from database Unicode disasters) passed along this link today which I think is a fantastic list of surprises about programming and strings (generally Unicode).

https://jeremyhussell.blogspot.com/2017/11/falsehoods-programmers-believe-about.html#main

Make sure to click the link to show the counterexamples and discussion, that's the best part.

-Jeremy

PS. I was joking around today that the second best part is that it's proof that people named Jeremy are always brilliant within their field. 😂 Josh said it's just a subset of "always trust people whose names start with J" which seems fair. Unfortunately I can't yet think of a way to shoehorn the rest of the amazing PG hackers on this thread into the joke.

--
http://about.me/jeremy_schneider