Re: Built-in CTYPE provider

Jeremy Schneider Mon, 08 Jan 2024 17:18:07 -0800

On 12/28/23 6:57 PM, Jeff Davis wrote:
> 
> Attached a more complete version that fixes a few bugs, stabilizes the
> tests, and improves the documentation. I optimized the performance, too
> -- now it's beating both libc's "C.utf8" and ICU "en-US-x-icu" for both
> collation and case mapping (numbers below).
> 
> It's really nice to finally be able to have platform-independent tests
> that work on any UTF-8 database.

Thanks for all your work on this, Jeff.

I didn't know about the Unicode stability policy. Since it's formal
policy, I agree this provides some assumptions we can safely build on.

I'm working my way through these patches, but it's taking me a little
time. I hadn't been following the "builtin" thread last summer, so I'm
coming up to speed on that now too. I'm hopeful that something along
these lines gets into pg17. The pg17 cycle is going to start heating up
pretty soon.

I agree with merging the threads, even though it makes for a larger
patch set. It would be great to get a unified "builtin" provider in
place for the next major.

I also still want to parse my way through your email reply about the two
groups of callers, and what this means for user experience.

https://www.postgresql.org/message-id/7774b3a64f51b3375060c29871cf2b02b3e85dab.camel%40j-davis.com

> Let's separate it into groups.
> (1) Callers that use a collation OID or pg_locale_t:
> (2) A long tail of callers that depend on what LC_CTYPE/LC_COLLATE are
> set to, or use ad-hoc ASCII-only semantics:

In the first list it seems that some callers might be influenced by a
COLLATE clause or table definition while others always take the database
default? It still seems a bit odd to me if different providers can be
used for different parts of a single SQL statement. But it might not be
so bad - I haven't fully thought through it yet and I'm still kicking
the tires on my test build over here.

Is there any reason we couldn't commit the minor cleanup (patch 0001)
now? It's less than 200 lines and pretty straightforward.

I wonder whether, after a year of running the builtin provider in
production, we might consider adding to the builtin provider a few
locales with simple but more reasonable ordering for European and Asian
languages? Maybe just grabbing a current version of DUCET with no
intention of ever updating it? I don't know how bad sorting is with
plain DUCET for things like French or Spanish or German, but surely it's
not as unusable as code point order? Anyone who needs truly accurate or
updated or customized linguistic sorting can always go to ICU, and take
responsibility for their ICU upgrades, but something basic and static
might meet the needs of 99% of postgres users indefinitely.
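
To make the code point order concern concrete, here is a small Python
sketch. It is only an illustration and not anything from the patch set.
The word list and the rough_key helper are made up for the example, and
rough_key is a crude stand-in (strip accents, fold case), not DUCET or
any real collation algorithm.

```python
import unicodedata

# Sample words; \u00c4 is "Ä" and \u00e9 is "é", written as escapes so the
# strings are precomposed regardless of how this file is saved.
words = ["zebra", "\u00c4pfel", "apple", "Zoo", "\u00e9cole", "eagle"]

# Python compares str values by code point, which is the same order a
# bytewise comparison of the UTF-8 encodings would give.
code_point_order = sorted(words)


def rough_key(s):
    """Hypothetical helper, NOT DUCET: drop combining marks and fold case."""
    decomposed = unicodedata.normalize("NFD", s)
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    # The original string is a tie-breaker so the sort is deterministic.
    return (base.casefold(), s)


rough_order = sorted(words, key=rough_key)

print(code_point_order)
print(rough_order)
```

In code point order, every uppercase ASCII letter sorts before every
lowercase one, and accented letters land after "z", so "Zoo" comes first
and "Äpfel" and "école" come last. The rough key interleaves them with
their unaccented neighbors, which is closer to what most people expect
even though it is far from real linguistic sorting.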

By the way - my coworker Josh (who I don't think posts much on the
hackers list here, but shares an unhealthy inability to look away from
database Unicode disasters) passed along this link today, which I think
is a fantastic list of surprises about programming and strings
(generally Unicode).

https://jeremyhussell.blogspot.com/2017/11/falsehoods-programmers-believe-about.html#main

Make sure to click the link to show the counterexamples and discussion,
that's the best part.

-Jeremy

PS. I was joking around today that the second best part is that it's
proof that people named Jeremy are always brilliant within their field.
😂 Josh said it's just a subset of "always trust people whose names start
with J" which seems fair. Unfortunately I can't yet think of a way to
shoehorn the rest of the amazing PG hackers on this thread into the joke.

--
http://about.me/jeremy_schneider
