On Mon, 2024-01-08 at 17:17 -0800, Jeremy Schneider wrote:
> I agree with merging the threads, even though it makes for a larger
> patch set. It would be great to get a unified "builtin" provider in
> place for the next major.

I believe that's possible and that this proposal is quite close (hoping
to get something in this 'fest). The tables I'm introducing have
exhaustive test coverage, so there's not a lot of risk there. And the
builtin provider itself is an optional feature, so it won't be
disruptive.

> In the first list it seems that some callers might be influenced by a
> COLLATE clause or table definition while others always take the
> database default? It still seems a bit odd to me if different
> providers can be used for different parts of a single SQL.

Right, that can happen today, and my proposal doesn't change that.
Basically those are cases where the caller was never properly onboarded
to our collation system, like the ts_locale.c routines.

> Is there any reason we couldn't commit the minor cleanup (patch 0001)
> now? It's less than 200 lines and pretty straightforward.

Sure, I'll commit that fairly soon then.

> I wonder if, after a year of running the builtin provider in
> production, whether we might consider adding to the builtin provider
> a few locales with simple but more reasonable ordering for european
> and asian languages?

I won't rule that out completely, but there's a lot we would need to do
to get there. Even assuming we implement that perfectly, we'd need to
make sure it's a reasonable scope for Postgres as a project and that we
have more than one person willing to maintain it. Similar things have
been rejected before for similar reasons.

What I'm proposing for v17 is much simpler: basically some lookup
tables, which is just an extension of what we're already doing for
normalization.

> https://jeremyhussell.blogspot.com/2017/11/falsehoods-programmers-believe-about.html#main
>
> Make sure to click the link to show the counterexamples and
> discussion, that's the best part.

Yes, it can be hard to reason about this stuff but I believe Unicode
provides a lot of good data and guidance to work from.
If you think my proposal relies on one of those assumptions let me
know. To the extent that I do rely on any of those assumptions, it's
mostly to match libc's "C.UTF-8" behavior.

Regards,
	Jeff Davis