Hey Jeremy,

On Tue, Jun 7, 2022 at 12:42 PM Jeremy Schneider
<schnei...@ardentperf.com> wrote:
> Thomas - thanks for the link back to one of the threads. I spent some time 
> reading through that and it’s a lot of material; I haven’t read the whole 
> thread yet. If you have some others that would also be particularly good 
> background, let me know. I’m doing a chunk of this in my spare time at the 
> moment, but I do want to keep getting more up to speed. I was pulled into a 
> bunch of various things related to PostgreSQL and ICU and collation and OS’s 
> over the past couple years, so I learned a lot from on-the-ground experience 
> and I am interested in trying to get a little more involved in the 
> conversation here.

There were more threads, but they mostly say the same things, hence my
current attempt to move from bloviation to trying out the ideas with
actual code :-D

> Personally, I really do think there should at least be an *option* to tell 
> the DB to fully error rather than just warn on version mismatch. Correctness 
> matters to many users, and being able to *trust* string comparisons are 
> correct is pretty damn fundamental all throughout a database. It really 
> doesn’t get any more basic and the potential for bad things to happen is 
> pretty astronomical, if you can’t trust those. I understand the consternation 
> about dealing with upgrades of large & busy databases, but I’m still 
> surprised that the community consensus arrived at the present behavior, and I 
> have a lot of reading to do, to really understand how that happened and where 
> the dialogue is today.

Given that the only thing you could do about it is REINDEX, and yet we
don't even know which indexes needed to be REINDEXed (the problem
Julien and I tried to address, but so far without success), it seemed
highly premature to convert the warning to an error.

I don't think the community consensus is that we have arrived
somewhere, it's more like we're in transit, possibly without a map.
For example 15 gains ICU support for the default collation (= how most
people consume collations), which changes things, and will surely lead
to more people thinking about this problem space.

> Multiple versions of ICU sounds nice for users who need real linguistic 
> collation (like what Oracle and DB2 offer), but I still feel like there needs 
> to be a super simple basic “pseudo-linguistic” collation baked in, that’s 
> “good enough” for 99% of users and that is guaranteed to be the same 
> everywhere on every platform and just won’t ever change. I think glibc needs 
> to be phased out somehow. At a minimum, not the default for new users… to 
> stop the bleeding. If MySQL wasn’t GPL then I’d say to just copy their 
> collations. I’d be reluctant to spend too much time on a POC now though, it 
> feels like my idea is the outlier and the general PG hacker consensus would 
> be to reject this idea. (But maybe I’m wrong?)

Hmm.  Well I personally wouldn't try to write a collator any more
willingly than I'd try to write a new cryptographic algorithm, just
not my bag.  We don't want to handle complaints about our sort order
(we already bat away complaints about glibc's, and I heard an account
from an OS vendor about the non-stop contradictory crank complaints
about sort order they used to receive before they gave up and adopted
ICU).  ICU really is quite authoritative here.

If you mean that you don't even want to have to specify a language
like "en", then note that you don't have to: ICU has a "root" collator
which you can request with an empty string (all other collators apply
cultural tweaks on top of that).  Like everything else, the root
collator has changed over time, though.

With my "distinct" experimental patch (v4), you could set your
database default collation to a specific ICU major version's root
collator like so:

create database ... locale_provider = icu icu_locale = '71:' template
= template0

That'll keep working, even across pg_upgrades to some version of
PostgreSQL far in the future that is linked against ICU 100, by
dlopen'ing the .71 libraries, for as long as you can get your hands on
a libicu71 package or otherwise build your own, and it'll error out if
it can't open that library, which may be the hard error you were
looking for.  If there's an API change in ICU we'll have to make some
changes, but that's already true.

Review/testing/flames/rants/better ideas welcome.

> Anyway, again, apologies for my pants-on-fire email last week. I hope I can 
> enjoy a few beers someday - or coffee for the non-drinkers - with a few other 
> PG collation nerds (which I never set out to be, but it may have befallen me 
> <g>).

+1


Reply via email to