On 11/28/23 2:12 AM, Daniel Verite wrote: > Jeremy Schneider wrote: >> 1) "collation changes are uncommon" (which is relatively correct) >> 2) "most users would rather have ease-of-use than 100% safety, since >> it's uncommon" >> >> And I think this led to the current behavior of issuing a warning rather >> than an error, > There's a technical reason for this being a warning. > If it was an error, any attempt to do anything with the collation > would fail, which includes REINDEX on indexes using > that collation. > And yet that's precisely what you're supposed to do in that > situation.
Indexes are the most obvious and impactful corruption, so the focus is understandable, but there's a bit of a myth in the general public that REINDEX means you fixed your database. I'm concerned that too many people believe this falsehood, and don't realize that things like constraints and partitions can also be affected by a major OS update when leaving PG data files in place. Also there's a tendancy to use amcheck and validate btree indexes, but skip other index types. And of course none of this is possible when people mistakenly use a different major OS for the hot standby (but Postgres willingly sends incorrect query results to users). This is why my original proposal included an update to the ALTER ... REFRESH/COLLATION docs. Today's conventional wisdom suggests this is a safe command. It's really not, if you're using unicode (which everyone is). Fifteen years ago, you needed to buy a french keyboard to type french accented characters. Today it's a quick tap on your phone to get chinese, russian, tibetan, emojis, and any other character you can dream of. All of those surprising characters eventually get stored in Postres databases, often to the surprise of devs and admins, after they discover corruption from an OS upgrade. And to recap some data about historical ICU versions from the torture test: ICU Version | OS Version | en-US characters changed collation | zh-Hans-CN characters changed collation | fr-FR characters changed collation 55.1-7ubuntu0.5 | Ubuntu 16.04.7 LTS | 286,654 | 286,654 | 286,654 60.2-3ubuntu3.1 | Ubuntu 18.04.6 LTS | 23,741 | 24,415 | 23,741 63.1-6 | Ubuntu 19.04 | 688 | 688 | 688 66.1-2ubuntu2 | Ubuntu 20.04.3 LTS | 6,497 | 6,531 | 6,497 70.1-2 | Ubuntu 22.04 LTS | 879 | 887 | 879 The very clear trend here is that most changes are made in the root collation rules, affecting all locales. This means that worrying about specific collation versions of different locales is really focusing on an irrelevant edge case. In ICU development, all the locales tend to change. If anyone thinks the Collation Apocalypse is bad now, I predict the Kubernetes wave will be mayhem. Fifteen years ago it was rare to physically move PG datafiles to a new major OS. Most people would dump and load their databases, sized in GBs. Today's multi-TB Postgres databases have meant an increase of in-place OS upgrades in recent years. People started to either detach/attach their storage, or they used a hot standby. Kubernetes will make these moves across major OS's a daily, effortless occurrence. ICU doesn't fix anything directly. We do need ICU - only because it finally enables us to compile that old version of ICU forever on every new OS we move to going forward. This was simply impossible with glibc. Over the past couple decades, not even Oracle or IBM has managed to deprecate a single version of ICU from a relational database, and not for lack of desire. -Jeremy -- Jeremy Schneider Performance Engineer Amazon Web Services