On Thu, Mar 12, 2026 at 3:20 PM Jeff Davis <[email protected]> wrote:
> pg_upgrade should copy the locale settings to the new cluster as of
> 9637badd9f. If there are still some rough edges here, let me know.
Oh, interesting! I obviously missed that.

> > I don't have total information, but I think they mostly use a single
> > locale. If they have extremely specific needs, they are likely to end
> > up with ICU, else they pick a glibc locale. I have no idea how likely
> > that glibc locale is to match their environment. I wouldn't bet on it
> > being the norm, but I wouldn't bet against whatever they have in the
> > environment being more usable than "C".
>
> That's interesting. In other words, (in your sample) users aren't
> worried about the precise sort order in their native language; it's
> just that ASCII is particularly bad, and almost any "real" locale is
> more appealing.

Some users are definitely worried about the specific ordering; I didn't
mean to imply the contrary. The most demanding customers want an
ordering that is bug-compatible with a previous system they've used;
many can settle for something that is reasonably similar to some other
system they've used, or simply want something appropriate to their
country and language. Some are less demanding and, yeah, just anything
that is basically reasonable is OK.

> Range scans using a natural language collation are dubious. It can't be
> for a prefix search; LIKE 'myprefix%' needs the index to be defined
> with text_pattern_ops (which is code point order), so the default isn't
> going to work for them anyway.
>
> (A prefix search can't be implemented with a range scan in natural
> language collation because, e.g. in the cs_CZ locale, 'cha' does not
> fall between 'ch' and 'ci'.)
>
> So how often is a range scan using a natural language collation
> actually useful? I'm sure there are some real cases, but I'd say it's
> usually a mistake and they are quite possibly getting wrong results.

That seems really pessimistic to me. Many people don't test their code
as thoroughly as they should, but a lot of people test it at least
somewhat.
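To spell out the prefix-search point in SQL (a sketch only; the table
and index names are hypothetical): a LIKE prefix search can use an
ordinary btree index only if the index orders by code points, which is
what text_pattern_ops provides.

```sql
-- An index in code-point order, usable for prefix searches:
CREATE INDEX parts_name_pattern_idx ON parts (name text_pattern_ops);
SELECT * FROM parts WHERE name LIKE 'myprefix%';

-- The cs_CZ example from above: Czech collates 'ch' as a single
-- letter sorting between 'h' and 'i', so 'cha' sorts after 'ci',
-- and a range scan over [ 'ch', 'ci' ) would miss it:
SELECT 'cha' < 'ci' COLLATE "cs_CZ";  -- false
```

That is why a natural-language index can't implement LIKE 'ch%' as a
range scan between 'ch' and 'ci'.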
> I have trouble understanding this perspective: slow all indexes down
> (and sorts, too), and risk index inconsistencies just in case someone
> ends up doing a range scan on one of the indexes? How is that safer?

I mean ... you seem to be imagining that people care about sort and
index lookup speed drastically more than what I have experienced.
People mostly build indexes if they dump and restore, which most people
do very rarely, bordering on never. Or if they reindex occasionally,
which is more common, but still not an everyday occurrence for the
overwhelming majority of users. Index lookups are common, but the
difference between a single-entry lookup on a collation-C index and a
single-entry lookup on a glibc-collation index isn't enough to worry
about in any case I've ever encountered. I'm sure it's possible to
construct cases, but are they really that common? Most queries do a lot
more other work than the time they spend doing btree comparisons, at
least IME.

On the other hand, the possibility of having to use a sequential scan
to find entries between X and Y is a potentially huge blow. I would
never choose to use collate "C" if I thought there was any chance I
would want a range scan. Surely it isn't worth the risk. If I know the
data is something where collation is never going to matter (like
PostgreSQL log lines ingested into a table) then yes, collate "C" makes
a lot of sense.
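The trade-off might look like this in SQL (a sketch with hypothetical
table names, not a recommendation):

```sql
-- For data where collation will never matter, code-point order is
-- fine, and the index gets the cheap comparisons:
CREATE TABLE pg_log_lines (line text COLLATE "C");
CREATE INDEX ON pg_log_lines (line);

-- But an index built under COLLATE "C" cannot serve a range scan in
-- the default natural-language collation, so a predicate like this
-- on human-facing text would need a sort or a sequential scan:
--   SELECT ... FROM people WHERE name BETWEEN 'Aalto' AND 'Berg';
```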
> * the use case must be real (not relying on faulty assumptions about
> lexicographical ordering)
> * the input data must be large enough to benefit from an index scan
> * one of the following must be true:
>   - the index needs to be correlated with the heap order (seems
> unlikely; correlation usually happens with sequences, timestamps,
> etc., not natural language text values); or
>   - needs to be eligible for an index only scan (plausible); or
>   - the amount of data read must be small enough that correlation
> with the heap doesn't matter
> * the result data needs to be small enough for a human to consume it
> (otherwise why bother with natural language?)
> * the performance improvement must be enough to offset the penalty
> for equality searches and index maintenance
>
> While each of those is plausible, when combined, I think it's far from
> the typical case.

I don't. I mean, let's suppose I've got a web page that displays a
report with people's names, street addresses, cities, states,
countries, and telephone numbers. I can click on a column header to
sort by that column. For which of those columns do I want a natural
language sort? I would argue probably most of them. I might know that
my state and country names are all ASCII, and then it doesn't matter.
But surely people's names, for example, or their cities, could contain
non-ASCII characters? And my report might easily be paginated, to avoid
loading too much data in the web browser at once. If displaying the
first page, I want something like LIMIT 100, and I don't want to have
to compute the entire result set to get those first 100 rows.

Back when I wrote web applications, before starting at EDB, this is the
kind of thing that I did all the time, for like ten years straight. I
had plenty of text fields that could have used collate "C", because
they contained things like part numbers or account numbers or whatever.
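The paginated-report query above might look something like this (a
sketch; the table and column names are hypothetical):

```sql
-- With a btree index on last_name in a natural-language collation,
-- this can walk the index and stop after one page, instead of
-- sorting the entire result set just to show the first 100 rows:
SELECT last_name, city, country
  FROM customers
 ORDER BY last_name
 LIMIT 100;
```

If that index were in "C" order instead, the ORDER BY under the
database's natural-language default collation couldn't use it, and the
whole table would have to be read and sorted for every page.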
But anything that contained a person's name or a company name or any
other kind of name that is assigned by humans rather than generated by
a computer could contain any of the characters that humans use, and
should be sorted the way humans like. And isn't this a totally normal
kind of application for somebody to write? It sure was for me.

--
Robert Haas
EDB: http://www.enterprisedb.com
