On Tue, Feb 14, 2023 at 11:23 AM Laurenz Albe <laurenz.a...@cybertec.at> wrote:
> On Tue, 2023-02-14 at 10:31 +0100, Dominique Devienne wrote: > > Hi. Porting a unit test to PostgreSQL, we got a failure related to > ordering. > > > > We've distilled it to the below. The DB is en_US.UTF-8, and the sorting > we get > > does not make sense to me. The same prefix can be sorted differently > based on > > the suffix apprently, which doesn't make any sense to me. > > > > Surely sorting should be "constant left-to-right", no? What are we > missing? > > No, it isn't. That's not how natural language collations work. > Honestly, who expects the same prefix to sort differently based on what comes after, in left-to-right languages? How does one even find out what the (capricious?) rules for sorting in a given collation are? I'm aware of sorting taking numerical numbers in text influencing sort, so "Foo10" comes after "Foo9", but that's not what we are discussing here. "Foo*" and "Foo " have no logical relatioship, like 9 and 10 do. > > I'm already surprised (star) comes before (space), when the latter "comes > > before" the former in both ASCII and UTF-8, but that the two "Foo*" and > "Foo " > > prefixed pairs are not clustered after sorting is just mistifying to me. > So how come? > > Because they compare identical on the first three levels. Any difference > in > letters, accents or case weighs stronger, even if it occurs to the right > of these substrings. > That's completely unintuitive... > > For now we can work-around this by explicitly adding the `collate "C"` on > > the queries underlying that particular test, but that would be wrong in > the > > general case of international strings to sort, so I'd really like to > understand > > what's going on. > > Yes, it soulds like the "C" collation may be best for you. That is, if > you don't > mind that "Z" < "a". > I would mind if I asked for case-insensitive comparisons. So the "C" collation is fine with general UTF-8 encoding? I.e. it will be codepoint ordered OK?