Re: Losing my latin on Ordering...

Dominique Devienne Tue, 14 Feb 2023 03:18:14 -0800

On Tue, Feb 14, 2023 at 11:23 AM Laurenz Albe <laurenz.a...@cybertec.at>
wrote:

> On Tue, 2023-02-14 at 10:31 +0100, Dominique Devienne wrote:
> > Hi. Porting a unit test to PostgreSQL, we got a failure related to
> ordering.
> >
> > We've distilled it to the below. The DB is en_US.UTF-8, and the sorting
> we get
> > does not make sense to me. The same prefix can be sorted differently
> based on
> > the suffix apprently, which doesn't make any sense to me.
> >
> > Surely sorting should be "constant left-to-right", no? What are we
> missing?
>
> No, it isn't.  That's not how natural language collations work.
>

Honestly, who expects the same prefix to sort differently based on what
comes after, in left-to-right languages?
How does one even find out what the (capricious?) rules for sorting in a
given collation are?

I'm aware of sorting taking numerical numbers in text influencing sort, so
"Foo10" comes after "Foo9",
but that's not what we are discussing here. "Foo*" and "Foo " have no
logical relatioship, like 9 and 10 do.

> > I'm already surprised (star) comes before (space), when the latter "comes
> > before" the former in both ASCII and UTF-8, but that the two "Foo*" and
> "Foo "
> > prefixed pairs are not clustered after sorting is just mistifying to me.
> So how come?
>
> Because they compare identical on the first three levels.  Any difference
> in
> letters, accents or case weighs stronger, even if it occurs to the right
> of these substrings.
>

That's completely unintuitive...

> > For now we can work-around this by explicitly adding the `collate "C"` on
> > the queries underlying that particular test, but that would be wrong in
> the
> > general case of international strings to sort, so I'd really like to
> understand
> > what's going on.
>
> Yes, it soulds like the "C" collation may be best for you.  That is, if
> you don't
> mind that "Z" < "a".
>

I would mind if I asked for case-insensitive comparisons.

So the "C" collation is fine with general UTF-8 encoding?
I.e. it will be codepoint ordered OK?

Re: Losing my latin on Ordering...

Reply via email to