Re: Support LIKE with nondeterministic collations

Peter Eisentraut Fri, 03 May 2024 11:58:45 -0700

On 03.05.24 17:47, Daniel Verite wrote:

        Peter Eisentraut wrote:

  However, off the top of my head, this definition has three flaws: (1)
It would make the single-character wildcard effectively an
any-number-of-characters wildcard, but only in some circumstances, which
could be confusing, (2) it would be difficult to compute, because you'd
have to check equality against all possible single-character strings,
and (3) it is not what the SQL standard says.


For #1 we're currently using the definition of a "character" as
being any single code point,

That is the definition that is used throughout SQL and PostgreSQL. We
can't change that without redefining everything. To pick just one
example, the various trim functions also behave in seemingly inconsistent
ways when you apply them to strings in different normalization forms.
The better fix there is to enforce the normalization form somehow.
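
To make the trim example concrete, here is a rough sketch in Python rather
than SQL. It assumes Python's str.strip() is a fair stand-in for a
trim function that works code point by code point; the helper name
trim_chars is made up for this illustration and is not a PostgreSQL
function. The same visible string gives different results depending on
its normalization form:

```python
import unicodedata


def trim_chars(s: str, chars: str) -> str:
    """Hypothetical helper: strip any code point in `chars` from both ends.

    str.strip() compares individual code points, which is the assumption
    this sketch relies on to mimic a code-point-based trim.
    """
    return s.strip(chars)


# The same visible text in two normalization forms.
nfc = unicodedata.normalize("NFC", "\u00e9t\u00e9")  # 3 code points
nfd = unicodedata.normalize("NFD", "\u00e9t\u00e9")  # 5 code points: e, U+0301, t, e, U+0301

# In NFC the accented letter is one code point (U+00E9), so trimming "e"
# removes nothing.
trimmed_nfc = trim_chars(nfc, "e")

# In NFD the leading "e" is its own code point, so it is stripped and its
# combining accent is left dangling; the trailing U+0301 protects the last "e".
trimmed_nfd = trim_chars(nfd, "e")

print(len(nfc), len(nfd))
print([hex(ord(c)) for c in trimmed_nfc])
print([hex(ord(c)) for c in trimmed_nfd])
```

Normalizing both inputs to one form before trimming (for example with
unicodedata.normalize("NFC", ...)) makes the two cases agree, which is
the "enforce the normalization form" direction.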

Intuitively I think that our interpretation of "character" here should
be whatever sequence of code points are between character
boundaries [1], and that the equality of such characters would be the
equality of their sequences of code points, with the string equality
check of the collation, whatever the length of these sequences.

[1]:
https://unicode-org.github.io/icu/userguide/boundaryanalysis/#character-boundary

Even that page says that what we are calling a character here is really
called a grapheme cluster.

In a different world, pattern matching, character trimming, etc. would
work by grapheme, but they do not.
