On 12/5/23 3:46 PM, Jeff Davis wrote:
> === Character Classification ===
>
> Character classification is used for regexes, e.g. whether a character
> is a member of the "[[:digit:]]" ("\d") or "[[:punct:]]" class.
> Unicode defines what character properties map into these classes in
> TR #18 [1], specifying both a "Standard" variant and a "POSIX
> Compatible" variant. The main difference with the POSIX variant is
> that symbols count as punctuation.
>
> === LOWER()/INITCAP()/UPPER() ===
>
> The LOWER() and UPPER() functions are defined in the SQL spec with
> surprising detail, relying on specific Unicode General Category
> assignments. How to map characters seems to be left (implicitly) up
> to Unicode. If the input string is normalized, the output string must
> be normalized, too. Weirdly, there's no room in the SQL spec to
> localize LOWER()/UPPER() at all to handle issues like [1]. Also, the
> standard specifies one example, which is that "ß" becomes "SS" when
> folded to upper case. INITCAP() is not in the SQL spec.
>
> === Questions ===
>
> * Is a built-in ctype provider a reasonable direction for Postgres as
>   a project?
> * Does it feel like it would be simpler or more complex than what
>   we're doing now?
> * Do we want to just try to improve our ICU support instead?
> * Do we want the built-in provider to be one thing, or have a few
>   options (e.g. "standard" or "posix" character classification;
>   "simple" or "full" case mapping)?
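As a quick illustration of the "simple" vs. "full" case mapping distinction in that last question: the spec's own example ("ß" folding to "SS") only works under full case mapping, because the result is a different length than the input. This is just a sketch in Python, whose str.upper() happens to implement full Unicode case mapping; it is not PostgreSQL code.

```python
# Full case mapping: "ß" (U+00DF) uppercases to the two-character "SS",
# the one example the SQL spec calls out. Python's str.upper() implements
# full Unicode case mapping, so it exhibits that behavior.
s = "stra\u00dfe"               # "straße"
print(s.upper())                # "STRASSE" -- full mapping expands the string
print(len(s), len(s.upper()))  # 6 7 -- output is longer than the input

# A "simple" (one-to-one) mapping, by contrast, would have to leave "ß"
# unchanged, since its simple uppercase mapping in UnicodeData.txt is itself.
```

The length change is exactly why a built-in provider would need to decide between the two variants up front: simple mapping can be done in place, full mapping cannot.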
Generally, I am in favor of this - I think we need to move in the direction of having an in-database option around Unicode for PG users, given how easy it is for administrators to mis-manage dependencies. Especially when OS admins can be different from DB admins, and when nobody really understands the risks of changing libs during in-place moves to new operating systems - except for like 4 of us on the mailing lists.

My biggest concern is around maintenance. Every year Unicode assigns new characters to previously unassigned code points, and those code points can of course already be stored in old databases before libs are updated. When users start to notice that the regex [[:digit:]] or upper/lower functions aren't working correctly with characters in their DB, they'll probably come asking for fixes. And we may end up with something like the timezone database, where we need to periodically add a more current ruleset - albeit shipped alongside the old one as a new version in this case.

Here are direct links to charts of newly assigned characters from the last few Unicode updates:

2022: https://www.unicode.org/charts/PDF/Unicode-15.0/
2021: https://www.unicode.org/charts/PDF/Unicode-14.0/
2020: https://www.unicode.org/charts/PDF/Unicode-13.0/
2019: https://www.unicode.org/charts/PDF/Unicode-12.0/

If I'm reading the Unicode 15 update correctly, PostgreSQL regex expressions with [[:digit:]] will not correctly identify Kaktovik or Nag Mundari or Kawi digits without that update to character type specs. If I'm reading the Unicode 12 update correctly, then upper/lower functions aren't going to work correctly on the Latin Glottal A, I and U characters without that update to character type specs. Overall I see a lot fewer Unicode updates involving upper/lower than I do involving digits - especially since new scripts often bring their own numbering characters, which makes new digits more common.
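The version skew is easy to see from any runtime that bundles its own Unicode tables. A sketch in Python (not PostgreSQL code): unicodedata reports which Unicode version the interpreter was built against, and whether the Unicode-15.0 digits below are classified as digits depends entirely on that version. The specific code points are KAWI DIGIT FIVE (U+11F55) and NAG MUNDARI DIGIT FIVE (U+1E4F5), both assigned in Unicode 15.0.

```python
import re
import unicodedata

# Which Unicode version this interpreter's tables were built from; the
# classification results below depend entirely on it. On tables older than
# 15.0 the new characters come back as "Cn" (unassigned) and fail \d.
print("Unicode version:", unicodedata.unidata_version)

for name, ch in [
    ("ASCII DIGIT FIVE", "5"),
    ("KAWI DIGIT FIVE", "\U00011F55"),          # assigned in Unicode 15.0
    ("NAG MUNDARI DIGIT FIVE", "\U0001E4F5"),   # assigned in Unicode 15.0
]:
    category = unicodedata.category(ch)       # "Nd" once the tables know it
    matches = bool(re.fullmatch(r"\d", ch))   # \d follows the same tables
    print(f"{name}: category={category}, matches \\d: {matches}")
```

Stored data doesn't care which tables the server was built with, which is exactly the lag problem: the characters can be sitting in a column long before the classification code learns what they are.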
But let's remember that people like to build indexes on expressions using functions like upper/lower, for case-insensitive searching. It's another case where the index will be corrupted if someone happened to store Latin Glottal vowels in their database and we then update libs to the latest character type rules. So even with something as basic as character type, if we're going to do it right, we still need to either version it or definitively decide that we're never going to support newly added Unicode characters like the Latin Glottals.

-Jeremy

--
http://about.me/jeremy_schneider