On 12/5/23 3:46 PM, Jeff Davis wrote:
> === Character Classification ===
>
> Character classification is used for regexes, e.g. whether a character
> is a member of the "[[:digit:]]" ("\d") or "[[:punct:]]" class.
> Unicode defines what character properties map into these classes in
> TR #18 [1], specifying both a "Standard" variant and a "POSIX
> Compatible" variant. The main difference with the POSIX variant is
> that symbols count as punctuation.
>
> === LOWER()/INITCAP()/UPPER() ===
>
> The LOWER() and UPPER() functions are defined in the SQL spec with
> surprising detail, relying on specific Unicode General Category
> assignments. How to map characters seems to be left (implicitly) up
> to Unicode. If the input string is normalized, the output string must
> be normalized, too. Weirdly, there's no room in the SQL spec to
> localize LOWER()/UPPER() at all to handle issues like [1]. Also, the
> standard specifies one example, which is that "ß" becomes "SS" when
> folded to upper case. INITCAP() is not in the SQL spec.
>
> === Questions ===
>
> * Is a built-in ctype provider a reasonable direction for Postgres as
>   a project?
> * Does it feel like it would be simpler or more complex than what
>   we're doing now?
> * Do we want to just try to improve our ICU support instead?
> * Do we want the built-in provider to be one thing, or have a few
>   options (e.g. "standard" or "posix" character classification;
>   "simple" or "full" case mapping)?
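As a quick illustration of the "simple" vs. "full" case mapping distinction in that last question: the spec's own example ("ß" folding to "SS") only works under full case mapping, because the result is a different length than the input. This is just a sketch in Python, whose str.upper() happens to implement full Unicode case mapping; it is not PostgreSQL code.

```python
# Full case mapping: "ß" (U+00DF) uppercases to the two-character "SS",
# the one example the SQL spec calls out. Python's str.upper() implements
# full Unicode case mapping, so it exhibits that behavior.
s = "stra\u00dfe"               # "straße"
print(s.upper())                # "STRASSE" -- full mapping expands the string
print(len(s), len(s.upper()))  # 6 7 -- output is longer than the input

# A "simple" (one-to-one) mapping, by contrast, would have to leave "ß"
# unchanged, since its simple uppercase mapping in UnicodeData.txt is itself.
```

The length change is exactly why a built-in provider would need to decide between the two variants up front: simple mapping can be done in place, full mapping cannot.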
Generally, I am in favor of this - I think we need to move in the direction of having an in-database option around Unicode for PG users, given how easy it is for administrators to mis-manage dependencies. Especially when OS admins can be different from DB admins, and when nobody really understands the risks of changing libs during in-place moves to new operating systems - except for like 4 of us on the mailing lists.

My biggest concern is around maintenance. Every year Unicode assigns new characters to previously unassigned code points, and those code points can of course already be stored in old databases before libs are updated. When users start to notice that the regex [[:digit:]] or upper/lower functions aren't working correctly with characters in their DB, they'll probably come asking for fixes. And we may end up with something like the timezone database, where we need to periodically add a more current ruleset - albeit shipped alongside the old one as a new version in this case.

Here are direct links to charts of newly assigned characters from the last few Unicode updates:

2022: https://www.unicode.org/charts/PDF/Unicode-15.0/
2021: https://www.unicode.org/charts/PDF/Unicode-14.0/
2020: https://www.unicode.org/charts/PDF/Unicode-13.0/
2019: https://www.unicode.org/charts/PDF/Unicode-12.0/

If I'm reading the Unicode 15 update correctly, PostgreSQL regex expressions with [[:digit:]] will not correctly identify Kaktovik or Nag Mundari or Kawi digits without that update to character type specs. If I'm reading the Unicode 12 update correctly, then upper/lower functions aren't going to work correctly on the Latin Glottal A, I and U characters without that update to character type specs. Overall I see a lot fewer Unicode updates involving upper/lower than I do involving digits - especially since new scripts often bring their own numbering characters, which makes new digits more common.
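The version skew is easy to see from any runtime that bundles its own Unicode tables. A sketch in Python (not PostgreSQL code): unicodedata reports which Unicode version the interpreter was built against, and whether the Unicode-15.0 digits below are classified as digits depends entirely on that version. The specific code points are KAWI DIGIT FIVE (U+11F55) and NAG MUNDARI DIGIT FIVE (U+1E4F5), both assigned in Unicode 15.0.

```python
import re
import unicodedata

# Which Unicode version this interpreter's tables were built from; the
# classification results below depend entirely on it. On tables older than
# 15.0 the new characters come back as "Cn" (unassigned) and fail \d.
print("Unicode version:", unicodedata.unidata_version)

for name, ch in [
    ("ASCII DIGIT FIVE", "5"),
    ("KAWI DIGIT FIVE", "\U00011F55"),          # assigned in Unicode 15.0
    ("NAG MUNDARI DIGIT FIVE", "\U0001E4F5"),   # assigned in Unicode 15.0
]:
    category = unicodedata.category(ch)       # "Nd" once the tables know it
    matches = bool(re.fullmatch(r"\d", ch))   # \d follows the same tables
    print(f"{name}: category={category}, matches \\d: {matches}")
```

Stored data doesn't care which tables the server was built with, which is exactly the lag problem: the characters can be sitting in a column long before the classification code learns what they are.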
But let's remember that people like to build indexes on expressions using functions like upper/lower, for case-insensitive searching. It's another case where the index will be corrupted if someone happened to store Latin Glottal vowels in their database and we then update libs to the latest character type rules. So even with something as basic as character type, if we're going to do it right, we still need to either version it or definitively decide that we're never going to support newly added Unicode characters like the Latin Glottals.

-Jeremy

--
http://about.me/jeremy_schneider