I have a proposal for how to support tailoring rules in ICU collations. The
ucol_openRules() function is an alternative to the ucol_open() function that
PostgreSQL calls today, but it takes the collation strength as one of its
parameters, so the locale string would need to be parsed before creating the
collator. After the collator is created using either ucol_openRules() or
ucol_open(), the ucol_setAttribute() function may be used to set individual
attributes from keyword=value pairs in the locale string, as PostgreSQL does
now, except that the strength probably can't be changed after opening the
collator with ucol_openRules(). So the logic in pg_locale.c would need to be
reorganized a little bit, but that sounds straightforward.

One simple solution would be to have the tailoring rules be specified as a new
keyword=value pair, such as colTailoringRules=<rulestring>. Since the
<rulestring> may contain single quote characters or PostgreSQL escape
characters, those would need to be escaped using PostgreSQL escape rules. If
colTailoringRules is present, colStrength would also have to be read before
opening the collator (defaulting to tertiary if it is absent), and we would
keep a local flag indicating that the colStrength keyword should not be
processed again afterwards.
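
As a minimal sketch of that pre-scan, assuming the keywords follow the usual
"@key=value;key=value" layout of an ICU locale string: every name below
(prescan_locale, TailoringPrescan, SketchStrength, find_keyword) is
hypothetical, and colTailoringRules is only the keyword proposed above, not
something ICU understands. The sketch matches keywords case-sensitively and
splits on ';', so a rule string containing ';' would need some escaping
convention that I have not worked out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef enum
{
	STRENGTH_PRIMARY, STRENGTH_SECONDARY, STRENGTH_TERTIARY,
	STRENGTH_QUATERNARY, STRENGTH_IDENTICAL
} SketchStrength;

typedef struct
{
	const char *rules;			/* points into the locale string, or NULL */
	size_t		rules_len;		/* rules are NOT NUL-terminated */
	SketchStrength strength;	/* strength to pass when opening by rules */
	bool		skip_strength_later;	/* don't reprocess colStrength */
} TailoringPrescan;

/* Find "key=value" among the ';'-separated keywords after '@'. */
static bool
find_keyword(const char *locale, const char *key,
			 const char **val, size_t *len)
{
	const char *p = strchr(locale, '@');
	size_t		keylen = strlen(key);

	if (p == NULL)
		return false;
	p++;
	while (*p != '\0')
	{
		const char *end = strchr(p, ';');
		const char *eq;

		if (end == NULL)
			end = p + strlen(p);
		eq = memchr(p, '=', (size_t) (end - p));
		if (eq != NULL && (size_t) (eq - p) == keylen &&
			strncmp(p, key, keylen) == 0)
		{
			*val = eq + 1;
			*len = (size_t) (end - (eq + 1));
			return true;
		}
		p = (*end == ';') ? end + 1 : end;
	}
	return false;
}

/* Map a colStrength value to the enum; unknown values fall back to tertiary. */
static SketchStrength
parse_strength(const char *val, size_t len)
{
	static const char *const names[] = {
		"primary", "secondary", "tertiary", "quaternary", "identical"
	};

	for (size_t i = 0; i < sizeof(names) / sizeof(names[0]); i++)
		if (strlen(names[i]) == len && strncmp(val, names[i], len) == 0)
			return (SketchStrength) i;
	return STRENGTH_TERTIARY;
}

TailoringPrescan
prescan_locale(const char *locale)
{
	TailoringPrescan r = {NULL, 0, STRENGTH_TERTIARY, false};
	const char *val;
	size_t		len;

	assert(locale != NULL);
	if (find_keyword(locale, "colTailoringRules", &val, &len))
	{
		r.rules = val;
		r.rules_len = len;
		r.skip_strength_later = true;	/* strength is consumed at open time */
		if (find_keyword(locale, "colStrength", &val, &len))
			r.strength = parse_strength(val, len);
	}
	return r;
}
```

For a locale string without colTailoringRules the result leaves rules as NULL
and the flag false, so the existing ucol_open() path and its colStrength
handling would be untouched.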

Representing the tailoring rules as just another keyword=value pair in the
locale string means that we don't need any change to the catalog to store
them. They are just part of the locale specification. I think we wouldn't even
need to bump the catversion.

Are there any tailoring rules, such as expansions and contractions, that we
should disallow? I realize that we don't handle nondeterministic collations in
LIKE or regular expression operations as of PG14, but given expr LIKE 'a%' on a
database with a UTF-8 encoding and arbitrary tailoring rules that include
expansions and contractions, is it still guaranteed that expr must sort BETWEEN
'a' AND ('a' || E'\uFFFF')?