Re: encoding affects ICU regex character classification

Jeff Davis Thu, 14 Dec 2023 07:12:50 -0800

On Tue, 2023-12-12 at 14:35 -0800, Jeremy Schneider wrote:
> Is someone able to test out upper & lower functions on U+A7BA ...
> U+A7BF
> across a few libs/versions?


Those code points are unassigned in Unicode 11.0 and assigned in
Unicode 12.0.

In ICU 63-2 (based on Unicode 11.0), they just get mapped to
themselves. In ICU 64-2 (based on Unicode 12.1) they get mapped the
same way the builtin CTYPE maps them (based on Unicode 15.1).

The concern over unassigned code points is misplaced. The application
may be aware of newly-assigned code points, and there's no way they
will be mapped correctly in Postgres if the provider is not aware of
those code points. The user can either proceed in using unassigned code
points and accept the risk of future changes, or wait for the provider
to be upgraded.

If the user doesn't have many expression indexes dependent on ctype
behavior, it doesn't matter much. If they do have such indexes, the
best we can offer is a controlled process, and the builtin provider
allows the most visibility and control.

(Aside: case mapping has very strong compatibility guarantees, but not
perfect. For better compatibility guarantees, we should support case
folding.)

> And I have no idea if or when
> glibc might have picked up the new unicode characters.

That's a strong argument in favor of a builtin provider.

Regards,
        Jeff Davis

Re: encoding affects ICU regex character classification

Reply via email to