Re: Pre-proposal: unicode normalized text

Robert Haas Mon, 02 Oct 2023 13:06:41 -0700

On Mon, Oct 2, 2023 at 3:42 PM Peter Eisentraut wrote:
> I think a better direction here would be to work toward making
> nondeterministic collations usable on the global/database level and then
> encouraging users to use those.


It seems to me that this overlooks one of the major points of Jeff's
proposal, which is that we don't reject text input that contains
unassigned code points. That decision turns out to be really painful.
Here, Jeff mentions normalization, but I think it's also a major issue with
collation support. If new code points are added, users can put them
into the database before they are known to the collation library, and
then when they become known to the collation library the sort order
changes and indexes break. Would we endorse a proposal to make
pg_catalog.text with encoding UTF-8 reject code points that aren't yet
known to the collation library? To do so would tighten things up
considerably from where they stand today, and the way things stand
today is already rigid enough to cause problems for some users. But if
we're not willing to do that then I find it easy to understand why
Jeff wants an alternative type that does.
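
To make the idea of rejecting unassigned code points concrete, here is a
minimal sketch in Python. It is not how PostgreSQL validates text today, and
it is not Jeff's proposed implementation. It assumes that Python's bundled
Unicode database (unicodedata) stands in for "what the collation library
knows", and the function names unassigned_code_points and check_input are
hypothetical. Note that general category Cn also covers noncharacters, which
a real design might want to treat separately.

```python
import unicodedata


def unassigned_code_points(text):
    """Return the code points in `text` whose general category is Cn
    (unassigned or noncharacter) in this Python build's Unicode database."""
    return [f"U+{ord(ch):04X}" for ch in text if unicodedata.category(ch) == "Cn"]


def check_input(text):
    """Hypothetical strict input check: reject text containing any code
    point that the local Unicode database does not yet know about."""
    bad = unassigned_code_points(text)
    if bad:
        raise ValueError("unassigned code points: " + ", ".join(bad))
    return text


if __name__ == "__main__":
    # The answer depends on the Unicode version the library was built with,
    # which is exactly why an upgrade can change behavior for stored data.
    print("Unicode database version:", unicodedata.unidata_version)
    print(unassigned_code_points("abc"))
    print(unassigned_code_points("a\u0378"))
```

A value accepted under a check like this contains only code points the
library already knows, so a later library version that assigns new code
points cannot change how previously stored values compare on that account.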

Now, there is still the question of whether such a data type would
properly belong in core or even contrib rather than being an
out-of-core project. It's not obvious to me that such a data type
would get enough traction that we'd want it to be part of PostgreSQL
itself. But at the same time I can certainly understand why Jeff finds
the status quo problematic.

-- 
Robert Haas
EDB: http://www.enterprisedb.com
