On Thu, 19 Jun 2025, 03:53 Jeff Davis, <pg...@j-davis.com> wrote:

> On Wed, 2025-06-18 at 19:09 +0200, Vik Fearing wrote:
> > I don't know. I am just pointing out what the Standard says. I think
> > we should either comply, or say that we don't do it for LOWER and
> > UPPER so let's keep things implementation-consistent.
>
> For the standard, I see two potential philosophies:
>
> I. CASEFOLD() is another variant of LOWER()/UPPER(), and it should
> preserve NFC in the same way.
>
> II. CASEFOLD() is not like LOWER()/UPPER(); it returns a semi-opaque
> text value that is useful for caseless matching, but should not
> ordinarily be used for display or sent to the application (those things
> would be allowed, just not encouraged). For normalization, either:
>   (A) Follow Unicode Default Caseless Matching (16.0 3.13.5 D144), and
> don't require any kind of normalization; or
>   (B) Follow Unicode Canonical Caseless Matching (D145), and require
> that the input and output are normalized appropriately, but leave the
> precise normal form as implementation-defined.
>
> The current implementation could either be seen as philosophy (I) where
> we've chosen to ignore the normalization part for the sake of
> consistency with LOWER()/UPPER(); or it could be seen as philosophy
> (II)(A).
>
> > How much does it cost to check for NFC? I honestly don't know the
> > answer to that question, but that is the only case where we need to
> > maintain normalization.
>
> I attached a very rough patch and ran a very simple test on strings
> averaging 36 bytes in length, all already in NFC and the result is also
> NFC. Before the patch, doing a CASEFOLD() on 10M tuples took about 3
> seconds, afterward about 8.
>
> There's a patch to optimize some of the normalization paths, which I
> haven't had a chance to review yet. So those numbers might come down.
>
> > It's not unconditionally, it's only if the input was NFC.
>
> Optimizing the case where the input is _not_ NFC seems strange to me.
> If we are normalizing the output, I'd say we should just make the
> output always NFC. Being more strict, this seems likely to comply with
> the eventual standard.
>
> Additionally, if we are normalizing the output, then we should also do
> the input fixup for U+0345, which would make the result usable for
> Canonical Caseless Matching. Again, this seems likely to comply with
> the eventual standard.
>
> So I only see two reasonable implementations:
>
> 1. The current CASEFOLD() implementation.
>
> 2. Do the input fixup for U+0345 and unconditionally normalize the
> output in NFC.
>
> If there's a case to be made for both implementations, we could also
> consider having two functions, say, CASEFOLD() for #1 and NCASEFOLD()
> for #2. I'm not sure whether we'd want to standardize one or both of
> those functions.
>
> And if you think there's likely to be a collision with the standard
> that's hard to anticipate and fix now, then we should consider
> reverting CASEFOLD() for 18 and wait for more progress on the
> standardization. What's the likelihood that the name changes or
> something like that?

Late to the party, but is there an argument for porting this to the
citext type? Or supplementing the extension with an additional type
("cftext"? *shrug*). It currently uses lower(), so our current
recommendation for dealing with all Unicode characters is to use
nondeterministic collations.

Thom
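
For anyone who wants to see the difference between a lower()-based comparison and the two Unicode caseless-matching definitions mentioned above, here is a minimal Python sketch. It is only an illustration under stated assumptions: Python's `str.casefold()` stands in for CASEFOLD(), `unicodedata.normalize()` stands in for the server's normalization, and the function names are made up for this example. It does not reflect how PostgreSQL or citext are implemented.

```python
import unicodedata


def lower_match(a: str, b: str) -> bool:
    # Assumption: approximates a lower()-based comparison.
    return a.lower() == b.lower()


def default_caseless_match(a: str, b: str) -> bool:
    # Default Caseless Matching (D144): compare case-folded strings,
    # with no normalization step at all.
    return a.casefold() == b.casefold()


def canonical_caseless_match(a: str, b: str) -> bool:
    # Canonical Caseless Matching (D145):
    # NFD(toCasefold(NFD(X))) == NFD(toCasefold(NFD(Y))).
    def key(s: str) -> str:
        nfd = unicodedata.normalize("NFD", s)
        return unicodedata.normalize("NFD", nfd.casefold())

    return key(a) == key(b)


if __name__ == "__main__":
    # Lowercasing leaves U+00DF alone; case folding maps it to "ss".
    print(lower_match("STRASSE", "stra\u00dfe"))
    print(default_caseless_match("STRASSE", "stra\u00dfe"))

    # Precomposed U+00C9 versus "e" followed by combining acute U+0301.
    composed, decomposed = "\u00c9", "e\u0301"
    print(default_caseless_match(composed, decomposed))
    print(canonical_caseless_match(composed, decomposed))
```

The first pair shows what a lower()-based type misses; the second shows why case folding without normalization still treats canonically equivalent strings as different.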