On Mon, 2024-06-10 at 23:55 -0700, Will Mortensen wrote: > I mentioned to Jeremy at pgConf.dev that using non-default collations > in some SQL idioms can produce undesired results, and he asked me to > send an email. An example idiom is the way Django implements > case-insensitive comparisons using "upper(x) = upper(y)" [1][2][3] , > which returns false if x = y but they have different collations that > produce different uppercase.
Hi, Thank you for the examples. There are quite a few subtleties to getting case-insensitive comparisons right, and neither LOWER() nor UPPER() get everything quite right even if the collation is the same. For instance (for almost any locale other than "C"): UPPER(LOWER(U&'\1E9E')) != UPPER(U&'\1E9E') And: LOWER(UPPER(U&'\03C2')) != LOWER(U&'\03C2') The results of UPPER() and LOWER() can also change if some language adds a new case variant in the future, which could be a problem if the results are stored somewhere. How should we document all of that? If we include too many caveats, it's just frustrating. Instead, I propose that we implement Unicode "case folding" in PG18, which solves these issues by transforming the string to a canonical form suitable for case-insensitive comparison. (In most cases, the results are the same as LOWER(), but there are exceptions specifically to avoid the problems above.) Then, we can just have a section in the docs on "case folding" to describe the right way to use it. That still leaves one caveat: the handling of dotted- and dotless-i. But one caveat is a lot easier to keep track of. Regards, Jeff Davis