Re: [18] Unintentional behavior change in commit e9931bfb75

Noah Misch Sat, 12 Apr 2025 05:34:54 -0700

On Mon, Dec 02, 2024 at 10:24:07PM -0800, Jeff Davis wrote:
> On Mon, 2024-12-02 at 17:25 -0500, Tom Lane wrote:


> > > Should I put the special case back?
> > 
> > I think so.
> 
> Done. I put the special case back in (commit e3fa2b037c) because the
> earlier commit wasn't intended to be a behavior change.

Commit e9931bf had also removed the corresponding regex special case:

> @@ -620,20 +545,6 @@ pg_wc_toupper(pg_wchar c)
>                       return c;
>               case PG_REGEX_BUILTIN:
>                       return unicode_uppercase_simple(c);
> -             case PG_REGEX_LOCALE_WIDE:
> -                     /* force C behavior for ASCII characters, per comments above */
> -                     if (c <= (pg_wchar) 127)
> -                             return pg_ascii_toupper((unsigned char) c);
> -                     if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
> -                             return towupper((wint_t) c);

The "comments above" still exist:

 * 2. In the "default" collation (which is supposed to obey LC_CTYPE):
 *
 * 2a. When working in UTF8 encoding, we use the <wctype.h> functions.
 * This assumes that every platform uses Unicode codepoints directly
 * as the wchar_t representation of Unicode.  On some platforms
 * wchar_t is only 16 bits wide, so we have to punt for codepoints > 0xFFFF.
 *
 * 2b. In all other encodings, we use the <ctype.h> functions for pg_wchar
 * values up to 255, and punt for values above that.  This is 100% correct
 * only in single-byte encodings such as LATINn.  However, non-Unicode
 * multibyte encodings are mostly Far Eastern character sets for which the
 * properties being tested here aren't very relevant for higher code values
 * anyway.  The difficulty with using the <wctype.h> functions with
 * non-Unicode multibyte encodings is that we can have no certainty that
 * the platform's wchar_t representation matches what we do in pg_wchar
 * conversions.
 *
 * 3. Here, we use the locale_t-extended forms of the <wctype.h> and <ctype.h>
 * functions, under exactly the same cases as #2.
 *
 * There is one notable difference between cases 2 and 3: in the "default"
 * collation we force ASCII letters to follow ASCII upcase/downcase rules,
 * while in a non-default collation we just let the library functions do what
 * they will.  The case where this matters is treatment of I/i in Turkish,
 * and the behavior is meant to match the upper()/lower() SQL functions.

I think neither the code for (2) nor the "I/i in Turkish" special case has
returned.  Given that commit e3fa2b0 restored the v17 "I/i in Turkish"
treatment for plain lower(), the regex code likely needs a similar
restoration.  If not, the regex comments would need to change to match the code.
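
For illustration only, here is a minimal standalone sketch of the behavior the
comments describe for the UTF8 case: ASCII is forced to C rules in the
"default" collation, and everything else goes through <wctype.h> with a punt
when wchar_t is too narrow.  This is not the PostgreSQL code.  The names
sketch_wchar, sketch_ascii_toupper, sketch_wc_toupper and the is_default flag
are hypothetical stand-ins, and the sketch uses plain towupper() on the global
locale where case 3 would use the locale_t-extended form.

```c
#include <stdbool.h>
#include <stdint.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical stand-in for pg_wchar; an assumption, not the real typedef. */
typedef uint32_t sketch_wchar;

/*
 * Locale-independent ASCII upcasing, standing in for pg_ascii_toupper().
 * Assumes an ASCII-compatible execution character set.
 */
static unsigned char
sketch_ascii_toupper(unsigned char ch)
{
	if (ch >= 'a' && ch <= 'z')
		ch = (unsigned char) (ch - 'a' + 'A');
	return ch;
}

/*
 * Upcase one code point under a libc-backed locale in UTF8.
 *
 * is_default is a hypothetical flag: true models the "default" collation
 * (case 2 in the comments), false models a non-default collation (case 3).
 * For simplicity both paths consult the global locale via towupper().
 */
sketch_wchar
sketch_wc_toupper(sketch_wchar c, bool is_default)
{
	/* Case 2 only: force C behavior for ASCII, so 'i' maps to 'I'. */
	if (is_default && c <= (sketch_wchar) 127)
		return sketch_ascii_toupper((unsigned char) c);

	/* 2a: use <wctype.h>, unless wchar_t cannot hold the code point. */
	if (sizeof(wchar_t) >= 4 || c <= (sketch_wchar) 0xFFFF)
		return (sketch_wchar) towupper((wint_t) c);

	/* Punt: return the code point unchanged. */
	return c;
}
```

With is_default true, the result for 'i' no longer depends on LC_CTYPE, which
is the property the removed PG_REGEX_LOCALE_WIDE branch provided.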
