Re: Remaining dependency on setlocale()

Chao Li Tue, 25 Nov 2025 17:51:55 -0800

> On Nov 26, 2025, at 01:20, Jeff Davis <[email protected]> wrote:
> 
> 
> New series attached with only these changes and a rebase.
> 
> Regards,
> Jeff Davis
> 
> <v10-0001-Inline-pg_ascii_tolower-and-pg_ascii_toupper.patch><v10-0002-Add-define-for-UNICODE_CASEMAP_BUFSZ.patch><v10-0003-Change-some-callers-to-use-pg_ascii_toupper.patch><v10-0004-Allow-pg_locale_t-APIs-to-work-when-ctype_is_c.patch><v10-0005-Make-regex-max_chr-depend-on-encoding-not-provid.patch><v10-0006-Fix-inconsistency-between-ltree_strncasecmp-and-.patch><v10-0007-Remove-char_tolower-API.patch><v10-0008-Use-multibyte-aware-extraction-of-pattern-prefix.patch><v10-0009-fuzzystrmatch-use-pg_ascii_toupper.patch><v10-0010-downcase_identifier-use-method-table-from-locale.patch><v10-0011-Control-LC_COLLATE-with-GUC.patch>

I continued reviewing 0005-0008.

0005 - no comment. The change looks correct to me. Before the patch, 
pg_regex_locale->ctype->max_chr <= MAX_SIMPLE_CHR should be the more common 
cases, with the patch, MAX_SIMPLE_CHR >= UCHAR_MAX should be the more common 
cases, thus there should be not a behavior change.

5 - 0006
```
@@ -77,10 +77,37 @@ compare_subnode(ltree_level *t, char *qn, int len, int 
(*cmpptr) (const char *,
 int
 ltree_strncasecmp(const char *a, const char *b, size_t s)
 {
-       char       *al = str_tolower(a, s, DEFAULT_COLLATION_OID);
-       char       *bl = str_tolower(b, s, DEFAULT_COLLATION_OID);
+       static pg_locale_t locale = NULL;
+       size_t          al_sz = s + 1;
+       char       *al = palloc(al_sz);
+       size_t          bl_sz = s + 1;
+       char       *bl = palloc(bl_sz);
+       size_t          needed;
        int                     res;
 
+       if (!locale)
+               locale = pg_database_locale();
+
+       needed = pg_strfold(al, al_sz, a, s, locale);
+       if (needed + 1 > al_sz)
+       {
+               /* grow buffer if needed and retry */
+               al_sz = needed + 1;
+               al = repalloc(al, al_sz);
+               needed = pg_strfold(al, al_sz, a, s, locale);
+               Assert(needed + 1 <= al_sz);
+       }
+
+       needed = pg_strfold(bl, bl_sz, b, s, locale);
+       if (needed + 1 > bl_sz)
+       {
+               /* grow buffer if needed and retry */
+               bl_sz = needed + 1;
+               bl = repalloc(bl, bl_sz);
+               needed = pg_strfold(bl, bl_sz, b, s, locale);
+               Assert(needed + 1 <= bl_sz);
+       }
+
        res = strncmp(al, bl, s);
 
        pfree(al);
```

I do think the new implementation has some problem.

* The retry logic implies that a single-byte char may become multiple bytes 
after folding, otherwise retry is not needed because you have allocated s+1 
bytes for dest buffers. From this perspective, we should use two needed 
variables: neededA and neededB, if neededA != neededB, then the two strings are 
different; if neededA == neededB, then we should be perform strncmp, but here 
we should pass neededA (or neededB as they are identical) to strncmp(al, bl, 
neededA).

* Based on the logic you implemented in 0004, first pg_strfold() has copied as 
many chars as possible to dest buffer, so when retry, ideally we can should 
resume instead of start over. However, if single-byte->multi-byte folding 
happens, we have no information to decide from where to resume. From this 
perspective, in 0004, do we really need to take the try-the-best strategy for 
strlower_c()? If there are some other use cases that require data to be placed 
in dest buffer even if dest buffer doesn’t have enough space, then my patch [1] 
of changing strlower_libc_sb() should be considered.

6 - 0007
```
        /*
-        * For efficiency reasons, in the single byte case we don't call lower()
-        * on the pattern and text, but instead call SB_lower_char on each
-        * character.  In the multi-byte case we don't have much choice :-(. 
Also,
-        * ICU does not support single-character case folding, so we go the long
-        * way.
+        * For efficiency reasons, in the C locale we don't call lower() on the
+        * pattern and text, but instead call SB_lower_char on each character.
         */
```

SB_lower_char should be changed to C_IMatchText.

7 - 0007
```
/* setup to compile like_match.c for single byte case insensitive matches */
-#define MATCH_LOWER(t, locale) SB_lower_char((unsigned char) (t), locale)
+#define MATCH_LOWER
 ```

I think the comment should be updated accordingly, like “for ILIKE in the C 
locale”.


8 - 0008
```
+       /* for text types, use pg_wchar; for BYTEA, use char */
        if (typeid != BYTEAOID)
        {
-               patt = TextDatumGetCString(patt_const->constvalue);
-               pattlen = strlen(patt);
+               text       *val = DatumGetTextPP(patt_const->constvalue);
+               pg_wchar   *wpatt;
+               pg_wchar   *wmatch;
+               char       *match;
+
+               patt = VARDATA_ANY(val);
+               pattlen = VARSIZE_ANY_EXHDR(val);
+               wpatt = palloc((pattlen + 1) * sizeof(pg_wchar));
+               wmatch = palloc((pattlen + 1) * sizeof(pg_wchar));
+               pg_mb2wchar_with_len(patt, wpatt, pattlen);
+
+               match = palloc(pattlen + 1);
```

* match is allocated with pattlen+1 bytes, is it long enough to hold pattlen 
multiple-byte chars?

* match is not freed, but looks like it should be: 

*prefix_const = string_to_const(match, typeid);
 -> string_to_datum(str, datatype);
 -> CStringGetTextDatum(str);
 -> cstring_to_text(s)
 -> cstring_to_text_with_len(s, strlen(s));
 -> *result = (text *) palloc(len + VARHDRSZ);

So, match has been copied, it should be freed.

9 - 0008
```
-       }
+                       /* Backslash escapes the next character */
+                       if (patt[pos] == '\\')
+                       {
+                               pos++;
+                               if (pos >= pattlen)
+                                       break;
+                       }
 
-       match[match_pos] = '\0';
+                       match[match_pos++] = pos;
+               }
```

Should “pos” be “part[pos]” assigning to match[match_pos++]?

I will review the rest 3 commits tomorrow.

[1] 
https://postgr.es/m/CAEoWx2m9mUN397neL=p9x0vavcj5egikd53f1mntwtdxizx...@mail.gmail.com

--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
Re: Remaining dependency on setlocale()

Reply via email to