On Fri, 2025-12-05 at 16:01 +0100, Peter Eisentraut wrote:
> v11-0003-Fix-inconsistency-between-ltree_strncasecmp-and-.patch
>
> The function comment reads "Check if b has a prefix of a." -- Is that
> the same as "Check if a is a prefix of b."? The latter might be
> clearer.
Yes, fixed.
Note: I separated this into two patches. 0003 fixes the multibyte
mishandling issue, and 0004 consistently performs case folding. 0003 is
backpatchable, I believe.
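To illustrate the hazard that 0003 addresses, here is a minimal sketch (not code from the patch). It assumes UTF-8, and the helper name utf8_is_char_boundary() is hypothetical. It shows why truncating the longer input to the shorter input's byte length can land in the middle of a multibyte sequence:

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper, assuming UTF-8: report whether byte offset "off" in
 * the string s (of length len bytes) falls on a character boundary.
 * Continuation bytes in UTF-8 always have the bit pattern 10xxxxxx.
 */
static bool
utf8_is_char_boundary(const char *s, size_t len, size_t off)
{
	if (off > len)
		return false;
	if (off == 0 || off == len)
		return true;
	return (((unsigned char) s[off]) & 0xC0) != 0x80;
}
```

For the three-byte string "a\xC3\xA9" ("a" followed by a two-byte character), cutting at offset 2 splits the second character, which is the kind of truncation the old single-length API could perform.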
> but the patch removes SB_lower_char().
Fixed and committed.
> v11-0006-Use-multibyte-aware-extraction-of-pattern-prefix.patch
>
> Might have a small typo in the commit message:
>
> ; and preserve and char-at-a-time logic for bytea.
Fixed.
I also changed it into two functions: like_fixed_prefix(), which is
almost unchanged from the original; and like_fixed_prefix_ci(), which
is multibyte and locale-aware. It was too confusing to have single-byte
and multi-byte logic in the same function, and they didn't share much
code anyway.
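As a rough standalone sketch of the case-insensitive scan (not the patch itself): it uses standard C wide-character functions in place of pg_mb2wchar_with_len() and pg_iswcased(), and the names is_cased() and ilike_fixed_prefix() are made up for illustration. The idea is to walk the pattern one character at a time and stop at the first wildcard or cased character:

```c
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical stand-in for pg_iswcased(): does the character have case? */
static int
is_cased(wchar_t wc)
{
	return iswupper((wint_t) wc) || iswlower((wint_t) wc);
}

/*
 * Copy the fixed prefix of an ILIKE pattern into "out" (which must have room
 * for pattlen + 1 elements) and return its length in characters.
 * *consumed receives the number of pattern characters scanned; it is larger
 * than the return value whenever a backslash escape was skipped.
 */
static size_t
ilike_fixed_prefix(const wchar_t *patt, size_t pattlen,
				   wchar_t *out, size_t *consumed)
{
	size_t		pos;
	size_t		n = 0;

	for (pos = 0; pos < pattlen; pos++)
	{
		/* % and _ are wildcards in LIKE */
		if (patt[pos] == L'%' || patt[pos] == L'_')
			break;

		/* backslash escapes the next character */
		if (patt[pos] == L'\\')
		{
			pos++;
			if (pos >= pattlen)
				break;
		}

		/* a cased character matches more than one input, so stop here */
		if (is_cased(patt[pos]))
			break;

		out[n++] = patt[pos];
	}

	out[n] = L'\0';
	*consumed = pos;
	return n;
}
```

Note that the prefix length and the number of pattern characters consumed differ once an escape has been seen, so the remainder of the pattern starts at the consumed position rather than at the prefix length.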
> case '\xc7': /* C with cedilla */
>
> so the premise that "fuzzystrmatch is designed for ASCII" does not
> appear to be correct. Needs more analysis.
>
> (But apparently it's not multibyte aware at all, so I don't know what
> to do about that.)
I didn't notice that, thank you. Agreed, we need a bit more discussion
around this case as well as soundex().
> v11-0008-downcase_identifier-use-method-table-from-locale.patch
>
> I'm confused here about the name of the function pg_strfold_ident().
> In general, case "folding" results in an opaque string that is really
> only useful for comparing against other case-folded strings. But for
> identifiers we are actually interested in lower-casing. I think this
> should be corrected in the API naming.
Agreed and fixed.
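For reference, a minimal sketch of the plain-ASCII, length-preserving downcasing that identifiers fall back to when a provider supplies no method of its own. This is an illustration only; ascii_downcase_ident() is a hypothetical name, not the function in the patch:

```c
#include <stddef.h>

/*
 * Hypothetical illustration: lowercase A..Z only and copy every other byte
 * unchanged, so the result always has the same length as the input. Returns
 * the length needed (excluding the terminator), like the strlower-style
 * APIs, and NUL-terminates only if there is room.
 */
static size_t
ascii_downcase_ident(char *dst, size_t dstsize, const char *src, size_t srclen)
{
	size_t		i;

	for (i = 0; i < srclen && i < dstsize; i++)
	{
		unsigned char ch = (unsigned char) src[i];

		if (ch >= 'A' && ch <= 'Z')
			ch += 'a' - 'A';
		dst[i] = (char) ch;
	}

	if (i < dstsize)
		dst[i] = '\0';

	return srclen;
}
```

Unlike case folding, the output here is still a readable identifier, and bytes with the high bit set pass through untouched.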
Also, I added 0006, which saves a locale_t object for ICU in this one
case where it's required. Surely that's not what we want in the long
term, but we don't have the infrastructure for decoding pg_wchar into
code points yet, and 0006 avoids the dependency on the global LC_CTYPE
setting.
> v11-0009-Control-LC_COLLATE-with-GUC.patch
>
> I know there were some complaints about compatibility with
> extensions, but I don't think anything concrete was presented. I
> would like to see more evidence that we need this.
>
> Also, recall that we used to have a lc_collate GUC, and in the end
> people got confused that it didn't actually show a meaningful value
> when you used ICU. So we removed that. It seems adding this back in
> would create a similar kind of confusion. So to avoid that, maybe this
> should be called fallback_lc_collate or something like that.
Yes, this is a POC patch and needs more discussion.
What are your thoughts about a similar lc_ctype GUC, though? That has
slightly different trade-offs.
I believe v12 0001-0005 are about ready for commit, and 0003 should be
backported.
Regards,
Jeff Davis
From 779205c112bfbd1f89fc0edd9a4d7b932d21e15e Mon Sep 17 00:00:00 2001
From: Jeff Davis <[email protected]>
Date: Fri, 12 Dec 2025 09:44:37 -0800
Subject: [PATCH v12 1/8] Use multibyte-aware extraction of pattern prefixes.
Previously, like_fixed_prefix() used char-at-a-time logic, which
forced it to be too conservative for case-insensitive matching.
Introduce like_fixed_prefix_ci(), and use that for case-insensitive
pattern prefixes. It uses multibyte and locale-aware logic, along with
the new pg_iswcased() API introduced in 630706ced0.
Reviewed-by: Chao Li <[email protected]>
Reviewed-by: Peter Eisentraut <[email protected]>
Discussion: https://postgr.es/m/[email protected]
---
src/backend/utils/adt/like_support.c | 169 ++++++++++++++++++---------
1 file changed, 112 insertions(+), 57 deletions(-)
diff --git a/src/backend/utils/adt/like_support.c b/src/backend/utils/adt/like_support.c
index dca1d9be035..007dd5b5a01 100644
--- a/src/backend/utils/adt/like_support.c
+++ b/src/backend/utils/adt/like_support.c
@@ -99,8 +99,6 @@ static Selectivity like_selectivity(const char *patt, int pattlen,
static Selectivity regex_selectivity(const char *patt, int pattlen,
bool case_insensitive,
int fixed_prefix_len);
-static int pattern_char_isalpha(char c, bool is_multibyte,
- pg_locale_t locale);
static Const *make_greater_string(const Const *str_const, FmgrInfo *ltproc,
Oid collation);
static Datum string_to_datum(const char *str, Oid datatype);
@@ -986,8 +984,8 @@ icnlikejoinsel(PG_FUNCTION_ARGS)
*/
static Pattern_Prefix_Status
-like_fixed_prefix(Const *patt_const, bool case_insensitive, Oid collation,
- Const **prefix_const, Selectivity *rest_selec)
+like_fixed_prefix(Const *patt_const, Const **prefix_const,
+ Selectivity *rest_selec)
{
char *match;
char *patt;
@@ -995,34 +993,10 @@ like_fixed_prefix(Const *patt_const, bool case_insensitive, Oid collation,
Oid typeid = patt_const->consttype;
int pos,
match_pos;
- bool is_multibyte = (pg_database_encoding_max_length() > 1);
- pg_locale_t locale = 0;
/* the right-hand const is type text or bytea */
Assert(typeid == BYTEAOID || typeid == TEXTOID);
- if (case_insensitive)
- {
- if (typeid == BYTEAOID)
- ereport(ERROR,
- (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("case insensitive matching not supported on type bytea")));
-
- if (!OidIsValid(collation))
- {
- /*
- * This typically means that the parser could not resolve a
- * conflict of implicit collations, so report it that way.
- */
- ereport(ERROR,
- (errcode(ERRCODE_INDETERMINATE_COLLATION),
- errmsg("could not determine which collation to use for ILIKE"),
- errhint("Use the COLLATE clause to set the collation explicitly.")));
- }
-
- locale = pg_newlocale_from_collation(collation);
- }
-
if (typeid != BYTEAOID)
{
patt = TextDatumGetCString(patt_const->constvalue);
@@ -1055,11 +1029,6 @@ like_fixed_prefix(Const *patt_const, bool case_insensitive, Oid collation,
break;
}
- /* Stop if case-varying character (it's sort of a wildcard) */
- if (case_insensitive &&
- pattern_char_isalpha(patt[pos], is_multibyte, locale))
- break;
-
match[match_pos++] = patt[pos];
}
@@ -1071,8 +1040,7 @@ like_fixed_prefix(Const *patt_const, bool case_insensitive, Oid collation,
*prefix_const = string_to_bytea_const(match, match_pos);
if (rest_selec != NULL)
- *rest_selec = like_selectivity(&patt[pos], pattlen - pos,
- case_insensitive);
+ *rest_selec = like_selectivity(&patt[pos], pattlen - pos, false);
pfree(patt);
pfree(match);
@@ -1087,6 +1055,112 @@ like_fixed_prefix(Const *patt_const, bool case_insensitive, Oid collation,
return Pattern_Prefix_None;
}
+/*
+ * Case-insensitive variant of like_fixed_prefix(). Multibyte and
+ * locale-aware for detecting cased characters.
+ */
+static Pattern_Prefix_Status
+like_fixed_prefix_ci(Const *patt_const, Oid collation, Const **prefix_const,
+ Selectivity *rest_selec)
+{
+ text *val = DatumGetTextPP(patt_const->constvalue);
+ Oid typeid = patt_const->consttype;
+ int nbytes = VARSIZE_ANY_EXHDR(val);
+ int wpos;
+ pg_wchar *wpatt;
+ int wpattlen;
+ pg_wchar *wmatch;
+ int wmatch_pos = 0;
+ char *match;
+ int match_mblen pg_attribute_unused();
+ pg_locale_t locale = 0;
+
+ /* the right-hand const is type text or bytea */
+ Assert(typeid == BYTEAOID || typeid == TEXTOID);
+
+ if (typeid == BYTEAOID)
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("case insensitive matching not supported on type bytea")));
+
+ if (!OidIsValid(collation))
+ {
+ /*
+ * This typically means that the parser could not resolve a conflict
+ * of implicit collations, so report it that way.
+ */
+ ereport(ERROR,
+ (errcode(ERRCODE_INDETERMINATE_COLLATION),
+ errmsg("could not determine which collation to use for ILIKE"),
+ errhint("Use the COLLATE clause to set the collation explicitly.")));
+ }
+
+ locale = pg_newlocale_from_collation(collation);
+
+ wpatt = palloc((nbytes + 1) * sizeof(pg_wchar));
+ wpattlen = pg_mb2wchar_with_len(VARDATA_ANY(val), wpatt, nbytes);
+
+ wmatch = palloc((nbytes + 1) * sizeof(pg_wchar));
+ for (wpos = 0; wpos < wpattlen; wpos++)
+ {
+ /* % and _ are wildcard characters in LIKE */
+ if (wpatt[wpos] == '%' ||
+ wpatt[wpos] == '_')
+ break;
+
+ /* Backslash escapes the next character */
+ if (wpatt[wpos] == '\\')
+ {
+ wpos++;
+ if (wpos >= wpattlen)
+ break;
+ }
+
+ /*
+ * For ILIKE, stop if it's a case-varying character (it's sort of a
+ * wildcard).
+ */
+ if (pg_iswcased(wpatt[wpos], locale))
+ break;
+
+ wmatch[wmatch_pos++] = wpatt[wpos];
+ }
+
+ wmatch[wmatch_pos] = '\0';
+
+ match = palloc(pg_database_encoding_max_length() * wmatch_pos + 1);
+ match_mblen = pg_wchar2mb_with_len(wmatch, match, wmatch_pos);
+ match[match_mblen] = '\0';
+ pfree(wmatch);
+
+ *prefix_const = string_to_const(match, TEXTOID);
+ pfree(match);
+
+ if (rest_selec != NULL)
+ {
+ int wrestlen = wpattlen - wpos;
+ char *rest;
+ int rest_mblen;
+
+ rest = palloc(pg_database_encoding_max_length() * wrestlen + 1);
+ rest_mblen = pg_wchar2mb_with_len(&wpatt[wpos], rest, wrestlen);
+
+ *rest_selec = like_selectivity(rest, rest_mblen, true);
+ pfree(rest);
+ }
+
+ pfree(wpatt);
+
+ /* in LIKE, an empty pattern is an exact match! */
+ if (wpos == wpattlen)
+ return Pattern_Prefix_Exact; /* reached end of pattern, so exact */
+
+ if (wmatch_pos > 0)
+ return Pattern_Prefix_Partial;
+
+ return Pattern_Prefix_None;
+}
+
static Pattern_Prefix_Status
regex_fixed_prefix(Const *patt_const, bool case_insensitive, Oid collation,
Const **prefix_const, Selectivity *rest_selec)
@@ -1164,12 +1238,11 @@ pattern_fixed_prefix(Const *patt, Pattern_Type ptype, Oid collation,
switch (ptype)
{
case Pattern_Type_Like:
- result = like_fixed_prefix(patt, false, collation,
- prefix, rest_selec);
+ result = like_fixed_prefix(patt, prefix, rest_selec);
break;
case Pattern_Type_Like_IC:
- result = like_fixed_prefix(patt, true, collation,
- prefix, rest_selec);
+ result = like_fixed_prefix_ci(patt, collation, prefix,
+ rest_selec);
break;
case Pattern_Type_Regex:
result = regex_fixed_prefix(patt, false, collation,
@@ -1481,24 +1554,6 @@ regex_selectivity(const char *patt, int pattlen, bool case_insensitive,
return sel;
}
-/*
- * Check whether char is a letter (and, hence, subject to case-folding)
- *
- * In multibyte character sets or with ICU, we can't use isalpha, and it does
- * not seem worth trying to convert to wchar_t to use iswalpha or u_isalpha.
- * Instead, just assume any non-ASCII char is potentially case-varying, and
- * hard-wire knowledge of which ASCII chars are letters.
- */
-static int
-pattern_char_isalpha(char c, bool is_multibyte,
- pg_locale_t locale)
-{
- if (locale->ctype_is_c)
- return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
- else
- return char_is_cased(c, locale);
-}
-
/*
* For bytea, the increment function need only increment the current byte
--
2.43.0
From 48620dadcfeec2880575c963441bb1dd017802f0 Mon Sep 17 00:00:00 2001
From: Jeff Davis <[email protected]>
Date: Fri, 12 Dec 2025 09:44:59 -0800
Subject: [PATCH v12 2/8] Remove unused single-byte char_is_cased() API.
Discussion: https://postgr.es/m/[email protected]
---
src/backend/utils/adt/pg_locale.c | 15 ---------------
src/backend/utils/adt/pg_locale_builtin.c | 8 --------
src/backend/utils/adt/pg_locale_icu.c | 8 --------
src/backend/utils/adt/pg_locale_libc.c | 14 --------------
src/include/utils/pg_locale.h | 3 ---
5 files changed, 48 deletions(-)
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 70933ee3843..8a3796aa5d0 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -1625,21 +1625,6 @@ pg_towlower(pg_wchar wc, pg_locale_t locale)
return locale->ctype->wc_tolower(wc, locale);
}
-/*
- * char_is_cased()
- *
- * Fuzzy test of whether the given char is case-varying or not. The argument
- * is a single byte, so in a multibyte encoding, just assume any non-ASCII
- * char is case-varying.
- */
-bool
-char_is_cased(char ch, pg_locale_t locale)
-{
- if (locale->ctype == NULL)
- return (ch >= 'A' && ch <= 'Z') || (ch >= 'a' && ch <= 'z');
- return locale->ctype->char_is_cased(ch, locale);
-}
-
/*
* Return required encoding ID for the given locale, or -1 if any encoding is
* valid for the locale.
diff --git a/src/backend/utils/adt/pg_locale_builtin.c b/src/backend/utils/adt/pg_locale_builtin.c
index 0d4c754a267..0c2920112bb 100644
--- a/src/backend/utils/adt/pg_locale_builtin.c
+++ b/src/backend/utils/adt/pg_locale_builtin.c
@@ -191,13 +191,6 @@ wc_iscased_builtin(pg_wchar wc, pg_locale_t locale)
return pg_u_prop_cased(to_char32(wc));
}
-static bool
-char_is_cased_builtin(char ch, pg_locale_t locale)
-{
- return IS_HIGHBIT_SET(ch) ||
- (ch >= 'A' && ch <= 'Z') || (ch >= 'a' && ch <= 'z');
-}
-
static pg_wchar
wc_toupper_builtin(pg_wchar wc, pg_locale_t locale)
{
@@ -225,7 +218,6 @@ static const struct ctype_methods ctype_methods_builtin = {
.wc_ispunct = wc_ispunct_builtin,
.wc_isspace = wc_isspace_builtin,
.wc_isxdigit = wc_isxdigit_builtin,
- .char_is_cased = char_is_cased_builtin,
.wc_iscased = wc_iscased_builtin,
.wc_tolower = wc_tolower_builtin,
.wc_toupper = wc_toupper_builtin,
diff --git a/src/backend/utils/adt/pg_locale_icu.c b/src/backend/utils/adt/pg_locale_icu.c
index e8820666b2d..18d026deda8 100644
--- a/src/backend/utils/adt/pg_locale_icu.c
+++ b/src/backend/utils/adt/pg_locale_icu.c
@@ -121,13 +121,6 @@ static int32_t u_strFoldCase_default(UChar *dest, int32_t destCapacity,
const char *locale,
UErrorCode *pErrorCode);
-static bool
-char_is_cased_icu(char ch, pg_locale_t locale)
-{
- return IS_HIGHBIT_SET(ch) ||
- (ch >= 'A' && ch <= 'Z') || (ch >= 'a' && ch <= 'z');
-}
-
/*
* XXX: many of the functions below rely on casts directly from pg_wchar to
* UChar32, which is correct for the UTF-8 encoding, but not in general.
@@ -244,7 +237,6 @@ static const struct ctype_methods ctype_methods_icu = {
.wc_ispunct = wc_ispunct_icu,
.wc_isspace = wc_isspace_icu,
.wc_isxdigit = wc_isxdigit_icu,
- .char_is_cased = char_is_cased_icu,
.wc_iscased = wc_iscased_icu,
.wc_toupper = toupper_icu,
.wc_tolower = tolower_icu,
diff --git a/src/backend/utils/adt/pg_locale_libc.c b/src/backend/utils/adt/pg_locale_libc.c
index 3d841f818a5..3baa5816b5f 100644
--- a/src/backend/utils/adt/pg_locale_libc.c
+++ b/src/backend/utils/adt/pg_locale_libc.c
@@ -262,17 +262,6 @@ wc_iscased_libc_mb(pg_wchar wc, pg_locale_t locale)
iswlower_l((wint_t) wc, locale->lt);
}
-static bool
-char_is_cased_libc(char ch, pg_locale_t locale)
-{
- bool is_multibyte = pg_database_encoding_max_length() > 1;
-
- if (is_multibyte && IS_HIGHBIT_SET(ch))
- return true;
- else
- return isalpha_l((unsigned char) ch, locale->lt);
-}
-
static pg_wchar
toupper_libc_sb(pg_wchar wc, pg_locale_t locale)
{
@@ -345,7 +334,6 @@ static const struct ctype_methods ctype_methods_libc_sb = {
.wc_ispunct = wc_ispunct_libc_sb,
.wc_isspace = wc_isspace_libc_sb,
.wc_isxdigit = wc_isxdigit_libc_sb,
- .char_is_cased = char_is_cased_libc,
.wc_iscased = wc_iscased_libc_sb,
.wc_toupper = toupper_libc_sb,
.wc_tolower = tolower_libc_sb,
@@ -371,7 +359,6 @@ static const struct ctype_methods ctype_methods_libc_other_mb = {
.wc_ispunct = wc_ispunct_libc_sb,
.wc_isspace = wc_isspace_libc_sb,
.wc_isxdigit = wc_isxdigit_libc_sb,
- .char_is_cased = char_is_cased_libc,
.wc_iscased = wc_iscased_libc_sb,
.wc_toupper = toupper_libc_sb,
.wc_tolower = tolower_libc_sb,
@@ -393,7 +380,6 @@ static const struct ctype_methods ctype_methods_libc_utf8 = {
.wc_ispunct = wc_ispunct_libc_mb,
.wc_isspace = wc_isspace_libc_mb,
.wc_isxdigit = wc_isxdigit_libc_mb,
- .char_is_cased = char_is_cased_libc,
.wc_iscased = wc_iscased_libc_mb,
.wc_toupper = toupper_libc_mb,
.wc_tolower = tolower_libc_mb,
diff --git a/src/include/utils/pg_locale.h b/src/include/utils/pg_locale.h
index 832007385d8..01f891def7a 100644
--- a/src/include/utils/pg_locale.h
+++ b/src/include/utils/pg_locale.h
@@ -125,9 +125,6 @@ struct ctype_methods
bool (*wc_iscased) (pg_wchar wc, pg_locale_t locale);
pg_wchar (*wc_toupper) (pg_wchar wc, pg_locale_t locale);
pg_wchar (*wc_tolower) (pg_wchar wc, pg_locale_t locale);
-
- /* required */
- bool (*char_is_cased) (char ch, pg_locale_t locale);
};
/*
--
2.43.0
From f6923421824c4cdefb83b781a234ebfd562b86ed Mon Sep 17 00:00:00 2001
From: Jeff Davis <[email protected]>
Date: Thu, 4 Dec 2025 10:37:56 -0800
Subject: [PATCH v12 3/8] Fix multibyte issue in ltree_strncasecmp().
The API for ltree_strncasecmp() took two inputs but only one length
(that of the smaller input). It truncated the larger input to that
length, but that could break a multibyte sequence.
Refactor and rename to be a check for prefix equality (possibly
case-insensitive) instead, which is all that's needed by the
callers. Also, provide the lengths of both inputs.
Reviewed-by: Chao Li <[email protected]>
Reviewed-by: Peter Eisentraut <[email protected]>
Discussion: https://postgr.es/m/[email protected]
Backpatch-through: 14
---
contrib/ltree/lquery_op.c | 41 +++++++++++++++++++++++++-----------
contrib/ltree/ltree.h | 6 ++++--
contrib/ltree/ltxtquery_op.c | 8 +++----
3 files changed, 37 insertions(+), 18 deletions(-)
diff --git a/contrib/ltree/lquery_op.c b/contrib/ltree/lquery_op.c
index a6466f575fd..d58e34769b8 100644
--- a/contrib/ltree/lquery_op.c
+++ b/contrib/ltree/lquery_op.c
@@ -41,7 +41,9 @@ getlexeme(char *start, char *end, int *len)
}
bool
-compare_subnode(ltree_level *t, char *qn, int len, int (*cmpptr) (const char *, const char *, size_t), bool anyend)
+compare_subnode(ltree_level *t, char *qn, int len,
+ bool (*prefix_eq) (const char *, size_t, const char *, size_t),
+ bool anyend)
{
char *endt = t->name + t->len;
char *endq = qn + len;
@@ -57,7 +59,7 @@ compare_subnode(ltree_level *t, char *qn, int len, int (*cmpptr) (const char *,
while ((tn = getlexeme(tn, endt, &lent)) != NULL)
{
if ((lent == lenq || (lent > lenq && anyend)) &&
- (*cmpptr) (qn, tn, lenq) == 0)
+ (*prefix_eq) (qn, lenq, tn, lent))
{
isok = true;
@@ -74,14 +76,29 @@ compare_subnode(ltree_level *t, char *qn, int len, int (*cmpptr) (const char *,
return true;
}
-int
-ltree_strncasecmp(const char *a, const char *b, size_t s)
+/*
+ * Check if 'a' is a prefix of 'b'.
+ */
+bool
+ltree_prefix_eq(const char *a, size_t a_sz, const char *b, size_t b_sz)
+{
+ if (a_sz > b_sz)
+ return false;
+ else
+ return (strncmp(a, b, a_sz) == 0);
+}
+
+/*
+ * Case-insensitive check if 'a' is a prefix of 'b'.
+ */
+bool
+ltree_prefix_eq_ci(const char *a, size_t a_sz, const char *b, size_t b_sz)
{
- char *al = str_tolower(a, s, DEFAULT_COLLATION_OID);
- char *bl = str_tolower(b, s, DEFAULT_COLLATION_OID);
- int res;
+ char *al = str_tolower(a, a_sz, DEFAULT_COLLATION_OID);
+ char *bl = str_tolower(b, b_sz, DEFAULT_COLLATION_OID);
+ bool res;
- res = strncmp(al, bl, s);
+ res = (strncmp(al, bl, a_sz) == 0);
pfree(al);
pfree(bl);
@@ -109,19 +126,19 @@ checkLevel(lquery_level *curq, ltree_level *curt)
for (int i = 0; i < curq->numvar; i++)
{
- int (*cmpptr) (const char *, const char *, size_t);
+ bool (*prefix_eq) (const char *, size_t, const char *, size_t);
- cmpptr = (curvar->flag & LVAR_INCASE) ? ltree_strncasecmp : strncmp;
+ prefix_eq = (curvar->flag & LVAR_INCASE) ? ltree_prefix_eq_ci : ltree_prefix_eq;
if (curvar->flag & LVAR_SUBLEXEME)
{
- if (compare_subnode(curt, curvar->name, curvar->len, cmpptr,
+ if (compare_subnode(curt, curvar->name, curvar->len, prefix_eq,
(curvar->flag & LVAR_ANYEND)))
return success;
}
else if ((curvar->len == curt->len ||
(curt->len > curvar->len && (curvar->flag & LVAR_ANYEND))) &&
- (*cmpptr) (curvar->name, curt->name, curvar->len) == 0)
+ (*prefix_eq) (curvar->name, curvar->len, curt->name, curt->len))
return success;
curvar = LVAR_NEXT(curvar);
diff --git a/contrib/ltree/ltree.h b/contrib/ltree/ltree.h
index 5e0761641d3..08199ceb588 100644
--- a/contrib/ltree/ltree.h
+++ b/contrib/ltree/ltree.h
@@ -208,9 +208,11 @@ bool ltree_execute(ITEM *curitem, void *checkval,
int ltree_compare(const ltree *a, const ltree *b);
bool inner_isparent(const ltree *c, const ltree *p);
bool compare_subnode(ltree_level *t, char *qn, int len,
- int (*cmpptr) (const char *, const char *, size_t), bool anyend);
+ bool (*prefix_eq) (const char *, size_t, const char *, size_t),
+ bool anyend);
ltree *lca_inner(ltree **a, int len);
-int ltree_strncasecmp(const char *a, const char *b, size_t s);
+bool ltree_prefix_eq(const char *a, size_t a_sz, const char *b, size_t b_sz);
+bool ltree_prefix_eq_ci(const char *a, size_t a_sz, const char *b, size_t b_sz);
/* fmgr macros for ltree objects */
#define DatumGetLtreeP(X) ((ltree *) PG_DETOAST_DATUM(X))
diff --git a/contrib/ltree/ltxtquery_op.c b/contrib/ltree/ltxtquery_op.c
index 002102c9c75..e3666a2d46e 100644
--- a/contrib/ltree/ltxtquery_op.c
+++ b/contrib/ltree/ltxtquery_op.c
@@ -58,19 +58,19 @@ checkcondition_str(void *checkval, ITEM *val)
ltree_level *level = LTREE_FIRST(((CHKVAL *) checkval)->node);
int tlen = ((CHKVAL *) checkval)->node->numlevel;
char *op = ((CHKVAL *) checkval)->operand + val->distance;
- int (*cmpptr) (const char *, const char *, size_t);
+ bool (*prefix_eq) (const char *, size_t, const char *, size_t);
- cmpptr = (val->flag & LVAR_INCASE) ? ltree_strncasecmp : strncmp;
+ prefix_eq = (val->flag & LVAR_INCASE) ? ltree_prefix_eq_ci : ltree_prefix_eq;
while (tlen > 0)
{
if (val->flag & LVAR_SUBLEXEME)
{
- if (compare_subnode(level, op, val->length, cmpptr, (val->flag & LVAR_ANYEND)))
+ if (compare_subnode(level, op, val->length, prefix_eq, (val->flag & LVAR_ANYEND)))
return true;
}
else if ((val->length == level->len ||
(level->len > val->length && (val->flag & LVAR_ANYEND))) &&
- (*cmpptr) (op, level->name, val->length) == 0)
+ (*prefix_eq) (op, val->length, level->name, level->len))
return true;
tlen--;
--
2.43.0
From 785323a138398465bb10e8ecdda5fef9cd19edd1 Mon Sep 17 00:00:00 2001
From: Jeff Davis <[email protected]>
Date: Thu, 4 Dec 2025 10:38:11 -0800
Subject: [PATCH v12 4/8] Fix inconsistency between ltree_strncasecmp() and
ltree_crc32_sz().
Previously, ltree_strncasecmp() used lowercasing with the default
collation, while ltree_crc32_sz() used tolower() directly. These were
equivalent only if the default collation provider was libc and the
encoding was single-byte.
Change both to use casefolding with the default collation.
Reviewed-by: Chao Li <[email protected]>
Reviewed-by: Peter Eisentraut <[email protected]>
Discussion: https://postgr.es/m/[email protected]
---
contrib/ltree/crc32.c | 46 ++++++++++++++++++++++++++++++++-------
contrib/ltree/lquery_op.c | 39 ++++++++++++++++++++++++++++++---
2 files changed, 74 insertions(+), 11 deletions(-)
diff --git a/contrib/ltree/crc32.c b/contrib/ltree/crc32.c
index 134f46a805e..3918d4a0ec2 100644
--- a/contrib/ltree/crc32.c
+++ b/contrib/ltree/crc32.c
@@ -10,31 +10,61 @@
#include "postgres.h"
#include "ltree.h"
+#include "crc32.h"
+#include "utils/pg_crc.h"
#ifdef LOWER_NODE
-#include <ctype.h>
-#define TOLOWER(x) tolower((unsigned char) (x))
-#else
-#define TOLOWER(x) (x)
+#include "utils/pg_locale.h"
#endif
-#include "crc32.h"
-#include "utils/pg_crc.h"
+#ifdef LOWER_NODE
unsigned int
ltree_crc32_sz(const char *buf, int size)
{
pg_crc32 crc;
const char *p = buf;
+ static pg_locale_t locale = NULL;
+
+ if (!locale)
+ locale = pg_database_locale();
INIT_TRADITIONAL_CRC32(crc);
while (size > 0)
{
- char c = (char) TOLOWER(*p);
+ char foldstr[UNICODE_CASEMAP_BUFSZ];
+ int srclen = pg_mblen(p);
+ size_t foldlen;
+
+ /* fold one codepoint at a time */
+ foldlen = pg_strfold(foldstr, UNICODE_CASEMAP_BUFSZ, p, srclen,
+ locale);
+
+ COMP_TRADITIONAL_CRC32(crc, foldstr, foldlen);
+
+ size -= srclen;
+ p += srclen;
+ }
+ FIN_TRADITIONAL_CRC32(crc);
+ return (unsigned int) crc;
+}
+
+#else
- COMP_TRADITIONAL_CRC32(crc, &c, 1);
+unsigned int
+ltree_crc32_sz(const char *buf, int size)
+{
+ pg_crc32 crc;
+ const char *p = buf;
+
+ INIT_TRADITIONAL_CRC32(crc);
+ while (size > 0)
+ {
+ COMP_TRADITIONAL_CRC32(crc, p, 1);
size--;
p++;
}
FIN_TRADITIONAL_CRC32(crc);
return (unsigned int) crc;
}
+
+#endif /* !LOWER_NODE */
diff --git a/contrib/ltree/lquery_op.c b/contrib/ltree/lquery_op.c
index d58e34769b8..8abd0de1a9c 100644
--- a/contrib/ltree/lquery_op.c
+++ b/contrib/ltree/lquery_op.c
@@ -94,11 +94,44 @@ ltree_prefix_eq(const char *a, size_t a_sz, const char *b, size_t b_sz)
bool
ltree_prefix_eq_ci(const char *a, size_t a_sz, const char *b, size_t b_sz)
{
- char *al = str_tolower(a, a_sz, DEFAULT_COLLATION_OID);
- char *bl = str_tolower(b, b_sz, DEFAULT_COLLATION_OID);
+ static pg_locale_t locale = NULL;
+ size_t al_sz = a_sz + 1;
+ size_t al_len;
+ char *al = palloc(al_sz);
+ size_t bl_sz = b_sz + 1;
+ size_t bl_len;
+ char *bl = palloc(bl_sz);
bool res;
- res = (strncmp(al, bl, a_sz) == 0);
+ if (!locale)
+ locale = pg_database_locale();
+
+ /* casefold both a and b */
+
+ al_len = pg_strfold(al, al_sz, a, a_sz, locale);
+ if (al_len + 1 > al_sz)
+ {
+ /* grow buffer if needed and retry */
+ al_sz = al_len + 1;
+ al = repalloc(al, al_sz);
+ al_len = pg_strfold(al, al_sz, a, a_sz, locale);
+ Assert(al_len + 1 <= al_sz);
+ }
+
+ bl_len = pg_strfold(bl, bl_sz, b, b_sz, locale);
+ if (bl_len + 1 > bl_sz)
+ {
+ /* grow buffer if needed and retry */
+ bl_sz = bl_len + 1;
+ bl = repalloc(bl, bl_sz);
+ bl_len = pg_strfold(bl, bl_sz, b, b_sz, locale);
+ Assert(bl_len + 1 <= bl_sz);
+ }
+
+ if (al_len > bl_len)
+ res = false;
+ else
+ res = (strncmp(al, bl, al_len) == 0);
pfree(al);
pfree(bl);
--
2.43.0
From 93c283ff78c32719e2a50f60efc829bb5998e9da Mon Sep 17 00:00:00 2001
From: Jeff Davis <[email protected]>
Date: Mon, 20 Oct 2025 16:32:18 -0700
Subject: [PATCH v12 5/8] downcase_identifier(): use method table from locale
provider.
Previously, libc's tolower() was always used for lowercasing
identifiers, regardless of the database locale (though only characters
beyond 127 in single-byte encodings were affected). Refactor to allow
each provider to supply its own implementation of identifier
downcasing.
For historical compatibility, when using a single-byte encoding, ICU
still relies on tolower().
One minor behavior change is that, before the database default locale
is initialized, it uses ASCII semantics to downcase the
identifiers. Previously, it would use the postmaster's LC_CTYPE
setting from the environment. While that could have some effect during
GUC processing, for example, it would have been fragile to rely on the
environment setting anyway. (Also, it only matters when the encoding
is single-byte.)
Reviewed-by: Chao Li <[email protected]>
Reviewed-by: Peter Eisentraut <[email protected]>
Discussion: https://postgr.es/m/[email protected]
---
src/backend/parser/scansup.c | 36 ++++++++---------------
src/backend/utils/adt/pg_locale.c | 20 +++++++++++++
src/backend/utils/adt/pg_locale_builtin.c | 2 ++
src/backend/utils/adt/pg_locale_icu.c | 36 ++++++++++++++++++++++-
src/backend/utils/adt/pg_locale_libc.c | 33 +++++++++++++++++++++
src/include/utils/pg_locale.h | 5 ++++
6 files changed, 107 insertions(+), 25 deletions(-)
diff --git a/src/backend/parser/scansup.c b/src/backend/parser/scansup.c
index 2feb2b6cf5a..d63cb865260 100644
--- a/src/backend/parser/scansup.c
+++ b/src/backend/parser/scansup.c
@@ -18,6 +18,7 @@
#include "mb/pg_wchar.h"
#include "parser/scansup.h"
+#include "utils/pg_locale.h"
/*
@@ -46,35 +47,22 @@ char *
downcase_identifier(const char *ident, int len, bool warn, bool truncate)
{
char *result;
- int i;
- bool enc_is_single_byte;
-
- result = palloc(len + 1);
- enc_is_single_byte = pg_database_encoding_max_length() == 1;
+ size_t needed pg_attribute_unused();
/*
- * SQL99 specifies Unicode-aware case normalization, which we don't yet
- * have the infrastructure for. Instead we use tolower() to provide a
- * locale-aware translation. However, there are some locales where this
- * is not right either (eg, Turkish may do strange things with 'i' and
- * 'I'). Our current compromise is to use tolower() for characters with
- * the high bit set, as long as they aren't part of a multi-byte
- * character, and use an ASCII-only downcasing for 7-bit characters.
+ * Preserves string length.
+ *
+ * NB: if we decide to support Unicode-aware identifier case folding, then
+ * we need to account for a change in string length.
*/
- for (i = 0; i < len; i++)
- {
- unsigned char ch = (unsigned char) ident[i];
+ result = palloc(len + 1);
- if (ch >= 'A' && ch <= 'Z')
- ch += 'a' - 'A';
- else if (enc_is_single_byte && IS_HIGHBIT_SET(ch) && isupper(ch))
- ch = tolower(ch);
- result[i] = (char) ch;
- }
- result[i] = '\0';
+ needed = pg_downcase_ident(result, len + 1, ident, len);
+ Assert(needed == len);
+ Assert(result[len] == '\0');
- if (i >= NAMEDATALEN && truncate)
- truncate_identifier(result, i, warn);
+ if (len >= NAMEDATALEN && truncate)
+ truncate_identifier(result, len, warn);
return result;
}
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 8a3796aa5d0..ee08ac045b7 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -1352,6 +1352,26 @@ pg_strfold(char *dst, size_t dstsize, const char *src, ssize_t srclen,
return locale->ctype->strfold(dst, dstsize, src, srclen, locale);
}
+/*
+ * Lowercase an identifier using the database default locale.
+ *
+ * For historical reasons, does not use ordinary locale behavior. Should only
+ * be used for identifiers. XXX: can we make this equivalent to
+ * pg_strfold(..., default_locale)?
+ */
+size_t
+pg_downcase_ident(char *dst, size_t dstsize, const char *src, ssize_t srclen)
+{
+ pg_locale_t locale = default_locale;
+
+ if (locale == NULL || locale->ctype == NULL ||
+ locale->ctype->downcase_ident == NULL)
+ return strlower_c(dst, dstsize, src, srclen);
+ else
+ return locale->ctype->downcase_ident(dst, dstsize, src, srclen,
+ locale);
+}
+
/*
* pg_strcoll
*
diff --git a/src/backend/utils/adt/pg_locale_builtin.c b/src/backend/utils/adt/pg_locale_builtin.c
index 0c2920112bb..145b4641b1b 100644
--- a/src/backend/utils/adt/pg_locale_builtin.c
+++ b/src/backend/utils/adt/pg_locale_builtin.c
@@ -208,6 +208,8 @@ static const struct ctype_methods ctype_methods_builtin = {
.strtitle = strtitle_builtin,
.strupper = strupper_builtin,
.strfold = strfold_builtin,
+ /* uses plain ASCII semantics for historical reasons */
+ .downcase_ident = NULL,
.wc_isdigit = wc_isdigit_builtin,
.wc_isalpha = wc_isalpha_builtin,
.wc_isalnum = wc_isalnum_builtin,
diff --git a/src/backend/utils/adt/pg_locale_icu.c b/src/backend/utils/adt/pg_locale_icu.c
index 18d026deda8..69f22b47a68 100644
--- a/src/backend/utils/adt/pg_locale_icu.c
+++ b/src/backend/utils/adt/pg_locale_icu.c
@@ -61,6 +61,8 @@ static size_t strupper_icu(char *dest, size_t destsize, const char *src,
ssize_t srclen, pg_locale_t locale);
static size_t strfold_icu(char *dest, size_t destsize, const char *src,
ssize_t srclen, pg_locale_t locale);
+static size_t downcase_ident_icu(char *dst, size_t dstsize, const char *src,
+ ssize_t srclen, pg_locale_t locale);
static int strncoll_icu(const char *arg1, ssize_t len1,
const char *arg2, ssize_t len2,
pg_locale_t locale);
@@ -123,7 +125,7 @@ static int32_t u_strFoldCase_default(UChar *dest, int32_t destCapacity,
/*
* XXX: many of the functions below rely on casts directly from pg_wchar to
- * UChar32, which is correct for the UTF-8 encoding, but not in general.
+ * UChar32, which is correct for UTF-8 and LATIN1, but not in general.
*/
static pg_wchar
@@ -227,6 +229,7 @@ static const struct ctype_methods ctype_methods_icu = {
.strtitle = strtitle_icu,
.strupper = strupper_icu,
.strfold = strfold_icu,
+ .downcase_ident = downcase_ident_icu,
.wc_isdigit = wc_isdigit_icu,
.wc_isalpha = wc_isalpha_icu,
.wc_isalnum = wc_isalnum_icu,
@@ -564,6 +567,37 @@ strfold_icu(char *dest, size_t destsize, const char *src, ssize_t srclen,
return result_len;
}
+/*
+ * For historical compatibility, behavior is not multibyte-aware.
+ *
+ * NB: uses libc tolower() for single-byte encodings (also for historical
+ * compatibility), and therefore relies on the global LC_CTYPE setting.
+ */
+static size_t
+downcase_ident_icu(char *dst, size_t dstsize, const char *src,
+ ssize_t srclen, pg_locale_t locale)
+{
+ int i;
+ bool enc_is_single_byte;
+
+ enc_is_single_byte = pg_database_encoding_max_length() == 1;
+ for (i = 0; i < srclen && i < dstsize; i++)
+ {
+ unsigned char ch = (unsigned char) src[i];
+
+ if (ch >= 'A' && ch <= 'Z')
+ ch = pg_ascii_tolower(ch);
+ else if (enc_is_single_byte && IS_HIGHBIT_SET(ch) && isupper(ch))
+ ch = tolower(ch);
+ dst[i] = (char) ch;
+ }
+
+ if (i < dstsize)
+ dst[i] = '\0';
+
+ return srclen;
+}
+
/*
* strncoll_icu_utf8
*
diff --git a/src/backend/utils/adt/pg_locale_libc.c b/src/backend/utils/adt/pg_locale_libc.c
index 3baa5816b5f..ab6117aaace 100644
--- a/src/backend/utils/adt/pg_locale_libc.c
+++ b/src/backend/utils/adt/pg_locale_libc.c
@@ -318,12 +318,41 @@ tolower_libc_mb(pg_wchar wc, pg_locale_t locale)
return wc;
}
+/*
+ * Characters A..Z always downcase to a..z, even in the Turkish
+ * locale. Characters beyond 127 use tolower().
+ */
+static size_t
+downcase_ident_libc_sb(char *dst, size_t dstsize, const char *src,
+ ssize_t srclen, pg_locale_t locale)
+{
+ locale_t loc = locale->lt;
+ int i;
+
+ for (i = 0; i < srclen && i < dstsize; i++)
+ {
+ unsigned char ch = (unsigned char) src[i];
+
+ if (ch >= 'A' && ch <= 'Z')
+ ch = pg_ascii_tolower(ch);
+ else if (IS_HIGHBIT_SET(ch) && isupper_l(ch, loc))
+ ch = tolower_l(ch, loc);
+ dst[i] = (char) ch;
+ }
+
+ if (i < dstsize)
+ dst[i] = '\0';
+
+ return srclen;
+}
+
static const struct ctype_methods ctype_methods_libc_sb = {
.strlower = strlower_libc_sb,
.strtitle = strtitle_libc_sb,
.strupper = strupper_libc_sb,
/* in libc, casefolding is the same as lowercasing */
.strfold = strlower_libc_sb,
+ .downcase_ident = downcase_ident_libc_sb,
.wc_isdigit = wc_isdigit_libc_sb,
.wc_isalpha = wc_isalpha_libc_sb,
.wc_isalnum = wc_isalnum_libc_sb,
@@ -349,6 +378,8 @@ static const struct ctype_methods ctype_methods_libc_other_mb = {
.strupper = strupper_libc_mb,
/* in libc, casefolding is the same as lowercasing */
.strfold = strlower_libc_mb,
+ /* uses plain ASCII semantics for historical reasons */
+ .downcase_ident = NULL,
.wc_isdigit = wc_isdigit_libc_sb,
.wc_isalpha = wc_isalpha_libc_sb,
.wc_isalnum = wc_isalnum_libc_sb,
@@ -370,6 +401,8 @@ static const struct ctype_methods ctype_methods_libc_utf8 = {
.strupper = strupper_libc_mb,
/* in libc, casefolding is the same as lowercasing */
.strfold = strlower_libc_mb,
+ /* uses plain ASCII semantics for historical reasons */
+ .downcase_ident = NULL,
.wc_isdigit = wc_isdigit_libc_mb,
.wc_isalpha = wc_isalpha_libc_mb,
.wc_isalnum = wc_isalnum_libc_mb,
diff --git a/src/include/utils/pg_locale.h b/src/include/utils/pg_locale.h
index 01f891def7a..614affa1e91 100644
--- a/src/include/utils/pg_locale.h
+++ b/src/include/utils/pg_locale.h
@@ -110,6 +110,9 @@ struct ctype_methods
size_t (*strfold) (char *dest, size_t destsize,
const char *src, ssize_t srclen,
pg_locale_t locale);
+ size_t (*downcase_ident) (char *dest, size_t destsize,
+ const char *src, ssize_t srclen,
+ pg_locale_t locale);
/* required */
bool (*wc_isdigit) (pg_wchar wc, pg_locale_t locale);
@@ -188,6 +191,8 @@ extern size_t pg_strupper(char *dst, size_t dstsize,
extern size_t pg_strfold(char *dst, size_t dstsize,
const char *src, ssize_t srclen,
pg_locale_t locale);
+extern size_t pg_downcase_ident(char *dst, size_t dstsize,
+ const char *src, ssize_t srclen);
extern int pg_strcoll(const char *arg1, const char *arg2, pg_locale_t locale);
extern int pg_strncoll(const char *arg1, ssize_t len1,
const char *arg2, ssize_t len2, pg_locale_t locale);
--
2.43.0
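For readers following along, here is a minimal stand-alone sketch of the downcasing rule the patch above implements: A..Z always map to a..z regardless of locale, high-bit bytes are lowercased only when a libc locale object is supplied, and the return value is the source length even when the destination is too small. The name downcase_ident_sketch() and the convention of passing (locale_t) 0 for "ASCII only" are assumptions made for this illustration; they are not part of the patch. It also assumes a POSIX.1-2008 libc that provides isupper_l() and tolower_l().

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <ctype.h>
#include <locale.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-alone version of the identifier downcasing loop.
 * "loc" may be (locale_t) 0, in which case only ASCII letters change.
 * Returns srclen; the result is NUL-terminated only if there is room.
 */
static size_t
downcase_ident_sketch(char *dst, size_t dstsize,
                      const char *src, size_t srclen, locale_t loc)
{
    size_t      i;

    assert(dst != NULL && src != NULL);

    for (i = 0; i < srclen && i < dstsize; i++)
    {
        unsigned char ch = (unsigned char) src[i];

        if (ch >= 'A' && ch <= 'Z')
            ch = (unsigned char) (ch + ('a' - 'A'));    /* locale-independent */
        else if (loc != (locale_t) 0 && (ch & 0x80) && isupper_l(ch, loc))
            ch = (unsigned char) tolower_l(ch, loc);    /* single-byte only */
        dst[i] = (char) ch;
    }

    if (i < dstsize)
        dst[i] = '\0';

    return srclen;
}
```

A caller can compare the return value with the buffer size to detect truncation, in the same way as the other case-mapping functions in the method table.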
From 82ce5b3d0ebea2b41806710ffe4aa2e1c5240861 Mon Sep 17 00:00:00 2001
From: Jeff Davis <[email protected]>
Date: Sun, 26 Oct 2025 15:12:38 -0700
Subject: [PATCH v12 6/8] Avoid global LC_CTYPE dependency in pg_locale_icu.c.

ICU still depends on libc for compatibility with certain historical
behavior for single-byte encodings. Make the dependency explicit by
holding a locale_t object when required.

We should consider a better solution in the future, such as decoding
the text to UTF-32 and using u_tolower(). That would require
additional infrastructure though; so for now, just avoid the global
LC_CTYPE dependency.

Discussion: https://postgr.es/m/[email protected]
---
src/backend/utils/adt/pg_locale_icu.c | 47 ++++++++++++++++++++++++---
src/include/utils/pg_locale.h | 1 +
2 files changed, 44 insertions(+), 4 deletions(-)
diff --git a/src/backend/utils/adt/pg_locale_icu.c b/src/backend/utils/adt/pg_locale_icu.c
index 69f22b47a68..43d44fe43bd 100644
--- a/src/backend/utils/adt/pg_locale_icu.c
+++ b/src/backend/utils/adt/pg_locale_icu.c
@@ -244,6 +244,29 @@ static const struct ctype_methods ctype_methods_icu = {
.wc_toupper = toupper_icu,
.wc_tolower = tolower_icu,
};
+
+/*
+ * ICU still depends on libc for compatibility with certain historical
+ * behavior for single-byte encodings. See downcase_ident_icu().
+ *
+ * XXX: consider fixing by decoding the single byte into a code point, and
+ * using u_tolower().
+ */
+static locale_t
+make_libc_ctype_locale(const char *ctype)
+{
+ locale_t loc;
+
+#ifndef WIN32
+ loc = newlocale(LC_CTYPE_MASK, ctype, NULL);
+#else
+ loc = _create_locale(LC_ALL, ctype);
+#endif
+ if (!loc)
+ report_newlocale_failure(ctype);
+
+ return loc;
+}
#endif
pg_locale_t
@@ -254,6 +277,7 @@ create_pg_locale_icu(Oid collid, MemoryContext context)
const char *iculocstr;
const char *icurules = NULL;
UCollator *collator;
+ locale_t loc = (locale_t) 0;
pg_locale_t result;
if (collid == DEFAULT_COLLATION_OID)
@@ -276,6 +300,18 @@ create_pg_locale_icu(Oid collid, MemoryContext context)
if (!isnull)
icurules = TextDatumGetCString(datum);
+ /* libc only needed for default locale and single-byte encoding */
+ if (pg_database_encoding_max_length() == 1)
+ {
+ const char *ctype;
+
+ datum = SysCacheGetAttrNotNull(DATABASEOID, tp,
+ Anum_pg_database_datctype);
+ ctype = TextDatumGetCString(datum);
+
+ loc = make_libc_ctype_locale(ctype);
+ }
+
ReleaseSysCache(tp);
}
else
@@ -306,6 +342,7 @@ create_pg_locale_icu(Oid collid, MemoryContext context)
result = MemoryContextAllocZero(context, sizeof(struct pg_locale_struct));
result->icu.locale = MemoryContextStrdup(context, iculocstr);
result->icu.ucol = collator;
+ result->icu.lt = loc;
result->deterministic = deterministic;
result->collate_is_c = false;
result->ctype_is_c = false;
@@ -578,17 +615,19 @@ downcase_ident_icu(char *dst, size_t dstsize, const char *src,
ssize_t srclen, pg_locale_t locale)
{
int i;
- bool enc_is_single_byte;
+ bool libc_lower;
+ locale_t lt = locale->icu.lt;
+
+ libc_lower = lt && (pg_database_encoding_max_length() == 1);
- enc_is_single_byte = pg_database_encoding_max_length() == 1;
for (i = 0; i < srclen && i < dstsize; i++)
{
unsigned char ch = (unsigned char) src[i];
if (ch >= 'A' && ch <= 'Z')
ch = pg_ascii_tolower(ch);
- else if (enc_is_single_byte && IS_HIGHBIT_SET(ch) && isupper(ch))
- ch = tolower(ch);
+ else if (libc_lower && IS_HIGHBIT_SET(ch) && isupper_l(ch, lt))
+ ch = tolower_l(ch, lt);
dst[i] = (char) ch;
}
diff --git a/src/include/utils/pg_locale.h b/src/include/utils/pg_locale.h
index 614affa1e91..8ad8900cf93 100644
--- a/src/include/utils/pg_locale.h
+++ b/src/include/utils/pg_locale.h
@@ -167,6 +167,7 @@ struct pg_locale_struct
{
const char *locale;
UCollator *ucol;
+ locale_t lt;
} icu;
#endif
};
--
2.43.0
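As a rough illustration of what "holding a locale_t object" buys, the sketch below builds a CTYPE-only locale with newlocale() and uses it through the *_l functions, so the process-wide setlocale() state is never consulted or changed. The helper names are hypothetical, error reporting is reduced to returning (locale_t) 0, and only the POSIX branch of the patch is covered (not the Windows _create_locale() path).

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <ctype.h>
#include <locale.h>
#include <string.h>

/*
 * Hypothetical helper: create a locale object that carries only LC_CTYPE
 * behavior.  Returns (locale_t) 0 if the locale name is not available.
 */
static locale_t
make_ctype_locale_sketch(const char *ctype)
{
    assert(ctype != NULL);
    return newlocale(LC_CTYPE_MASK, ctype, (locale_t) 0);
}

/* Lowercase one byte using "loc" only; the global locale is not involved. */
static unsigned char
lower_byte_with(locale_t loc, unsigned char ch)
{
    return isupper_l(ch, loc) ? (unsigned char) tolower_l(ch, loc) : ch;
}
```

The object must eventually be released with freelocale(); the sketch leaves that to the caller.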
From 8bea39a2780283d4afdd75e0eb4a01b50d524faf Mon Sep 17 00:00:00 2001
From: Jeff Davis <[email protected]>
Date: Wed, 19 Nov 2025 13:24:38 -0800
Subject: [PATCH v12 7/8] fuzzystrmatch: use pg_ascii_toupper().

fuzzystrmatch is designed for ASCII, so no need to rely on the global
LC_CTYPE setting.

TODO: what about the '\xc7' (C with cedilla) case? Also, what should
the behavior be for soundex()?

Discussion: https://postgr.es/m/[email protected]
---
contrib/fuzzystrmatch/dmetaphone.c | 2 +-
contrib/fuzzystrmatch/fuzzystrmatch.c | 43 +++++++++++++++------------
2 files changed, 25 insertions(+), 20 deletions(-)
diff --git a/contrib/fuzzystrmatch/dmetaphone.c b/contrib/fuzzystrmatch/dmetaphone.c
index 227d8b11ddc..5e8ee2b0354 100644
--- a/contrib/fuzzystrmatch/dmetaphone.c
+++ b/contrib/fuzzystrmatch/dmetaphone.c
@@ -284,7 +284,7 @@ MakeUpper(metastring *s)
char *i;
for (i = s->str; *i; i++)
- *i = toupper((unsigned char) *i);
+ *i = pg_ascii_toupper((unsigned char) *i);
}
diff --git a/contrib/fuzzystrmatch/fuzzystrmatch.c b/contrib/fuzzystrmatch/fuzzystrmatch.c
index e7cc314b763..319302af0e4 100644
--- a/contrib/fuzzystrmatch/fuzzystrmatch.c
+++ b/contrib/fuzzystrmatch/fuzzystrmatch.c
@@ -62,7 +62,7 @@ static const char *const soundex_table = "01230120022455012623010202";
static char
soundex_code(char letter)
{
- letter = toupper((unsigned char) letter);
+ letter = pg_ascii_toupper((unsigned char) letter);
/* Defend against non-ASCII letters */
if (letter >= 'A' && letter <= 'Z')
return soundex_table[letter - 'A'];
@@ -122,16 +122,21 @@ static const char _codes[26] = {
static int
getcode(char c)
{
- if (isalpha((unsigned char) c))
- {
- c = toupper((unsigned char) c);
- /* Defend against non-ASCII letters */
- if (c >= 'A' && c <= 'Z')
- return _codes[c - 'A'];
- }
+ c = pg_ascii_toupper((unsigned char) c);
+ /* Defend against non-ASCII letters */
+ if (c >= 'A' && c <= 'Z')
+ return _codes[c - 'A'];
+
return 0;
}
+static bool
+ascii_isalpha(char c)
+{
+ return (c >= 'A' && c <= 'Z') ||
+ (c >= 'a' && c <= 'z');
+}
+
#define isvowel(c) (getcode(c) & 1) /* AEIOU */
/* These letters are passed through unchanged */
@@ -301,18 +306,18 @@ metaphone(PG_FUNCTION_ARGS)
* accessing the array directly... */
/* Look at the next letter in the word */
-#define Next_Letter (toupper((unsigned char) word[w_idx+1]))
+#define Next_Letter (pg_ascii_toupper((unsigned char) word[w_idx+1]))
/* Look at the current letter in the word */
-#define Curr_Letter (toupper((unsigned char) word[w_idx]))
+#define Curr_Letter (pg_ascii_toupper((unsigned char) word[w_idx]))
/* Go N letters back. */
#define Look_Back_Letter(n) \
- (w_idx >= (n) ? toupper((unsigned char) word[w_idx-(n)]) : '\0')
+ (w_idx >= (n) ? pg_ascii_toupper((unsigned char) word[w_idx-(n)]) : '\0')
/* Previous letter. I dunno, should this return null on failure? */
#define Prev_Letter (Look_Back_Letter(1))
/* Look two letters down. It makes sure you don't walk off the string. */
#define After_Next_Letter \
- (Next_Letter != '\0' ? toupper((unsigned char) word[w_idx+2]) : '\0')
-#define Look_Ahead_Letter(n) toupper((unsigned char) Lookahead(word+w_idx, n))
+ (Next_Letter != '\0' ? pg_ascii_toupper((unsigned char) word[w_idx+2]) : '\0')
+#define Look_Ahead_Letter(n) pg_ascii_toupper((unsigned char) Lookahead(word+w_idx, n))
/* Allows us to safely look ahead an arbitrary # of letters */
@@ -340,7 +345,7 @@ Lookahead(char *word, int how_far)
#define Phone_Len (p_idx)
/* Note is a letter is a 'break' in the word */
-#define Isbreak(c) (!isalpha((unsigned char) (c)))
+#define Isbreak(c) (!ascii_isalpha((unsigned char) (c)))
static void
@@ -379,7 +384,7 @@ _metaphone(char *word, /* IN */
/*-- The first phoneme has to be processed specially. --*/
/* Find our first letter */
- for (; !isalpha((unsigned char) (Curr_Letter)); w_idx++)
+ for (; !ascii_isalpha((unsigned char) (Curr_Letter)); w_idx++)
{
/* On the off chance we were given nothing but crap... */
if (Curr_Letter == '\0')
@@ -478,7 +483,7 @@ _metaphone(char *word, /* IN */
*/
/* Ignore non-alphas */
- if (!isalpha((unsigned char) (Curr_Letter)))
+ if (!ascii_isalpha((unsigned char) (Curr_Letter)))
continue;
/* Drop duplicates, except CC */
@@ -731,7 +736,7 @@ _soundex(const char *instr, char *outstr)
Assert(outstr);
/* Skip leading non-alphabetic characters */
- while (*instr && !isalpha((unsigned char) *instr))
+ while (*instr && !ascii_isalpha((unsigned char) *instr))
++instr;
/* If no string left, return all-zeroes buffer */
@@ -742,12 +747,12 @@ _soundex(const char *instr, char *outstr)
}
/* Take the first letter as is */
- *outstr++ = (char) toupper((unsigned char) *instr++);
+ *outstr++ = (char) pg_ascii_toupper((unsigned char) *instr++);
count = 1;
while (*instr && count < SOUNDEX_LEN)
{
- if (isalpha((unsigned char) *instr) &&
+ if (ascii_isalpha((unsigned char) *instr) &&
soundex_code(*instr) != soundex_code(*(instr - 1)))
{
*outstr = soundex_code(*instr);
--
2.43.0
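To make the behavior change in the patch above concrete, here is a small sketch of ASCII-only classification and the soundex code lookup. With these helpers a byte such as 0xC7 is never treated as a letter, whatever LC_CTYPE says. The *_sketch names are made up for this example, and returning '\0' for non-letters is an assumption, since that part of soundex_code() is outside the diff context; the table string is the one shown in the diff.

```c
#include <assert.h>
#include <stdbool.h>

/* Uppercase a..z only; every other byte is returned unchanged. */
static unsigned char
ascii_toupper_sketch(unsigned char ch)
{
    if (ch >= 'a' && ch <= 'z')
        ch = (unsigned char) (ch - ('a' - 'A'));
    return ch;
}

/* True only for the 52 ASCII letters, independent of any locale. */
static bool
ascii_isalpha_sketch(unsigned char ch)
{
    return (ch >= 'A' && ch <= 'Z') || (ch >= 'a' && ch <= 'z');
}

/* Soundex digit for an ASCII letter; '\0' for any other byte (assumption). */
static char
soundex_code_sketch(char letter)
{
    static const char table[] = "01230120022455012623010202";
    unsigned char up = ascii_toupper_sketch((unsigned char) letter);

    assert(sizeof(table) - 1 == 26);    /* one code per letter A..Z */

    if (up >= 'A' && up <= 'Z')
        return table[up - 'A'];
    return '\0';
}
```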
From 68b05c20a68098613fbe6657ddb2b07c5ffd3d0b Mon Sep 17 00:00:00 2001
From: Jeff Davis <[email protected]>
Date: Mon, 24 Nov 2025 14:00:52 -0800
Subject: [PATCH v12 8/8] Control LC_COLLATE with GUC.

Now that the global LC_COLLATE setting is not used for any in-core
purpose at all (see commit 5e6e42e44f), allow it to be set with a
GUC. This may be useful for extensions or procedural languages that
still depend on the global LC_COLLATE setting.

TODO: needs discussion

Discussion: https://postgr.es/m/[email protected]
---
src/backend/utils/adt/pg_locale.c | 59 +++++++++++++++++++
src/backend/utils/init/postinit.c | 2 +
src/backend/utils/misc/guc_parameters.dat | 9 +++
src/backend/utils/misc/postgresql.conf.sample | 2 +
src/bin/initdb/initdb.c | 3 +
src/include/utils/guc_hooks.h | 2 +
src/include/utils/pg_locale.h | 1 +
7 files changed, 78 insertions(+)
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index ee08ac045b7..6dfbe8af47b 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -81,6 +81,7 @@ extern pg_locale_t create_pg_locale_libc(Oid collid, MemoryContext context);
extern char *get_collation_actual_version_libc(const char *collcollate);
/* GUC settings */
+char *locale_collate;
char *locale_messages;
char *locale_monetary;
char *locale_numeric;
@@ -369,6 +370,64 @@ assign_locale_time(const char *newval, void *extra)
CurrentLCTimeValid = false;
}
+/*
+ * We allow LC_COLLATE to actually be set globally.
+ *
+ * Note: we normally disallow value = "" because it wouldn't have consistent
+ * semantics (it'd effectively just use the previous value). However, this
+ * is the value passed for PGC_S_DEFAULT, so don't complain in that case,
+ * not even if the attempted setting fails due to invalid environment value.
+ * The idea there is just to accept the environment setting *if possible*
+ * during startup, until we can read the proper value from postgresql.conf.
+ */
+bool
+check_locale_collate(char **newval, void **extra, GucSource source)
+{
+ int locale_enc;
+ int db_enc;
+
+ if (**newval == '\0')
+ {
+ if (source == PGC_S_DEFAULT)
+ return true;
+ else
+ return false;
+ }
+
+ locale_enc = pg_get_encoding_from_locale(*newval, true);
+ db_enc = GetDatabaseEncoding();
+
+ if (!(locale_enc == db_enc ||
+ locale_enc == PG_SQL_ASCII ||
+ db_enc == PG_SQL_ASCII ||
+ locale_enc == -1))
+ {
+ if (source == PGC_S_FILE)
+ {
+ guc_free(*newval);
+ *newval = guc_strdup(LOG, "C");
+ if (!*newval)
+ return false;
+ }
+ else if (source != PGC_S_TEST)
+ {
+ ereport(WARNING,
+ (errmsg("encoding mismatch"),
+ errdetail("Locale \"%s\" uses encoding \"%s\", which does not match database encoding \"%s\".",
+ *newval, pg_encoding_to_char(locale_enc), pg_encoding_to_char(db_enc))));
+ return false;
+ }
+ }
+
+ return check_locale(LC_COLLATE, *newval, NULL);
+}
+
+void
+assign_locale_collate(const char *newval, void *extra)
+{
+ (void) pg_perm_setlocale(LC_COLLATE, newval);
+}
+
/*
* We allow LC_MESSAGES to actually be set globally.
*
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 4ed69ac7ba2..8586832acaa 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -404,6 +404,8 @@ CheckMyDatabase(const char *name, bool am_superuser, bool override_allow_connect
* the pg_database tuple.
*/
SetDatabaseEncoding(dbform->encoding);
+ /* Reset lc_collate to check encoding, and fall back to C if necessary */
+ SetConfigOption("lc_collate", locale_collate, PGC_POSTMASTER, PGC_S_FILE);
/* Record it as a GUC internal option, too */
SetConfigOption("server_encoding", GetDatabaseEncodingName(),
PGC_INTERNAL, PGC_S_DYNAMIC_DEFAULT);
diff --git a/src/backend/utils/misc/guc_parameters.dat b/src/backend/utils/misc/guc_parameters.dat
index 3b9d8349078..a36c680719f 100644
--- a/src/backend/utils/misc/guc_parameters.dat
+++ b/src/backend/utils/misc/guc_parameters.dat
@@ -1457,6 +1457,15 @@
boot_val => 'PG_KRB_SRVTAB',
},
+{ name => 'lc_collate', type => 'string', context => 'PGC_SUSET', group => 'CLIENT_CONN_LOCALE',
+ short_desc => 'Sets the locale for text ordering in extensions.',
+ long_desc => 'An empty string means use the operating system setting.',
+ variable => 'locale_collate',
+ boot_val => '""',
+ check_hook => 'check_locale_collate',
+ assign_hook => 'assign_locale_collate',
+},
+
{ name => 'lc_messages', type => 'string', context => 'PGC_SUSET', group => 'CLIENT_CONN_LOCALE',
short_desc => 'Sets the language in which messages are displayed.',
long_desc => 'An empty string means use the operating system setting.',
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index dc9e2255f8a..19332e39e82 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -798,6 +798,8 @@
# encoding
# These settings are initialized by initdb, but they can be changed.
+#lc_collate = '' # locale for text ordering (only affects
+ # extensions)
#lc_messages = '' # locale for system error message
# strings
#lc_monetary = 'C' # locale for monetary formatting
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 92fe2f531f7..8b2e7bfab6f 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -1312,6 +1312,9 @@ setup_config(void)
conflines = replace_guc_value(conflines, "shared_buffers",
repltok, false);
+ conflines = replace_guc_value(conflines, "lc_collate",
+ lc_collate, false);
+
conflines = replace_guc_value(conflines, "lc_messages",
lc_messages, false);
diff --git a/src/include/utils/guc_hooks.h b/src/include/utils/guc_hooks.h
index 82ac8646a8d..8a20f76eec8 100644
--- a/src/include/utils/guc_hooks.h
+++ b/src/include/utils/guc_hooks.h
@@ -65,6 +65,8 @@ extern bool check_huge_page_size(int *newval, void **extra, GucSource source);
extern void assign_io_method(int newval, void *extra);
extern bool check_io_max_concurrency(int *newval, void **extra, GucSource source);
extern const char *show_in_hot_standby(void);
+extern bool check_locale_collate(char **newval, void **extra, GucSource source);
+extern void assign_locale_collate(const char *newval, void *extra);
extern bool check_locale_messages(char **newval, void **extra, GucSource source);
extern void assign_locale_messages(const char *newval, void *extra);
extern bool check_locale_monetary(char **newval, void **extra, GucSource source);
diff --git a/src/include/utils/pg_locale.h b/src/include/utils/pg_locale.h
index 8ad8900cf93..e29497dc7d2 100644
--- a/src/include/utils/pg_locale.h
+++ b/src/include/utils/pg_locale.h
@@ -41,6 +41,7 @@
#define UNICODE_CASEMAP_BUFSZ (UNICODE_CASEMAP_LEN * sizeof(char32_t))
/* GUC settings */
+extern PGDLLIMPORT char *locale_collate;
extern PGDLLIMPORT char *locale_messages;
extern PGDLLIMPORT char *locale_monetary;
extern PGDLLIMPORT char *locale_numeric;
--
2.43.0
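The decision logic in check_locale_collate() above can be summarized with the following sketch, which may help when discussing the TODO. The enum values and function names are placeholders invented for this example (the real code uses PostgreSQL's encoding IDs and GucSource values), and the final check_locale() validation step is left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Placeholder encoding IDs; -1 means "could not be determined". */
enum sketch_encoding
{
    SKETCH_ENC_UNKNOWN = -1,
    SKETCH_SQL_ASCII = 0,
    SKETCH_UTF8,
    SKETCH_LATIN1
};

/* Placeholder stand-ins for the GucSource values the hook distinguishes. */
enum sketch_source { SRC_FILE, SRC_TEST, SRC_OTHER };
enum sketch_action { KEEP_VALUE, USE_C_LOCALE, REJECT_VALUE };

/* A locale is acceptable if its encoding matches or either side is unknown/SQL_ASCII. */
static bool
locale_encoding_acceptable(int locale_enc, int db_enc)
{
    return locale_enc == db_enc ||
        locale_enc == SKETCH_SQL_ASCII ||
        db_enc == SKETCH_SQL_ASCII ||
        locale_enc == SKETCH_ENC_UNKNOWN;
}

/* What the check hook does with a non-empty value, before validating it. */
static enum sketch_action
collate_check_action(int locale_enc, int db_enc, enum sketch_source source)
{
    assert(db_enc >= 0);        /* the database encoding is always known */

    if (locale_encoding_acceptable(locale_enc, db_enc))
        return KEEP_VALUE;
    if (source == SRC_FILE)
        return USE_C_LOCALE;    /* silently fall back to "C" */
    if (source != SRC_TEST)
        return REJECT_VALUE;    /* warn about the mismatch and refuse */
    return KEEP_VALUE;          /* test-only settings pass through */
}
```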