On Wed, 2025-11-26 at 09:50 +0800, Chao Li wrote:
> * The retry logic implies that a single-byte char may become multiple
> bytes after folding, otherwise retry is not needed because you have
> allocated s+1 bytes for dest buffers. From this perspective, we
> should use two needed variables: neededA and neededB, if neededA !=
> neededB, then the two strings are different; if neededA == neededB,
> then we should perform strncmp, but here we should pass neededA
> (or neededB as they are identical) to strncmp(al, bl, neededA).
Thank you.
It's actually worse than that: having a single 's' argument is just
completely wrong. Consider:
a: U&'x\0394\0394\0394'
b: U&'\0394\0394\0394'
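There is no nonzero byte length 's' for which both 'a' and 'b' are
properly-encoded strings: assuming UTF-8, 'x' is one byte and U+0394 is
two, so any cut that ends on a character boundary in 'a' splits a
character in 'b'. So, the current code passes invalid byte sequences to
LOWER(), which is another pre-existing bug.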
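
To make that concrete, here is a small standalone sketch (not part of
the attached patches) that counts the byte lengths at which both
strings can be cut without splitting a UTF-8 character. The helper
names utf8_cut_is_clean() and count_valid_lengths() are made up for
this illustration, and the check assumes well-formed UTF-8 input.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper (not a PostgreSQL function): true if cutting the
 * UTF-8 string s (len bytes) at byte offset n does not split a
 * multibyte character.
 */
static bool
utf8_cut_is_clean(const char *s, size_t len, size_t n)
{
	assert(n <= len);
	if (n == len)
		return true;
	/* UTF-8 continuation bytes have the bit pattern 10xxxxxx */
	return (((unsigned char) s[n]) & 0xC0) != 0x80;
}

/*
 * Count the byte lengths s in 1..min(alen, blen) at which both a and b
 * can be cut cleanly.
 */
static int
count_valid_lengths(const char *a, size_t alen, const char *b, size_t blen)
{
	size_t		max_s = (alen < blen) ? alen : blen;
	int			count = 0;

	for (size_t s = 1; s <= max_s; s++)
	{
		if (utf8_cut_is_clean(a, alen, s) && utf8_cut_is_clean(b, blen, s))
			count++;
	}
	return count;
}
```

Called with the UTF-8 bytes of 'a' (7 bytes) and 'b' (6 bytes) from
above, count_valid_lengths() finds no usable length, because the clean
cut points of 'a' are the odd offsets and those of 'b' are the even
ones.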
ltree_strncasecmp() is only used for checking equality of the first s
bytes of the query, so let's make it a safer API that just checks
prefix equality. Attached.
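
As a rough standalone sketch of the shape of that API (the real code is
in the 0003 patch), the idea is to pass both lengths, fold each string
in full, and compare using the folded lengths. Here fold_ascii() is a
hypothetical ASCII-only stand-in for the locale-aware fold, and
prefix_eq_ci() is an illustrative name; because this stand-in never
changes the length, the grow-and-retry step from the patch is not
needed.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/*
 * Stand-in for a locale-aware fold (an assumption for this sketch): it
 * folds ASCII letters only. Writes at most dstsize bytes including the
 * terminator and returns the length the full result needs, excluding
 * the terminator.
 */
static size_t
fold_ascii(char *dst, size_t dstsize, const char *src, size_t srclen)
{
	for (size_t i = 0; i < srclen; i++)
	{
		char		c = src[i];

		if (c >= 'A' && c <= 'Z')
			c += 'a' - 'A';
		if (i + 1 < dstsize)
			dst[i] = c;
	}
	if (dstsize > 0)
		dst[(srclen < dstsize) ? srclen : dstsize - 1] = '\0';
	return srclen;
}

/* True if the folded form of a is a prefix of the folded form of b. */
static bool
prefix_eq_ci(const char *a, size_t a_sz, const char *b, size_t b_sz)
{
	char	   *al = malloc(a_sz + 1);
	char	   *bl = malloc(b_sz + 1);
	bool		res = false;

	if (al != NULL && bl != NULL)
	{
		size_t		al_len = fold_ascii(al, a_sz + 1, a, a_sz);
		size_t		bl_len = fold_ascii(bl, b_sz + 1, b, b_sz);

		/* compare folded lengths, never the input byte counts */
		res = (al_len <= bl_len) && memcmp(al, bl, al_len) == 0;
	}
	free(al);
	free(bl);
	return res;
}
```

The point of the design is that each input is folded with its own byte
length, so neither string is ever truncated in the middle of a
character before folding.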
> * Based on the logic you implemented in 0004, first pg_strfold() has
> copied as many chars as possible to dest buffer, so when retry,
> ideally we should resume instead of starting over. However, if
> single-byte->multi-byte folding happens, we have no information to
> decide from where to resume.
Right.
That suggests that we might want some kind of lazy or iterator-based
API for string folding. We'd need to find the right way to do that with
ICU. If we find that it's a performance problem somewhere, we can look
into that. Do you think we need that now?
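
For the sake of discussion, a lazy API might look roughly like the
standalone sketch below. Everything in it is hypothetical: fold_iter,
fold_iter_next_byte() and prefix_eq_ci_lazy() are invented names, and
the per-character fold is an ASCII-only stand-in rather than anything
ICU provides. The iterator hands out folded bytes one at a time from a
small internal buffer, so a fold that expands one character into
several bytes would not need a retry.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical iterator state: remaining input plus one folded chunk. */
typedef struct
{
	const char *src;
	size_t		left;
	char		buf[8];
	size_t		buflen;
	size_t		bufpos;
} fold_iter;

/* Produce the next folded byte; returns false at end of input. */
static bool
fold_iter_next_byte(fold_iter *it, char *out)
{
	if (it->bufpos == it->buflen)
	{
		char		c;

		if (it->left == 0)
			return false;

		/* stand-in fold of one character (ASCII only) */
		c = *it->src++;
		it->left--;
		if (c >= 'A' && c <= 'Z')
			c += 'a' - 'A';
		it->buf[0] = c;
		it->buflen = 1;
		it->bufpos = 0;
	}
	*out = it->buf[it->bufpos++];
	return true;
}

/* True if folded a is a prefix of folded b, folding both lazily. */
static bool
prefix_eq_ci_lazy(const char *a, size_t a_sz, const char *b, size_t b_sz)
{
	fold_iter	ia = {a, a_sz, {0}, 0, 0};
	fold_iter	ib = {b, b_sz, {0}, 0, 0};
	char		ca;
	char		cb;

	while (fold_iter_next_byte(&ia, &ca))
	{
		if (!fold_iter_next_byte(&ib, &cb) || ca != cb)
			return false;
	}
	return true;
}
```

Comparing byte by byte rather than chunk by chunk means the two inputs
do not have to fold in aligned pieces, and the comparison stops at the
first mismatch without allocating.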
> From this perspective, in 0004, do we really need to take the try-
> the-best strategy for strlower_c()? If there are some other use cases
> that require data to be placed in dest buffer even if dest buffer
> doesn’t have enough space, then my patch [1] of changing
> strlower_libc_sb() should be considered.
I will look into that.
> SB_lower_char should be changed to C_IMatchText.
Updated comment.
> I think the comment should be updated accordingly, like “for ILIKE in
> the C locale”.
Done, thank you.
> * match is allocated with pattlen+1 bytes, is it long enough to hold
> pattlen multiple-byte chars?
>
> * match is not freed, but looks like it should be:
...
> Should “pos” be “part[pos]” assigning to match[match_pos++]?
All fixed, thank you! (I apologize for posting a patch in that state to
begin with...)
I also reorganized slightly to separate out the pg_iswcased() API into
its own patch, and moved the like_support.c changes from the ctype_is_c
patch (already committed: 1476028225) into the pattern prefixes patch.
Regards,
Jeff Davis
From a70ce0d50ae47ddaf3c310ebf94d24fdc642e074 Mon Sep 17 00:00:00 2001
From: Jeff Davis <[email protected]>
Date: Mon, 24 Nov 2025 09:09:06 -0800
Subject: [PATCH v11 1/9] Change some callers to use pg_ascii_toupper().
The input is ASCII anyway, so it's better to be clear that it's not
locale-dependent.
Discussion: https://postgr.es/m/[email protected]
---
src/backend/access/transam/xlogfuncs.c | 2 +-
src/backend/utils/adt/cash.c | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/src/backend/access/transam/xlogfuncs.c b/src/backend/access/transam/xlogfuncs.c
index 3e45fce43ed..a50345f9bf7 100644
--- a/src/backend/access/transam/xlogfuncs.c
+++ b/src/backend/access/transam/xlogfuncs.c
@@ -479,7 +479,7 @@ pg_split_walfile_name(PG_FUNCTION_ARGS)
/* Capitalize WAL file name. */
for (p = fname_upper; *p; p++)
- *p = pg_toupper((unsigned char) *p);
+ *p = pg_ascii_toupper((unsigned char) *p);
if (!IsXLogFileName(fname_upper))
ereport(ERROR,
diff --git a/src/backend/utils/adt/cash.c b/src/backend/utils/adt/cash.c
index 611d23f3cb0..623f6eec056 100644
--- a/src/backend/utils/adt/cash.c
+++ b/src/backend/utils/adt/cash.c
@@ -1035,7 +1035,7 @@ cash_words(PG_FUNCTION_ARGS)
appendStringInfoString(&buf, m0 == 1 ? " cent" : " cents");
/* capitalize output */
- buf.data[0] = pg_toupper((unsigned char) buf.data[0]);
+ buf.data[0] = pg_ascii_toupper((unsigned char) buf.data[0]);
/* return as text datum */
res = cstring_to_text_with_len(buf.data, buf.len);
--
2.43.0
From de4c04590f095d543cc24217945a259236ea866f Mon Sep 17 00:00:00 2001
From: Jeff Davis <[email protected]>
Date: Fri, 21 Nov 2025 12:41:47 -0800
Subject: [PATCH v11 2/9] Make regex "max_chr" depend on encoding, not
provider.
The previous per-provider "max_chr" field was there as a hack to
preserve the exact prior behavior, which depended on the
provider. Change to depend on the encoding, which makes more sense,
and remove the per-provider logic.
The only difference is for ICU: previously it always used
MAX_SIMPLE_CHR (0x7FF) regardless of the encoding; whereas now it will
match libc and use MAX_SIMPLE_CHR for UTF-8, and MAX_UCHAR for other
encodings. That's possibly a loss for non-UTF8 multibyte encodings,
but a win for single-byte encodings. Regardless, this distinction was
not worth the complexity.
Discussion: https://postgr.es/m/[email protected]
Reviewed-by: Chao Li <[email protected]>
---
src/backend/regex/regc_pg_locale.c | 18 ++++++++++--------
src/backend/utils/adt/pg_locale_libc.c | 2 --
src/include/utils/pg_locale.h | 6 ------
3 files changed, 10 insertions(+), 16 deletions(-)
diff --git a/src/backend/regex/regc_pg_locale.c b/src/backend/regex/regc_pg_locale.c
index 4698f110a0c..bb0e3f1d139 100644
--- a/src/backend/regex/regc_pg_locale.c
+++ b/src/backend/regex/regc_pg_locale.c
@@ -320,16 +320,18 @@ regc_ctype_get_cache(regc_wc_probefunc probefunc, int cclasscode)
max_chr = (pg_wchar) MAX_SIMPLE_CHR;
#endif
}
+ else if (GetDatabaseEncoding() == PG_UTF8)
+ {
+ max_chr = (pg_wchar) MAX_SIMPLE_CHR;
+ }
else
{
- if (pg_regex_locale->ctype->max_chr != 0 &&
- pg_regex_locale->ctype->max_chr <= MAX_SIMPLE_CHR)
- {
- max_chr = pg_regex_locale->ctype->max_chr;
- pcc->cv.cclasscode = -1;
- }
- else
- max_chr = (pg_wchar) MAX_SIMPLE_CHR;
+#if MAX_SIMPLE_CHR >= UCHAR_MAX
+ max_chr = (pg_wchar) UCHAR_MAX;
+ pcc->cv.cclasscode = -1;
+#else
+ max_chr = (pg_wchar) MAX_SIMPLE_CHR;
+#endif
}
/*
diff --git a/src/backend/utils/adt/pg_locale_libc.c b/src/backend/utils/adt/pg_locale_libc.c
index e2beee44335..6ad3f93b543 100644
--- a/src/backend/utils/adt/pg_locale_libc.c
+++ b/src/backend/utils/adt/pg_locale_libc.c
@@ -342,7 +342,6 @@ static const struct ctype_methods ctype_methods_libc_sb = {
.char_tolower = char_tolower_libc,
.wc_toupper = toupper_libc_sb,
.wc_tolower = tolower_libc_sb,
- .max_chr = UCHAR_MAX,
};
/*
@@ -369,7 +368,6 @@ static const struct ctype_methods ctype_methods_libc_other_mb = {
.char_tolower = char_tolower_libc,
.wc_toupper = toupper_libc_sb,
.wc_tolower = tolower_libc_sb,
- .max_chr = UCHAR_MAX,
};
static const struct ctype_methods ctype_methods_libc_utf8 = {
diff --git a/src/include/utils/pg_locale.h b/src/include/utils/pg_locale.h
index 54193a17a90..42e21e7fb8a 100644
--- a/src/include/utils/pg_locale.h
+++ b/src/include/utils/pg_locale.h
@@ -134,12 +134,6 @@ struct ctype_methods
* pg_strlower().
*/
char (*char_tolower) (unsigned char ch, pg_locale_t locale);
-
- /*
- * For regex and pattern matching efficiency, the maximum char value
- * supported by the above methods. If zero, limit is set by regex code.
- */
- pg_wchar max_chr;
};
/*
--
2.43.0
From d485548107cc9c5833185932d462febe8fdf7ef1 Mon Sep 17 00:00:00 2001
From: Jeff Davis <[email protected]>
Date: Wed, 19 Nov 2025 10:20:36 -0800
Subject: [PATCH v11 3/9] Fix inconsistency between ltree_strncasecmp() and
ltree_crc32_sz().
Previously, ltree_strncasecmp() used lowercasing with the default
collation; while ltree_crc32_sz used tolower() directly. These were
equivalent only if the default collation provider was libc and the
encoding was single-byte.
Change both to use casefolding with the default collation.
Discussion: https://postgr.es/m/[email protected]
Reviewed-by: Chao Li <[email protected]>
---
contrib/ltree/crc32.c | 46 ++++++++++++++++++----
contrib/ltree/lquery_op.c | 74 ++++++++++++++++++++++++++++++------
contrib/ltree/ltree.h | 6 ++-
contrib/ltree/ltxtquery_op.c | 8 ++--
4 files changed, 108 insertions(+), 26 deletions(-)
diff --git a/contrib/ltree/crc32.c b/contrib/ltree/crc32.c
index 134f46a805e..3918d4a0ec2 100644
--- a/contrib/ltree/crc32.c
+++ b/contrib/ltree/crc32.c
@@ -10,31 +10,61 @@
#include "postgres.h"
#include "ltree.h"
+#include "crc32.h"
+#include "utils/pg_crc.h"
#ifdef LOWER_NODE
-#include <ctype.h>
-#define TOLOWER(x) tolower((unsigned char) (x))
-#else
-#define TOLOWER(x) (x)
+#include "utils/pg_locale.h"
#endif
-#include "crc32.h"
-#include "utils/pg_crc.h"
+#ifdef LOWER_NODE
unsigned int
ltree_crc32_sz(const char *buf, int size)
{
pg_crc32 crc;
const char *p = buf;
+ static pg_locale_t locale = NULL;
+
+ if (!locale)
+ locale = pg_database_locale();
INIT_TRADITIONAL_CRC32(crc);
while (size > 0)
{
- char c = (char) TOLOWER(*p);
+ char foldstr[UNICODE_CASEMAP_BUFSZ];
+ int srclen = pg_mblen(p);
+ size_t foldlen;
+
+ /* fold one codepoint at a time */
+ foldlen = pg_strfold(foldstr, UNICODE_CASEMAP_BUFSZ, p, srclen,
+ locale);
+
+ COMP_TRADITIONAL_CRC32(crc, foldstr, foldlen);
+
+ size -= srclen;
+ p += srclen;
+ }
+ FIN_TRADITIONAL_CRC32(crc);
+ return (unsigned int) crc;
+}
+
+#else
- COMP_TRADITIONAL_CRC32(crc, &c, 1);
+unsigned int
+ltree_crc32_sz(const char *buf, int size)
+{
+ pg_crc32 crc;
+ const char *p = buf;
+
+ INIT_TRADITIONAL_CRC32(crc);
+ while (size > 0)
+ {
+ COMP_TRADITIONAL_CRC32(crc, p, 1);
size--;
p++;
}
FIN_TRADITIONAL_CRC32(crc);
return (unsigned int) crc;
}
+
+#endif /* !LOWER_NODE */
diff --git a/contrib/ltree/lquery_op.c b/contrib/ltree/lquery_op.c
index a6466f575fd..ba8e114d742 100644
--- a/contrib/ltree/lquery_op.c
+++ b/contrib/ltree/lquery_op.c
@@ -41,7 +41,9 @@ getlexeme(char *start, char *end, int *len)
}
bool
-compare_subnode(ltree_level *t, char *qn, int len, int (*cmpptr) (const char *, const char *, size_t), bool anyend)
+compare_subnode(ltree_level *t, char *qn, int len,
+ bool (*prefix_eq) (const char *, size_t, const char *, size_t),
+ bool anyend)
{
char *endt = t->name + t->len;
char *endq = qn + len;
@@ -57,7 +59,7 @@ compare_subnode(ltree_level *t, char *qn, int len, int (*cmpptr) (const char *,
while ((tn = getlexeme(tn, endt, &lent)) != NULL)
{
if ((lent == lenq || (lent > lenq && anyend)) &&
- (*cmpptr) (qn, tn, lenq) == 0)
+ (*prefix_eq) (qn, lenq, tn, lent))
{
isok = true;
@@ -74,14 +76,62 @@ compare_subnode(ltree_level *t, char *qn, int len, int (*cmpptr) (const char *,
return true;
}
-int
-ltree_strncasecmp(const char *a, const char *b, size_t s)
+/*
+ * Check if a is a prefix of b.
+ */
+bool
+ltree_prefix_eq(const char *a, size_t a_sz, const char *b, size_t b_sz)
+{
+ if (a_sz > b_sz)
+ return false;
+ else
+ return (strncmp(a, b, a_sz) == 0);
+}
+
+/*
+ * Case-insensitive check if a is a prefix of b.
+ */
+bool
+ltree_prefix_eq_ci(const char *a, size_t a_sz, const char *b, size_t b_sz)
{
- char *al = str_tolower(a, s, DEFAULT_COLLATION_OID);
- char *bl = str_tolower(b, s, DEFAULT_COLLATION_OID);
- int res;
+ static pg_locale_t locale = NULL;
+ size_t al_sz = a_sz + 1;
+ size_t al_len;
+ char *al = palloc(al_sz);
+ size_t bl_sz = b_sz + 1;
+ size_t bl_len;
+ char *bl = palloc(bl_sz);
+ bool res;
+
+ if (!locale)
+ locale = pg_database_locale();
+
+ /* casefold both a and b */
+
+ al_len = pg_strfold(al, al_sz, a, a_sz, locale);
+ if (al_len + 1 > al_sz)
+ {
+ /* grow buffer if needed and retry */
+ al_sz = al_len + 1;
+ al = repalloc(al, al_sz);
+ al_len = pg_strfold(al, al_sz, a, a_sz, locale);
+ Assert(al_len + 1 <= al_sz);
+ }
+
+ bl_len = pg_strfold(bl, bl_sz, b, b_sz, locale);
+ if (bl_len + 1 > bl_sz)
+ {
+ /* grow buffer if needed and retry */
+ bl_sz = bl_len + 1;
+ bl = repalloc(bl, bl_sz);
+ bl_len = pg_strfold(bl, bl_sz, b, b_sz, locale);
+ Assert(bl_len + 1 <= bl_sz);
+ }
- res = strncmp(al, bl, s);
+ if (al_len > bl_len)
+ res = false;
+ else
+ res = (strncmp(al, bl, al_len) == 0);
pfree(al);
pfree(bl);
@@ -109,19 +159,19 @@ checkLevel(lquery_level *curq, ltree_level *curt)
for (int i = 0; i < curq->numvar; i++)
{
- int (*cmpptr) (const char *, const char *, size_t);
+ bool (*prefix_eq) (const char *, size_t, const char *, size_t);
- cmpptr = (curvar->flag & LVAR_INCASE) ? ltree_strncasecmp : strncmp;
+ prefix_eq = (curvar->flag & LVAR_INCASE) ? ltree_prefix_eq_ci : ltree_prefix_eq;
if (curvar->flag & LVAR_SUBLEXEME)
{
- if (compare_subnode(curt, curvar->name, curvar->len, cmpptr,
+ if (compare_subnode(curt, curvar->name, curvar->len, prefix_eq,
(curvar->flag & LVAR_ANYEND)))
return success;
}
else if ((curvar->len == curt->len ||
(curt->len > curvar->len && (curvar->flag & LVAR_ANYEND))) &&
- (*cmpptr) (curvar->name, curt->name, curvar->len) == 0)
+ (*prefix_eq) (curvar->name, curvar->len, curt->name, curt->len))
return success;
curvar = LVAR_NEXT(curvar);
diff --git a/contrib/ltree/ltree.h b/contrib/ltree/ltree.h
index 5e0761641d3..08199ceb588 100644
--- a/contrib/ltree/ltree.h
+++ b/contrib/ltree/ltree.h
@@ -208,9 +208,11 @@ bool ltree_execute(ITEM *curitem, void *checkval,
int ltree_compare(const ltree *a, const ltree *b);
bool inner_isparent(const ltree *c, const ltree *p);
bool compare_subnode(ltree_level *t, char *qn, int len,
- int (*cmpptr) (const char *, const char *, size_t), bool anyend);
+ bool (*prefix_eq) (const char *, size_t, const char *, size_t),
+ bool anyend);
ltree *lca_inner(ltree **a, int len);
-int ltree_strncasecmp(const char *a, const char *b, size_t s);
+bool ltree_prefix_eq(const char *a, size_t a_sz, const char *b, size_t b_sz);
+bool ltree_prefix_eq_ci(const char *a, size_t a_sz, const char *b, size_t b_sz);
/* fmgr macros for ltree objects */
#define DatumGetLtreeP(X) ((ltree *) PG_DETOAST_DATUM(X))
diff --git a/contrib/ltree/ltxtquery_op.c b/contrib/ltree/ltxtquery_op.c
index 002102c9c75..e3666a2d46e 100644
--- a/contrib/ltree/ltxtquery_op.c
+++ b/contrib/ltree/ltxtquery_op.c
@@ -58,19 +58,19 @@ checkcondition_str(void *checkval, ITEM *val)
ltree_level *level = LTREE_FIRST(((CHKVAL *) checkval)->node);
int tlen = ((CHKVAL *) checkval)->node->numlevel;
char *op = ((CHKVAL *) checkval)->operand + val->distance;
- int (*cmpptr) (const char *, const char *, size_t);
+ bool (*prefix_eq) (const char *, size_t, const char *, size_t);
- cmpptr = (val->flag & LVAR_INCASE) ? ltree_strncasecmp : strncmp;
+ prefix_eq = (val->flag & LVAR_INCASE) ? ltree_prefix_eq_ci : ltree_prefix_eq;
while (tlen > 0)
{
if (val->flag & LVAR_SUBLEXEME)
{
- if (compare_subnode(level, op, val->length, cmpptr, (val->flag & LVAR_ANYEND)))
+ if (compare_subnode(level, op, val->length, prefix_eq, (val->flag & LVAR_ANYEND)))
return true;
}
else if ((val->length == level->len ||
(level->len > val->length && (val->flag & LVAR_ANYEND))) &&
- (*cmpptr) (op, level->name, val->length) == 0)
+ (*prefix_eq) (op, val->length, level->name, level->len))
return true;
tlen--;
--
2.43.0
From 509118852993a2b1132de7ee28d43143bcfcef11 Mon Sep 17 00:00:00 2001
From: Jeff Davis <[email protected]>
Date: Wed, 19 Nov 2025 18:16:41 -0800
Subject: [PATCH v11 4/9] Remove char_tolower() API.
It's only useful for an ILIKE optimization for the libc provider using
a single-byte encoding and a non-C locale, but it creates significant
internal complexity.
Discussion: https://postgr.es/m/[email protected]
---
src/backend/utils/adt/like.c | 42 +++++++++-----------------
src/backend/utils/adt/like_match.c | 18 ++++++-----
src/backend/utils/adt/pg_locale.c | 26 ----------------
src/backend/utils/adt/pg_locale_libc.c | 10 ------
src/include/utils/pg_locale.h | 9 ------
5 files changed, 25 insertions(+), 80 deletions(-)
diff --git a/src/backend/utils/adt/like.c b/src/backend/utils/adt/like.c
index 4216ac17f43..4a7fc583c71 100644
--- a/src/backend/utils/adt/like.c
+++ b/src/backend/utils/adt/like.c
@@ -43,8 +43,8 @@ static text *MB_do_like_escape(text *pat, text *esc);
static int UTF8_MatchText(const char *t, int tlen, const char *p, int plen,
pg_locale_t locale);
-static int SB_IMatchText(const char *t, int tlen, const char *p, int plen,
- pg_locale_t locale);
+static int C_IMatchText(const char *t, int tlen, const char *p, int plen,
+ pg_locale_t locale);
static int GenericMatchText(const char *s, int slen, const char *p, int plen, Oid collation);
static int Generic_Text_IC_like(text *str, text *pat, Oid collation);
@@ -84,22 +84,10 @@ wchareq(const char *p1, const char *p2)
* of getting a single character transformed to the system's wchar_t format.
* So now, we just downcase the strings using lower() and apply regular LIKE
* comparison. This should be revisited when we install better locale support.
- */
-
-/*
- * We do handle case-insensitive matching for single-byte encodings using
+ *
+ * We do handle case-insensitive matching for the C locale using
* fold-on-the-fly processing, however.
*/
-static char
-SB_lower_char(unsigned char c, pg_locale_t locale)
-{
- if (locale->ctype_is_c)
- return pg_ascii_tolower(c);
- else if (locale->is_default)
- return pg_tolower(c);
- else
- return char_tolower(c, locale);
-}
#define NextByte(p, plen) ((p)++, (plen)--)
@@ -131,9 +119,9 @@ SB_lower_char(unsigned char c, pg_locale_t locale)
#include "like_match.c"
/* setup to compile like_match.c for single byte case insensitive matches */
-#define MATCH_LOWER(t, locale) SB_lower_char((unsigned char) (t), locale)
+#define MATCH_LOWER
#define NextChar(p, plen) NextByte((p), (plen))
-#define MatchText SB_IMatchText
+#define MatchText C_IMatchText
#include "like_match.c"
@@ -202,22 +190,17 @@ Generic_Text_IC_like(text *str, text *pat, Oid collation)
errmsg("nondeterministic collations are not supported for ILIKE")));
/*
- * For efficiency reasons, in the single byte case we don't call lower()
- * on the pattern and text, but instead call SB_lower_char on each
- * character. In the multi-byte case we don't have much choice :-(. Also,
- * ICU does not support single-character case folding, so we go the long
- * way.
+ * For efficiency reasons, in the C locale we don't call lower() on the
+ * pattern and text, but instead call SB_lower_char on each character.
*/
- if (locale->ctype_is_c ||
- (char_tolower_enabled(locale) &&
- pg_database_encoding_max_length() == 1))
+ if (locale->ctype_is_c)
{
p = VARDATA_ANY(pat);
plen = VARSIZE_ANY_EXHDR(pat);
s = VARDATA_ANY(str);
slen = VARSIZE_ANY_EXHDR(str);
- return SB_IMatchText(s, slen, p, plen, locale);
+ return C_IMatchText(s, slen, p, plen, locale);
}
else
{
@@ -229,10 +212,13 @@ Generic_Text_IC_like(text *str, text *pat, Oid collation)
PointerGetDatum(str)));
s = VARDATA_ANY(str);
slen = VARSIZE_ANY_EXHDR(str);
+
if (GetDatabaseEncoding() == PG_UTF8)
return UTF8_MatchText(s, slen, p, plen, 0);
- else
+ else if (pg_database_encoding_max_length() > 1)
return MB_MatchText(s, slen, p, plen, 0);
+ else
+ return SB_MatchText(s, slen, p, plen, 0);
}
}
diff --git a/src/backend/utils/adt/like_match.c b/src/backend/utils/adt/like_match.c
index 892f8a745ea..54846c9541d 100644
--- a/src/backend/utils/adt/like_match.c
+++ b/src/backend/utils/adt/like_match.c
@@ -70,10 +70,14 @@
*--------------------
*/
+/*
+ * MATCH_LOWER is defined for ILIKE in the C locale as an optimization. Other
+ * locales must casefold the inputs before matching.
+ */
#ifdef MATCH_LOWER
-#define GETCHAR(t, locale) MATCH_LOWER(t, locale)
+#define GETCHAR(t) pg_ascii_tolower(t)
#else
-#define GETCHAR(t, locale) (t)
+#define GETCHAR(t) (t)
#endif
static int
@@ -105,7 +109,7 @@ MatchText(const char *t, int tlen, const char *p, int plen, pg_locale_t locale)
ereport(ERROR,
(errcode(ERRCODE_INVALID_ESCAPE_SEQUENCE),
errmsg("LIKE pattern must not end with escape character")));
- if (GETCHAR(*p, locale) != GETCHAR(*t, locale))
+ if (GETCHAR(*p) != GETCHAR(*t))
return LIKE_FALSE;
}
else if (*p == '%')
@@ -167,14 +171,14 @@ MatchText(const char *t, int tlen, const char *p, int plen, pg_locale_t locale)
ereport(ERROR,
(errcode(ERRCODE_INVALID_ESCAPE_SEQUENCE),
errmsg("LIKE pattern must not end with escape character")));
- firstpat = GETCHAR(p[1], locale);
+ firstpat = GETCHAR(p[1]);
}
else
- firstpat = GETCHAR(*p, locale);
+ firstpat = GETCHAR(*p);
while (tlen > 0)
{
- if (GETCHAR(*t, locale) == firstpat || (locale && !locale->deterministic))
+ if (GETCHAR(*t) == firstpat || (locale && !locale->deterministic))
{
int matched = MatchText(t, tlen, p, plen, locale);
@@ -342,7 +346,7 @@ MatchText(const char *t, int tlen, const char *p, int plen, pg_locale_t locale)
NextChar(t1, t1len);
}
}
- else if (GETCHAR(*p, locale) != GETCHAR(*t, locale))
+ else if (GETCHAR(*p) != GETCHAR(*t))
{
/* non-wildcard pattern char fails to match text char */
return LIKE_FALSE;
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index b02e7fa4f18..5aba277ba99 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -1629,32 +1629,6 @@ char_is_cased(char ch, pg_locale_t locale)
return locale->ctype->char_is_cased(ch, locale);
}
-/*
- * char_tolower_enabled()
- *
- * Does the provider support char_tolower()?
- */
-bool
-char_tolower_enabled(pg_locale_t locale)
-{
- if (locale->ctype == NULL)
- return true;
- return (locale->ctype->char_tolower != NULL);
-}
-
-/*
- * char_tolower()
- *
- * Convert char (single-byte encoding) to lowercase.
- */
-char
-char_tolower(unsigned char ch, pg_locale_t locale)
-{
- if (locale->ctype == NULL)
- return pg_ascii_tolower(ch);
- return locale->ctype->char_tolower(ch, locale);
-}
-
/*
* Return required encoding ID for the given locale, or -1 if any encoding is
* valid for the locale.
diff --git a/src/backend/utils/adt/pg_locale_libc.c b/src/backend/utils/adt/pg_locale_libc.c
index 6ad3f93b543..91a892bb540 100644
--- a/src/backend/utils/adt/pg_locale_libc.c
+++ b/src/backend/utils/adt/pg_locale_libc.c
@@ -248,13 +248,6 @@ wc_isxdigit_libc_mb(pg_wchar wc, pg_locale_t locale)
#endif
}
-static char
-char_tolower_libc(unsigned char ch, pg_locale_t locale)
-{
- Assert(pg_database_encoding_max_length() == 1);
- return tolower_l(ch, locale->lt);
-}
-
static bool
char_is_cased_libc(char ch, pg_locale_t locale)
{
@@ -339,7 +332,6 @@ static const struct ctype_methods ctype_methods_libc_sb = {
.wc_isspace = wc_isspace_libc_sb,
.wc_isxdigit = wc_isxdigit_libc_sb,
.char_is_cased = char_is_cased_libc,
- .char_tolower = char_tolower_libc,
.wc_toupper = toupper_libc_sb,
.wc_tolower = tolower_libc_sb,
};
@@ -365,7 +357,6 @@ static const struct ctype_methods ctype_methods_libc_other_mb = {
.wc_isspace = wc_isspace_libc_sb,
.wc_isxdigit = wc_isxdigit_libc_sb,
.char_is_cased = char_is_cased_libc,
- .char_tolower = char_tolower_libc,
.wc_toupper = toupper_libc_sb,
.wc_tolower = tolower_libc_sb,
};
@@ -387,7 +378,6 @@ static const struct ctype_methods ctype_methods_libc_utf8 = {
.wc_isspace = wc_isspace_libc_mb,
.wc_isxdigit = wc_isxdigit_libc_mb,
.char_is_cased = char_is_cased_libc,
- .char_tolower = char_tolower_libc,
.wc_toupper = toupper_libc_mb,
.wc_tolower = tolower_libc_mb,
};
diff --git a/src/include/utils/pg_locale.h b/src/include/utils/pg_locale.h
index 42e21e7fb8a..50520e50127 100644
--- a/src/include/utils/pg_locale.h
+++ b/src/include/utils/pg_locale.h
@@ -127,13 +127,6 @@ struct ctype_methods
/* required */
bool (*char_is_cased) (char ch, pg_locale_t locale);
-
- /*
- * Optional. If defined, will only be called for single-byte encodings. If
- * not defined, or if the encoding is multibyte, will fall back to
- * pg_strlower().
- */
- char (*char_tolower) (unsigned char ch, pg_locale_t locale);
};
/*
@@ -185,8 +178,6 @@ extern pg_locale_t pg_newlocale_from_collation(Oid collid);
extern char *get_collation_actual_version(char collprovider, const char *collcollate);
extern bool char_is_cased(char ch, pg_locale_t locale);
-extern bool char_tolower_enabled(pg_locale_t locale);
-extern char char_tolower(unsigned char ch, pg_locale_t locale);
extern size_t pg_strlower(char *dst, size_t dstsize,
const char *src, ssize_t srclen,
pg_locale_t locale);
--
2.43.0
From b6de7ad668d90d2c15568e8d0321f7b140c36e01 Mon Sep 17 00:00:00 2001
From: Jeff Davis <[email protected]>
Date: Wed, 26 Nov 2025 10:28:36 -0800
Subject: [PATCH v11 5/9] Add pg_iswcased().
True if character has multiple case forms. Will be a useful
multibyte-aware replacement for char_is_cased().
---
src/backend/utils/adt/pg_locale.c | 11 +++++++++++
src/backend/utils/adt/pg_locale_builtin.c | 7 +++++++
src/backend/utils/adt/pg_locale_icu.c | 7 +++++++
src/backend/utils/adt/pg_locale_libc.c | 17 +++++++++++++++++
src/include/utils/pg_locale.h | 2 ++
5 files changed, 44 insertions(+)
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 5aba277ba99..e5e75ca2c2c 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -1588,6 +1588,17 @@ pg_iswxdigit(pg_wchar wc, pg_locale_t locale)
return locale->ctype->wc_isxdigit(wc, locale);
}
+bool
+pg_iswcased(pg_wchar wc, pg_locale_t locale)
+{
+ /* for the C locale, Cased and Alpha are equivalent */
+ if (locale->ctype == NULL)
+ return (wc <= (pg_wchar) 127 &&
+ (pg_char_properties[wc] & PG_ISALPHA));
+ else
+ return locale->ctype->wc_iscased(wc, locale);
+}
+
pg_wchar
pg_towupper(pg_wchar wc, pg_locale_t locale)
{
diff --git a/src/backend/utils/adt/pg_locale_builtin.c b/src/backend/utils/adt/pg_locale_builtin.c
index 1021e0d129b..0d4c754a267 100644
--- a/src/backend/utils/adt/pg_locale_builtin.c
+++ b/src/backend/utils/adt/pg_locale_builtin.c
@@ -185,6 +185,12 @@ wc_isxdigit_builtin(pg_wchar wc, pg_locale_t locale)
return pg_u_isxdigit(to_char32(wc), !locale->builtin.casemap_full);
}
+static bool
+wc_iscased_builtin(pg_wchar wc, pg_locale_t locale)
+{
+ return pg_u_prop_cased(to_char32(wc));
+}
+
static bool
char_is_cased_builtin(char ch, pg_locale_t locale)
{
@@ -220,6 +226,7 @@ static const struct ctype_methods ctype_methods_builtin = {
.wc_isspace = wc_isspace_builtin,
.wc_isxdigit = wc_isxdigit_builtin,
.char_is_cased = char_is_cased_builtin,
+ .wc_iscased = wc_iscased_builtin,
.wc_tolower = wc_tolower_builtin,
.wc_toupper = wc_toupper_builtin,
};
diff --git a/src/backend/utils/adt/pg_locale_icu.c b/src/backend/utils/adt/pg_locale_icu.c
index f5a0cc8fe41..e8820666b2d 100644
--- a/src/backend/utils/adt/pg_locale_icu.c
+++ b/src/backend/utils/adt/pg_locale_icu.c
@@ -223,6 +223,12 @@ wc_isxdigit_icu(pg_wchar wc, pg_locale_t locale)
return u_isxdigit(wc);
}
+static bool
+wc_iscased_icu(pg_wchar wc, pg_locale_t locale)
+{
+ return u_hasBinaryProperty(wc, UCHAR_CASED);
+}
+
static const struct ctype_methods ctype_methods_icu = {
.strlower = strlower_icu,
.strtitle = strtitle_icu,
@@ -239,6 +245,7 @@ static const struct ctype_methods ctype_methods_icu = {
.wc_isspace = wc_isspace_icu,
.wc_isxdigit = wc_isxdigit_icu,
.char_is_cased = char_is_cased_icu,
+ .wc_iscased = wc_iscased_icu,
.wc_toupper = toupper_icu,
.wc_tolower = tolower_icu,
};
diff --git a/src/backend/utils/adt/pg_locale_libc.c b/src/backend/utils/adt/pg_locale_libc.c
index 91a892bb540..cd54198f0c7 100644
--- a/src/backend/utils/adt/pg_locale_libc.c
+++ b/src/backend/utils/adt/pg_locale_libc.c
@@ -184,6 +184,13 @@ wc_isxdigit_libc_sb(pg_wchar wc, pg_locale_t locale)
#endif
}
+static bool
+wc_iscased_libc_sb(pg_wchar wc, pg_locale_t locale)
+{
+ return isupper_l((unsigned char) wc, locale->lt) ||
+ islower_l((unsigned char) wc, locale->lt);
+}
+
static bool
wc_isdigit_libc_mb(pg_wchar wc, pg_locale_t locale)
{
@@ -248,6 +255,13 @@ wc_isxdigit_libc_mb(pg_wchar wc, pg_locale_t locale)
#endif
}
+static bool
+wc_iscased_libc_mb(pg_wchar wc, pg_locale_t locale)
+{
+ return iswupper_l((wint_t) wc, locale->lt) ||
+ iswlower_l((wint_t) wc, locale->lt);
+}
+
static bool
char_is_cased_libc(char ch, pg_locale_t locale)
{
@@ -332,6 +346,7 @@ static const struct ctype_methods ctype_methods_libc_sb = {
.wc_isspace = wc_isspace_libc_sb,
.wc_isxdigit = wc_isxdigit_libc_sb,
.char_is_cased = char_is_cased_libc,
+ .wc_iscased = wc_iscased_libc_sb,
.wc_toupper = toupper_libc_sb,
.wc_tolower = tolower_libc_sb,
};
@@ -357,6 +372,7 @@ static const struct ctype_methods ctype_methods_libc_other_mb = {
.wc_isspace = wc_isspace_libc_sb,
.wc_isxdigit = wc_isxdigit_libc_sb,
.char_is_cased = char_is_cased_libc,
+ .wc_iscased = wc_iscased_libc_sb,
.wc_toupper = toupper_libc_sb,
.wc_tolower = tolower_libc_sb,
};
@@ -378,6 +394,7 @@ static const struct ctype_methods ctype_methods_libc_utf8 = {
.wc_isspace = wc_isspace_libc_mb,
.wc_isxdigit = wc_isxdigit_libc_mb,
.char_is_cased = char_is_cased_libc,
+ .wc_iscased = wc_iscased_libc_mb,
.wc_toupper = toupper_libc_mb,
.wc_tolower = tolower_libc_mb,
};
diff --git a/src/include/utils/pg_locale.h b/src/include/utils/pg_locale.h
index 50520e50127..832007385d8 100644
--- a/src/include/utils/pg_locale.h
+++ b/src/include/utils/pg_locale.h
@@ -122,6 +122,7 @@ struct ctype_methods
bool (*wc_ispunct) (pg_wchar wc, pg_locale_t locale);
bool (*wc_isspace) (pg_wchar wc, pg_locale_t locale);
bool (*wc_isxdigit) (pg_wchar wc, pg_locale_t locale);
+ bool (*wc_iscased) (pg_wchar wc, pg_locale_t locale);
pg_wchar (*wc_toupper) (pg_wchar wc, pg_locale_t locale);
pg_wchar (*wc_tolower) (pg_wchar wc, pg_locale_t locale);
@@ -214,6 +215,7 @@ extern bool pg_iswprint(pg_wchar wc, pg_locale_t locale);
extern bool pg_iswpunct(pg_wchar wc, pg_locale_t locale);
extern bool pg_iswspace(pg_wchar wc, pg_locale_t locale);
extern bool pg_iswxdigit(pg_wchar wc, pg_locale_t locale);
+extern bool pg_iswcased(pg_wchar wc, pg_locale_t locale);
extern pg_wchar pg_towupper(pg_wchar wc, pg_locale_t locale);
extern pg_wchar pg_towlower(pg_wchar wc, pg_locale_t locale);
--
2.43.0
From bf07696da676482d874602bd0f267b979dea5b82 Mon Sep 17 00:00:00 2001
From: Jeff Davis <[email protected]>
Date: Wed, 26 Nov 2025 10:28:45 -0800
Subject: [PATCH v11 6/9] Use multibyte-aware extraction of pattern prefixes.
Previously, like_fixed_prefix() used char-at-a-time logic, which
forced it to be too conservative for case-insensitive matching.
Now, use pg_wchar-at-a-time loop for text types, along with proper
detection of cased characters; and preserve the char-at-a-time logic
for bytea.
Removes the pg_locale_t char_is_cased() single-byte method and
replaces it with a proper multibyte pg_iswcased() method.
Discussion: https://postgr.es/m/[email protected]
---
src/backend/utils/adt/like.c | 9 +-
src/backend/utils/adt/like_support.c | 128 ++++++++++++----------
src/backend/utils/adt/pg_locale.c | 15 ---
src/backend/utils/adt/pg_locale_builtin.c | 8 --
src/backend/utils/adt/pg_locale_icu.c | 8 --
src/backend/utils/adt/pg_locale_libc.c | 14 ---
src/include/utils/pg_locale.h | 3 -
7 files changed, 76 insertions(+), 109 deletions(-)
diff --git a/src/backend/utils/adt/like.c b/src/backend/utils/adt/like.c
index 4a7fc583c71..772879f0a81 100644
--- a/src/backend/utils/adt/like.c
+++ b/src/backend/utils/adt/like.c
@@ -118,7 +118,7 @@ wchareq(const char *p1, const char *p2)
#include "like_match.c"
-/* setup to compile like_match.c for single byte case insensitive matches */
+/* setup to compile like_match.c for case-insensitive matches in C locale */
#define MATCH_LOWER
#define NextChar(p, plen) NextByte((p), (plen))
#define MatchText C_IMatchText
@@ -190,8 +190,11 @@ Generic_Text_IC_like(text *str, text *pat, Oid collation)
errmsg("nondeterministic collations are not supported for ILIKE")));
/*
- * For efficiency reasons, in the C locale we don't call lower() on the
- * pattern and text, but instead call SB_lower_char on each character.
+ * For efficiency reasons, in the C locale lowercase each character
+ * lazily. Otherwise, we lowercase the entire pattern and text strings
+ * prior to matching.
+ *
+ * XXX: use casefolding instead?
*/
if (locale->ctype_is_c)
diff --git a/src/backend/utils/adt/like_support.c b/src/backend/utils/adt/like_support.c
index 999f23f86d5..2db06bd1728 100644
--- a/src/backend/utils/adt/like_support.c
+++ b/src/backend/utils/adt/like_support.c
@@ -99,8 +99,6 @@ static Selectivity like_selectivity(const char *patt, int pattlen,
static Selectivity regex_selectivity(const char *patt, int pattlen,
bool case_insensitive,
int fixed_prefix_len);
-static int pattern_char_isalpha(char c, bool is_multibyte,
- pg_locale_t locale);
static Const *make_greater_string(const Const *str_const, FmgrInfo *ltproc,
Oid collation);
static Datum string_to_datum(const char *str, Oid datatype);
@@ -989,13 +987,11 @@ static Pattern_Prefix_Status
like_fixed_prefix(Const *patt_const, bool case_insensitive, Oid collation,
Const **prefix_const, Selectivity *rest_selec)
{
- char *match;
char *patt;
int pattlen;
Oid typeid = patt_const->consttype;
- int pos,
- match_pos;
- bool is_multibyte = (pg_database_encoding_max_length() > 1);
+ int pos;
+ int match_pos = 0;
pg_locale_t locale = 0;
/* the right-hand const is type text or bytea */
@@ -1023,60 +1019,94 @@ like_fixed_prefix(Const *patt_const, bool case_insensitive, Oid collation,
locale = pg_newlocale_from_collation(collation);
}
+ /* for text types, use pg_wchar; for BYTEA, use char */
if (typeid != BYTEAOID)
{
- patt = TextDatumGetCString(patt_const->constvalue);
- pattlen = strlen(patt);
+ text *val = DatumGetTextPP(patt_const->constvalue);
+ pg_wchar *wpatt;
+ pg_wchar *wmatch;
+ char *match;
+ int match_mblen;
+
+ patt = VARDATA_ANY(val);
+ pattlen = VARSIZE_ANY_EXHDR(val);
+ wpatt = palloc((pattlen + 1) * sizeof(pg_wchar));
+ pg_mb2wchar_with_len(patt, wpatt, pattlen);
+
+ wmatch = palloc((pattlen + 1) * sizeof(pg_wchar));
+ for (pos = 0; pos < pattlen; pos++)
+ {
+ /* % and _ are wildcard characters in LIKE */
+ if (wpatt[pos] == '%' ||
+ wpatt[pos] == '_')
+ break;
+
+ /* Backslash escapes the next character */
+ if (wpatt[pos] == '\\')
+ {
+ pos++;
+ if (pos >= pattlen)
+ break;
+ }
+
+ /*
+ * For ILIKE, stop if it's a case-varying character (it's sort of
+ * a wildcard).
+ */
+ if (case_insensitive && pg_iswcased(wpatt[pos], locale))
+ break;
+
+ wmatch[match_pos++] = wpatt[pos];
+ }
+
+ wmatch[match_pos] = '\0';
+
+ match_mblen = pg_database_encoding_max_length() * match_pos + 1;
+ match = palloc(match_mblen);
+ pg_wchar2mb_with_len(wmatch, match, match_pos);
+
+ pfree(wpatt);
+ pfree(wmatch);
+
+ *prefix_const = string_to_const(match, typeid);
+ pfree(match);
}
else
{
bytea *bstr = DatumGetByteaPP(patt_const->constvalue);
+ char *match;
+ patt = VARDATA_ANY(bstr);
pattlen = VARSIZE_ANY_EXHDR(bstr);
- patt = (char *) palloc(pattlen);
- memcpy(patt, VARDATA_ANY(bstr), pattlen);
- Assert((Pointer) bstr == DatumGetPointer(patt_const->constvalue));
- }
-
- match = palloc(pattlen + 1);
- match_pos = 0;
- for (pos = 0; pos < pattlen; pos++)
- {
- /* % and _ are wildcard characters in LIKE */
- if (patt[pos] == '%' ||
- patt[pos] == '_')
- break;
- /* Backslash escapes the next character */
- if (patt[pos] == '\\')
+ match = palloc(pattlen + 1);
+ for (pos = 0; pos < pattlen; pos++)
{
- pos++;
- if (pos >= pattlen)
+ /* % and _ are wildcard characters in LIKE */
+ if (patt[pos] == '%' ||
+ patt[pos] == '_')
break;
- }
- /* Stop if case-varying character (it's sort of a wildcard) */
- if (case_insensitive &&
- pattern_char_isalpha(patt[pos], is_multibyte, locale))
- break;
-
- match[match_pos++] = patt[pos];
- }
+ /* Backslash escapes the next character */
+ if (patt[pos] == '\\')
+ {
+ pos++;
+ if (pos >= pattlen)
+ break;
+ }
- match[match_pos] = '\0';
+ match[match_pos++] = patt[pos];
+ }
- if (typeid != BYTEAOID)
- *prefix_const = string_to_const(match, typeid);
- else
*prefix_const = string_to_bytea_const(match, match_pos);
+ pfree(match);
+ }
+
if (rest_selec != NULL)
*rest_selec = like_selectivity(&patt[pos], pattlen - pos,
case_insensitive);
- pfree(patt);
- pfree(match);
-
/* in LIKE, an empty pattern is an exact match! */
if (pos == pattlen)
return Pattern_Prefix_Exact; /* reached end of pattern, so exact */
@@ -1481,24 +1511,6 @@ regex_selectivity(const char *patt, int pattlen, bool case_insensitive,
return sel;
}
-/*
- * Check whether char is a letter (and, hence, subject to case-folding)
- *
- * In multibyte character sets or with ICU, we can't use isalpha, and it does
- * not seem worth trying to convert to wchar_t to use iswalpha or u_isalpha.
- * Instead, just assume any non-ASCII char is potentially case-varying, and
- * hard-wire knowledge of which ASCII chars are letters.
- */
-static int
-pattern_char_isalpha(char c, bool is_multibyte,
- pg_locale_t locale)
-{
- if (locale->ctype_is_c)
- return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
- else
- return char_is_cased(c, locale);
-}
-
/*
* For bytea, the increment function need only increment the current byte
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index e5e75ca2c2c..c4e89502f85 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -1625,21 +1625,6 @@ pg_towlower(pg_wchar wc, pg_locale_t locale)
return locale->ctype->wc_tolower(wc, locale);
}
-/*
- * char_is_cased()
- *
- * Fuzzy test of whether the given char is case-varying or not. The argument
- * is a single byte, so in a multibyte encoding, just assume any non-ASCII
- * char is case-varying.
- */
-bool
-char_is_cased(char ch, pg_locale_t locale)
-{
- if (locale->ctype == NULL)
- return (ch >= 'A' && ch <= 'Z') || (ch >= 'a' && ch <= 'z');
- return locale->ctype->char_is_cased(ch, locale);
-}
-
/*
* Return required encoding ID for the given locale, or -1 if any encoding is
* valid for the locale.
diff --git a/src/backend/utils/adt/pg_locale_builtin.c b/src/backend/utils/adt/pg_locale_builtin.c
index 0d4c754a267..0c2920112bb 100644
--- a/src/backend/utils/adt/pg_locale_builtin.c
+++ b/src/backend/utils/adt/pg_locale_builtin.c
@@ -191,13 +191,6 @@ wc_iscased_builtin(pg_wchar wc, pg_locale_t locale)
return pg_u_prop_cased(to_char32(wc));
}
-static bool
-char_is_cased_builtin(char ch, pg_locale_t locale)
-{
- return IS_HIGHBIT_SET(ch) ||
- (ch >= 'A' && ch <= 'Z') || (ch >= 'a' && ch <= 'z');
-}
-
static pg_wchar
wc_toupper_builtin(pg_wchar wc, pg_locale_t locale)
{
@@ -225,7 +218,6 @@ static const struct ctype_methods ctype_methods_builtin = {
.wc_ispunct = wc_ispunct_builtin,
.wc_isspace = wc_isspace_builtin,
.wc_isxdigit = wc_isxdigit_builtin,
- .char_is_cased = char_is_cased_builtin,
.wc_iscased = wc_iscased_builtin,
.wc_tolower = wc_tolower_builtin,
.wc_toupper = wc_toupper_builtin,
diff --git a/src/backend/utils/adt/pg_locale_icu.c b/src/backend/utils/adt/pg_locale_icu.c
index e8820666b2d..18d026deda8 100644
--- a/src/backend/utils/adt/pg_locale_icu.c
+++ b/src/backend/utils/adt/pg_locale_icu.c
@@ -121,13 +121,6 @@ static int32_t u_strFoldCase_default(UChar *dest, int32_t destCapacity,
const char *locale,
UErrorCode *pErrorCode);
-static bool
-char_is_cased_icu(char ch, pg_locale_t locale)
-{
- return IS_HIGHBIT_SET(ch) ||
- (ch >= 'A' && ch <= 'Z') || (ch >= 'a' && ch <= 'z');
-}
-
/*
* XXX: many of the functions below rely on casts directly from pg_wchar to
* UChar32, which is correct for the UTF-8 encoding, but not in general.
@@ -244,7 +237,6 @@ static const struct ctype_methods ctype_methods_icu = {
.wc_ispunct = wc_ispunct_icu,
.wc_isspace = wc_isspace_icu,
.wc_isxdigit = wc_isxdigit_icu,
- .char_is_cased = char_is_cased_icu,
.wc_iscased = wc_iscased_icu,
.wc_toupper = toupper_icu,
.wc_tolower = tolower_icu,
diff --git a/src/backend/utils/adt/pg_locale_libc.c b/src/backend/utils/adt/pg_locale_libc.c
index cd54198f0c7..4cb3c64b4a6 100644
--- a/src/backend/utils/adt/pg_locale_libc.c
+++ b/src/backend/utils/adt/pg_locale_libc.c
@@ -262,17 +262,6 @@ wc_iscased_libc_mb(pg_wchar wc, pg_locale_t locale)
iswlower_l((wint_t) wc, locale->lt);
}
-static bool
-char_is_cased_libc(char ch, pg_locale_t locale)
-{
- bool is_multibyte = pg_database_encoding_max_length() > 1;
-
- if (is_multibyte && IS_HIGHBIT_SET(ch))
- return true;
- else
- return isalpha_l((unsigned char) ch, locale->lt);
-}
-
static pg_wchar
toupper_libc_sb(pg_wchar wc, pg_locale_t locale)
{
@@ -345,7 +334,6 @@ static const struct ctype_methods ctype_methods_libc_sb = {
.wc_ispunct = wc_ispunct_libc_sb,
.wc_isspace = wc_isspace_libc_sb,
.wc_isxdigit = wc_isxdigit_libc_sb,
- .char_is_cased = char_is_cased_libc,
.wc_iscased = wc_iscased_libc_sb,
.wc_toupper = toupper_libc_sb,
.wc_tolower = tolower_libc_sb,
@@ -371,7 +359,6 @@ static const struct ctype_methods ctype_methods_libc_other_mb = {
.wc_ispunct = wc_ispunct_libc_sb,
.wc_isspace = wc_isspace_libc_sb,
.wc_isxdigit = wc_isxdigit_libc_sb,
- .char_is_cased = char_is_cased_libc,
.wc_iscased = wc_iscased_libc_sb,
.wc_toupper = toupper_libc_sb,
.wc_tolower = tolower_libc_sb,
@@ -393,7 +380,6 @@ static const struct ctype_methods ctype_methods_libc_utf8 = {
.wc_ispunct = wc_ispunct_libc_mb,
.wc_isspace = wc_isspace_libc_mb,
.wc_isxdigit = wc_isxdigit_libc_mb,
- .char_is_cased = char_is_cased_libc,
.wc_iscased = wc_iscased_libc_mb,
.wc_toupper = toupper_libc_mb,
.wc_tolower = tolower_libc_mb,
diff --git a/src/include/utils/pg_locale.h b/src/include/utils/pg_locale.h
index 832007385d8..01f891def7a 100644
--- a/src/include/utils/pg_locale.h
+++ b/src/include/utils/pg_locale.h
@@ -125,9 +125,6 @@ struct ctype_methods
bool (*wc_iscased) (pg_wchar wc, pg_locale_t locale);
pg_wchar (*wc_toupper) (pg_wchar wc, pg_locale_t locale);
pg_wchar (*wc_tolower) (pg_wchar wc, pg_locale_t locale);
-
- /* required */
- bool (*char_is_cased) (char ch, pg_locale_t locale);
};
/*
--
2.43.0
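
For readers following along, here is a rough standalone sketch of the
prefix-extraction loop that like_fixed_prefix() uses for text types in the
patch above. It is not the patch code: sketch_iswcased() and
sketch_like_prefix() are hypothetical names, wchar_t stands in for pg_wchar,
and the pattern length is assumed to be given in characters rather than
bytes. iswupper()/iswlower() stand in for pg_iswcased(), so in the default
"C" locale only ASCII letters count as cased.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical stand-in for pg_iswcased(): does this character have case? */
static bool
sketch_iswcased(wchar_t wc)
{
	return iswupper((wint_t) wc) || iswlower((wint_t) wc);
}

/*
 * Copy the literal prefix of a LIKE pattern into "out", which must have room
 * for patlen + 1 elements. Returns the prefix length in characters and sets
 * *consumed to the number of pattern characters that were scanned. If
 * *consumed == patlen, the whole pattern was literal (an exact match).
 */
static size_t
sketch_like_prefix(const wchar_t *pat, size_t patlen, bool case_insensitive,
				   wchar_t *out, size_t *consumed)
{
	size_t		pos;
	size_t		n = 0;

	assert(pat != NULL && out != NULL && consumed != NULL);

	for (pos = 0; pos < patlen; pos++)
	{
		/* % and _ are wildcard characters in LIKE */
		if (pat[pos] == L'%' || pat[pos] == L'_')
			break;

		/* Backslash escapes the next character */
		if (pat[pos] == L'\\')
		{
			pos++;
			if (pos >= patlen)
				break;
		}

		/* For ILIKE, a cased character acts like a wildcard */
		if (case_insensitive && sketch_iswcased(pat[pos]))
			break;

		out[n++] = pat[pos];
	}

	out[n] = L'\0';
	*consumed = pos;
	return n;
}
```

With the pattern "12ab%" and case_insensitive = true, the sketch stops at
'a' and yields the prefix "12"; with case_insensitive = false it yields
"12ab".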
From 9d99649a07b5eb165254ca43ada6ffd1d4e36555 Mon Sep 17 00:00:00 2001
From: Jeff Davis <[email protected]>
Date: Wed, 19 Nov 2025 13:24:38 -0800
Subject: [PATCH v11 7/9] fuzzystrmatch: use pg_ascii_toupper().
fuzzystrmatch is designed for ASCII, so there is no need to rely on the
global LC_CTYPE setting.
Discussion: https://postgr.es/m/[email protected]
---
contrib/fuzzystrmatch/dmetaphone.c | 2 +-
contrib/fuzzystrmatch/fuzzystrmatch.c | 16 ++++++++--------
2 files changed, 9 insertions(+), 9 deletions(-)
diff --git a/contrib/fuzzystrmatch/dmetaphone.c b/contrib/fuzzystrmatch/dmetaphone.c
index 6627b2b8943..bb5d3e90756 100644
--- a/contrib/fuzzystrmatch/dmetaphone.c
+++ b/contrib/fuzzystrmatch/dmetaphone.c
@@ -284,7 +284,7 @@ MakeUpper(metastring *s)
char *i;
for (i = s->str; *i; i++)
- *i = toupper((unsigned char) *i);
+ *i = pg_ascii_toupper((unsigned char) *i);
}
diff --git a/contrib/fuzzystrmatch/fuzzystrmatch.c b/contrib/fuzzystrmatch/fuzzystrmatch.c
index e7cc314b763..7f07efc2c35 100644
--- a/contrib/fuzzystrmatch/fuzzystrmatch.c
+++ b/contrib/fuzzystrmatch/fuzzystrmatch.c
@@ -62,7 +62,7 @@ static const char *const soundex_table = "01230120022455012623010202";
static char
soundex_code(char letter)
{
- letter = toupper((unsigned char) letter);
+ letter = pg_ascii_toupper((unsigned char) letter);
/* Defend against non-ASCII letters */
if (letter >= 'A' && letter <= 'Z')
return soundex_table[letter - 'A'];
@@ -124,7 +124,7 @@ getcode(char c)
{
if (isalpha((unsigned char) c))
{
- c = toupper((unsigned char) c);
+ c = pg_ascii_toupper((unsigned char) c);
/* Defend against non-ASCII letters */
if (c >= 'A' && c <= 'Z')
return _codes[c - 'A'];
@@ -301,18 +301,18 @@ metaphone(PG_FUNCTION_ARGS)
* accessing the array directly... */
/* Look at the next letter in the word */
-#define Next_Letter (toupper((unsigned char) word[w_idx+1]))
+#define Next_Letter (pg_ascii_toupper((unsigned char) word[w_idx+1]))
/* Look at the current letter in the word */
-#define Curr_Letter (toupper((unsigned char) word[w_idx]))
+#define Curr_Letter (pg_ascii_toupper((unsigned char) word[w_idx]))
/* Go N letters back. */
#define Look_Back_Letter(n) \
- (w_idx >= (n) ? toupper((unsigned char) word[w_idx-(n)]) : '\0')
+ (w_idx >= (n) ? pg_ascii_toupper((unsigned char) word[w_idx-(n)]) : '\0')
/* Previous letter. I dunno, should this return null on failure? */
#define Prev_Letter (Look_Back_Letter(1))
/* Look two letters down. It makes sure you don't walk off the string. */
#define After_Next_Letter \
- (Next_Letter != '\0' ? toupper((unsigned char) word[w_idx+2]) : '\0')
-#define Look_Ahead_Letter(n) toupper((unsigned char) Lookahead(word+w_idx, n))
+ (Next_Letter != '\0' ? pg_ascii_toupper((unsigned char) word[w_idx+2]) : '\0')
+#define Look_Ahead_Letter(n) pg_ascii_toupper((unsigned char) Lookahead(word+w_idx, n))
/* Allows us to safely look ahead an arbitrary # of letters */
@@ -742,7 +742,7 @@ _soundex(const char *instr, char *outstr)
}
/* Take the first letter as is */
- *outstr++ = (char) toupper((unsigned char) *instr++);
+ *outstr++ = (char) pg_ascii_toupper((unsigned char) *instr++);
count = 1;
while (*instr && count < SOUNDEX_LEN)
--
2.43.0
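
To illustrate what the fuzzystrmatch change above relies on, here is a
minimal sketch of ASCII-only upper-casing. sketch_ascii_toupper() and
sketch_make_upper() are hypothetical stand-ins written for this note; they
are meant to mirror the intent of pg_ascii_toupper() and MakeUpper(), not to
reproduce their source. The point is that bytes outside a..z pass through
unchanged regardless of any locale setting.

```c
#include <assert.h>
#include <stddef.h>

/* Upper-case a single byte using ASCII rules only; other bytes are kept. */
static unsigned char
sketch_ascii_toupper(unsigned char ch)
{
	if (ch >= 'a' && ch <= 'z')
		ch += 'A' - 'a';
	return ch;
}

/* Upper-case a NUL-terminated buffer in place, one byte at a time. */
static void
sketch_make_upper(char *s)
{
	assert(s != NULL);

	for (; *s; s++)
		*s = (char) sketch_ascii_toupper((unsigned char) *s);
}
```

Unlike toupper(), the result here does not depend on the global LC_CTYPE
setting, so a high-bit byte such as 0xFC is never altered.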
From de1d8c438c74cbb0b8bba70172f02e746db21a05 Mon Sep 17 00:00:00 2001
From: Jeff Davis <[email protected]>
Date: Mon, 20 Oct 2025 16:32:18 -0700
Subject: [PATCH v11 8/9] downcase_identifier(): use method table from locale
provider.
Previously, libc's tolower() was always used for identifier case
folding, regardless of the database locale (though only characters
beyond 127 in single-byte encodings were affected). Refactor to allow
each provider to supply its own implementation of identifier
casefolding.
For historical compatibility, when using a single-byte encoding, ICU
still relies on tolower().
One minor behavior change is that, before the database default locale
is initialized, downcase_identifier() uses ASCII semantics to fold
identifiers. Previously, it would use the postmaster's LC_CTYPE
setting from the environment. While that could have some effect during
GUC processing, for example, it would have been fragile to rely on the
environment setting anyway. (Also, it only matters when the encoding
is single-byte.)
Discussion: https://postgr.es/m/[email protected]
---
src/backend/parser/scansup.c | 39 +++++++---------
src/backend/utils/adt/pg_locale.c | 32 +++++++++++++
src/backend/utils/adt/pg_locale_builtin.c | 24 ++++++++++
src/backend/utils/adt/pg_locale_icu.c | 36 ++++++++++++++-
src/backend/utils/adt/pg_locale_libc.c | 55 +++++++++++++++++++++++
src/include/utils/pg_locale.h | 5 +++
6 files changed, 166 insertions(+), 25 deletions(-)
diff --git a/src/backend/parser/scansup.c b/src/backend/parser/scansup.c
index 2feb2b6cf5a..0bd049643d1 100644
--- a/src/backend/parser/scansup.c
+++ b/src/backend/parser/scansup.c
@@ -18,6 +18,7 @@
#include "mb/pg_wchar.h"
#include "parser/scansup.h"
+#include "utils/pg_locale.h"
/*
@@ -46,35 +47,25 @@ char *
downcase_identifier(const char *ident, int len, bool warn, bool truncate)
{
char *result;
- int i;
- bool enc_is_single_byte;
-
- result = palloc(len + 1);
- enc_is_single_byte = pg_database_encoding_max_length() == 1;
+ size_t dstsize;
+ size_t needed pg_attribute_unused();
/*
- * SQL99 specifies Unicode-aware case normalization, which we don't yet
- * have the infrastructure for. Instead we use tolower() to provide a
- * locale-aware translation. However, there are some locales where this
- * is not right either (eg, Turkish may do strange things with 'i' and
- * 'I'). Our current compromise is to use tolower() for characters with
- * the high bit set, as long as they aren't part of a multi-byte
- * character, and use an ASCII-only downcasing for 7-bit characters.
+ * Preserves string length.
+ *
+ * NB: if we decide to support Unicode-aware identifier case folding, then
+ * we need to account for a change in string length.
*/
- for (i = 0; i < len; i++)
- {
- unsigned char ch = (unsigned char) ident[i];
+ dstsize = len + 1;
+ result = palloc(dstsize);
- if (ch >= 'A' && ch <= 'Z')
- ch += 'a' - 'A';
- else if (enc_is_single_byte && IS_HIGHBIT_SET(ch) && isupper(ch))
- ch = tolower(ch);
- result[i] = (char) ch;
- }
- result[i] = '\0';
+ needed = pg_strfold_ident(result, dstsize, ident, len);
+ Assert(needed + 1 == dstsize);
+ Assert(needed == len);
+ Assert(result[len] == '\0');
- if (i >= NAMEDATALEN && truncate)
- truncate_identifier(result, i, warn);
+ if (len >= NAMEDATALEN && truncate)
+ truncate_identifier(result, len, warn);
return result;
}
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index c4e89502f85..9167018c85b 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -1352,6 +1352,38 @@ pg_strfold(char *dst, size_t dstsize, const char *src, ssize_t srclen,
return locale->ctype->strfold(dst, dstsize, src, srclen, locale);
}
+/*
+ * Fold an identifier using the database default locale.
+ *
+ * For historical reasons, does not use ordinary locale behavior. Should only
+ * be used for identifier folding. XXX: can we make this equivalent to
+ * pg_strfold(..., default_locale)?
+ */
+size_t
+pg_strfold_ident(char *dest, size_t destsize, const char *src, ssize_t srclen)
+{
+ if (default_locale == NULL || default_locale->ctype == NULL)
+ {
+ int i;
+
+ for (i = 0; i < srclen && i < destsize; i++)
+ {
+ unsigned char ch = (unsigned char) src[i];
+
+ if (ch >= 'A' && ch <= 'Z')
+ ch += 'a' - 'A';
+ dest[i] = (char) ch;
+ }
+
+ if (i < destsize)
+ dest[i] = '\0';
+
+ return srclen;
+ }
+ return default_locale->ctype->strfold_ident(dest, destsize, src, srclen,
+ default_locale);
+}
+
/*
* pg_strcoll
*
diff --git a/src/backend/utils/adt/pg_locale_builtin.c b/src/backend/utils/adt/pg_locale_builtin.c
index 0c2920112bb..659e588d513 100644
--- a/src/backend/utils/adt/pg_locale_builtin.c
+++ b/src/backend/utils/adt/pg_locale_builtin.c
@@ -125,6 +125,29 @@ strfold_builtin(char *dest, size_t destsize, const char *src, ssize_t srclen,
locale->builtin.casemap_full);
}
+static size_t
+strfold_ident_builtin(char *dst, size_t dstsize, const char *src,
+ ssize_t srclen, pg_locale_t locale)
+{
+ int i;
+
+ Assert(GetDatabaseEncoding() == PG_UTF8);
+
+ for (i = 0; i < srclen && i < dstsize; i++)
+ {
+ unsigned char ch = (unsigned char) src[i];
+
+ if (ch >= 'A' && ch <= 'Z')
+ ch += 'a' - 'A';
+ dst[i] = (char) ch;
+ }
+
+ if (i < dstsize)
+ dst[i] = '\0';
+
+ return srclen;
+}
+
static bool
wc_isdigit_builtin(pg_wchar wc, pg_locale_t locale)
{
@@ -208,6 +231,7 @@ static const struct ctype_methods ctype_methods_builtin = {
.strtitle = strtitle_builtin,
.strupper = strupper_builtin,
.strfold = strfold_builtin,
+ .strfold_ident = strfold_ident_builtin,
.wc_isdigit = wc_isdigit_builtin,
.wc_isalpha = wc_isalpha_builtin,
.wc_isalnum = wc_isalnum_builtin,
diff --git a/src/backend/utils/adt/pg_locale_icu.c b/src/backend/utils/adt/pg_locale_icu.c
index 18d026deda8..39b153a4262 100644
--- a/src/backend/utils/adt/pg_locale_icu.c
+++ b/src/backend/utils/adt/pg_locale_icu.c
@@ -61,6 +61,8 @@ static size_t strupper_icu(char *dest, size_t destsize, const char *src,
ssize_t srclen, pg_locale_t locale);
static size_t strfold_icu(char *dest, size_t destsize, const char *src,
ssize_t srclen, pg_locale_t locale);
+static size_t strfold_ident_icu(char *dst, size_t dstsize, const char *src,
+ ssize_t srclen, pg_locale_t locale);
static int strncoll_icu(const char *arg1, ssize_t len1,
const char *arg2, ssize_t len2,
pg_locale_t locale);
@@ -123,7 +125,7 @@ static int32_t u_strFoldCase_default(UChar *dest, int32_t destCapacity,
/*
* XXX: many of the functions below rely on casts directly from pg_wchar to
- * UChar32, which is correct for the UTF-8 encoding, but not in general.
+ * UChar32, which is correct for UTF-8 and LATIN1, but not in general.
*/
static pg_wchar
@@ -227,6 +229,7 @@ static const struct ctype_methods ctype_methods_icu = {
.strtitle = strtitle_icu,
.strupper = strupper_icu,
.strfold = strfold_icu,
+ .strfold_ident = strfold_ident_icu,
.wc_isdigit = wc_isdigit_icu,
.wc_isalpha = wc_isalpha_icu,
.wc_isalnum = wc_isalnum_icu,
@@ -564,6 +567,37 @@ strfold_icu(char *dest, size_t destsize, const char *src, ssize_t srclen,
return result_len;
}
+/*
+ * For historical compatibility, behavior is not multibyte-aware.
+ *
+ * NB: uses libc tolower() for single-byte encodings (also for historical
+ * compatibility), and therefore relies on the global LC_CTYPE setting.
+ */
+static size_t
+strfold_ident_icu(char *dst, size_t dstsize, const char *src,
+ ssize_t srclen, pg_locale_t locale)
+{
+ int i;
+ bool enc_is_single_byte;
+
+ enc_is_single_byte = pg_database_encoding_max_length() == 1;
+ for (i = 0; i < srclen && i < dstsize; i++)
+ {
+ unsigned char ch = (unsigned char) src[i];
+
+ if (ch >= 'A' && ch <= 'Z')
+ ch += 'a' - 'A';
+ else if (enc_is_single_byte && IS_HIGHBIT_SET(ch) && isupper(ch))
+ ch = tolower(ch);
+ dst[i] = (char) ch;
+ }
+
+ if (i < dstsize)
+ dst[i] = '\0';
+
+ return srclen;
+}
+
/*
* strncoll_icu_utf8
*
diff --git a/src/backend/utils/adt/pg_locale_libc.c b/src/backend/utils/adt/pg_locale_libc.c
index 4cb3c64b4a6..85c7885a8ae 100644
--- a/src/backend/utils/adt/pg_locale_libc.c
+++ b/src/backend/utils/adt/pg_locale_libc.c
@@ -318,12 +318,65 @@ tolower_libc_mb(pg_wchar wc, pg_locale_t locale)
return wc;
}
+/*
+ * Characters A..Z always fold to a..z, even in the Turkish locale. Characters
+ * beyond 127 use tolower().
+ */
+static size_t
+strfold_ident_libc_sb(char *dst, size_t dstsize, const char *src,
+ ssize_t srclen, pg_locale_t locale)
+{
+ locale_t loc = locale->lt;
+ int i;
+
+ for (i = 0; i < srclen && i < dstsize; i++)
+ {
+ unsigned char ch = (unsigned char) src[i];
+
+ if (ch >= 'A' && ch <= 'Z')
+ ch += 'a' - 'A';
+ else if (IS_HIGHBIT_SET(ch) && isupper_l(ch, loc))
+ ch = tolower_l(ch, loc);
+ dst[i] = (char) ch;
+ }
+
+ if (i < dstsize)
+ dst[i] = '\0';
+
+ return srclen;
+}
+
+/*
+ * For historical reasons, not multibyte-aware; uses plain ASCII semantics.
+ */
+static size_t
+strfold_ident_libc_mb(char *dst, size_t dstsize, const char *src,
+ ssize_t srclen, pg_locale_t locale)
+{
+ int i;
+
+ for (i = 0; i < srclen && i < dstsize; i++)
+ {
+ unsigned char ch = (unsigned char) src[i];
+
+ if (ch >= 'A' && ch <= 'Z')
+ ch += 'a' - 'A';
+ dst[i] = (char) ch;
+ }
+
+ if (i < dstsize)
+ dst[i] = '\0';
+
+ return srclen;
+}
+
static const struct ctype_methods ctype_methods_libc_sb = {
.strlower = strlower_libc_sb,
.strtitle = strtitle_libc_sb,
.strupper = strupper_libc_sb,
/* in libc, casefolding is the same as lowercasing */
.strfold = strlower_libc_sb,
+ .strfold_ident = strfold_ident_libc_sb,
.wc_isdigit = wc_isdigit_libc_sb,
.wc_isalpha = wc_isalpha_libc_sb,
.wc_isalnum = wc_isalnum_libc_sb,
@@ -349,6 +402,7 @@ static const struct ctype_methods ctype_methods_libc_other_mb = {
.strupper = strupper_libc_mb,
/* in libc, casefolding is the same as lowercasing */
.strfold = strlower_libc_mb,
+ .strfold_ident = strfold_ident_libc_mb,
.wc_isdigit = wc_isdigit_libc_sb,
.wc_isalpha = wc_isalpha_libc_sb,
.wc_isalnum = wc_isalnum_libc_sb,
@@ -370,6 +424,7 @@ static const struct ctype_methods ctype_methods_libc_utf8 = {
.strupper = strupper_libc_mb,
/* in libc, casefolding is the same as lowercasing */
.strfold = strlower_libc_mb,
+ .strfold_ident = strfold_ident_libc_mb,
.wc_isdigit = wc_isdigit_libc_mb,
.wc_isalpha = wc_isalpha_libc_mb,
.wc_isalnum = wc_isalnum_libc_mb,
diff --git a/src/include/utils/pg_locale.h b/src/include/utils/pg_locale.h
index 01f891def7a..53574d2ef85 100644
--- a/src/include/utils/pg_locale.h
+++ b/src/include/utils/pg_locale.h
@@ -110,6 +110,9 @@ struct ctype_methods
size_t (*strfold) (char *dest, size_t destsize,
const char *src, ssize_t srclen,
pg_locale_t locale);
+ size_t (*strfold_ident) (char *dest, size_t destsize,
+ const char *src, ssize_t srclen,
+ pg_locale_t locale);
/* required */
bool (*wc_isdigit) (pg_wchar wc, pg_locale_t locale);
@@ -188,6 +191,8 @@ extern size_t pg_strupper(char *dst, size_t dstsize,
extern size_t pg_strfold(char *dst, size_t dstsize,
const char *src, ssize_t srclen,
pg_locale_t locale);
+extern size_t pg_strfold_ident(char *dst, size_t dstsize,
+ const char *src, ssize_t srclen);
extern int pg_strcoll(const char *arg1, const char *arg2, pg_locale_t locale);
extern int pg_strncoll(const char *arg1, ssize_t len1,
const char *arg2, ssize_t len2, pg_locale_t locale);
--
2.43.0
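
As a rough illustration of the calling convention shared by the
strfold_ident methods above, here is a standalone sketch of the ASCII-only
variant. sketch_fold_ident_ascii() is a hypothetical name, and the sketch
takes size_t for the source length instead of ssize_t to stay portable. It
always returns the source length, copies at most dstsize bytes, and writes
the terminating NUL only if there is room, so a caller can detect
truncation by comparing the return value with the buffer size.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Fold A..Z to a..z, leaving all other bytes untouched. Returns srclen, the
 * number of bytes needed excluding the terminator. The result is
 * NUL-terminated only when dstsize > srclen.
 */
static size_t
sketch_fold_ident_ascii(char *dst, size_t dstsize,
						const char *src, size_t srclen)
{
	size_t		i;

	assert(dst != NULL && src != NULL);

	for (i = 0; i < srclen && i < dstsize; i++)
	{
		unsigned char ch = (unsigned char) src[i];

		if (ch >= 'A' && ch <= 'Z')
			ch += 'a' - 'A';
		dst[i] = (char) ch;
	}

	if (i < dstsize)
		dst[i] = '\0';

	return srclen;
}
```

A caller that allocates srclen + 1 bytes, as downcase_identifier() does in
the patch, can expect the return value plus one to equal the buffer size.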
From d7970e1db1b3185c509be22839857ecc4c2a140e Mon Sep 17 00:00:00 2001
From: Jeff Davis <[email protected]>
Date: Mon, 24 Nov 2025 14:00:52 -0800
Subject: [PATCH v11 9/9] Control LC_COLLATE with GUC.
Now that the global LC_COLLATE setting is not used for any in-core
purpose at all (see commit 5e6e42e44f), allow it to be set with a
GUC. This may be useful for extensions or procedural languages that
still depend on the global LC_COLLATE setting.
Discussion: https://postgr.es/m/[email protected]
---
src/backend/utils/adt/pg_locale.c | 59 +++++++++++++++++++
src/backend/utils/init/postinit.c | 2 +
src/backend/utils/misc/guc_parameters.dat | 9 +++
src/backend/utils/misc/postgresql.conf.sample | 2 +
src/bin/initdb/initdb.c | 3 +
src/include/utils/guc_hooks.h | 2 +
src/include/utils/pg_locale.h | 1 +
7 files changed, 78 insertions(+)
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 9167018c85b..91e7eba2eac 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -81,6 +81,7 @@ extern pg_locale_t create_pg_locale_libc(Oid collid, MemoryContext context);
extern char *get_collation_actual_version_libc(const char *collcollate);
/* GUC settings */
+char *locale_collate;
char *locale_messages;
char *locale_monetary;
char *locale_numeric;
@@ -369,6 +370,64 @@ assign_locale_time(const char *newval, void *extra)
CurrentLCTimeValid = false;
}
+/*
+ * We allow LC_COLLATE to actually be set globally.
+ *
+ * Note: we normally disallow value = "" because it wouldn't have consistent
+ * semantics (it'd effectively just use the previous value). However, this
+ * is the value passed for PGC_S_DEFAULT, so don't complain in that case,
+ * not even if the attempted setting fails due to invalid environment value.
+ * The idea there is just to accept the environment setting *if possible*
+ * during startup, until we can read the proper value from postgresql.conf.
+ */
+bool
+check_locale_collate(char **newval, void **extra, GucSource source)
+{
+ int locale_enc;
+ int db_enc;
+
+ if (**newval == '\0')
+ {
+ if (source == PGC_S_DEFAULT)
+ return true;
+ else
+ return false;
+ }
+
+ locale_enc = pg_get_encoding_from_locale(*newval, true);
+ db_enc = GetDatabaseEncoding();
+
+ if (!(locale_enc == db_enc ||
+ locale_enc == PG_SQL_ASCII ||
+ db_enc == PG_SQL_ASCII ||
+ locale_enc == -1))
+ {
+ if (source == PGC_S_FILE)
+ {
+ guc_free(*newval);
+ *newval = guc_strdup(LOG, "C");
+ if (!*newval)
+ return false;
+ }
+ else if (source != PGC_S_TEST)
+ {
+ ereport(WARNING,
+ (errmsg("encoding mismatch"),
+ errdetail("Locale \"%s\" uses encoding \"%s\", which does not match database encoding \"%s\".",
+ *newval, pg_encoding_to_char(locale_enc), pg_encoding_to_char(db_enc))));
+ return false;
+ }
+ }
+
+ return check_locale(LC_COLLATE, *newval, NULL);
+}
+
+void
+assign_locale_collate(const char *newval, void *extra)
+{
+ (void) pg_perm_setlocale(LC_COLLATE, newval);
+}
+
/*
* We allow LC_MESSAGES to actually be set globally.
*
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 98f9598cd78..c99d57eba48 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -404,6 +404,8 @@ CheckMyDatabase(const char *name, bool am_superuser, bool override_allow_connect
* the pg_database tuple.
*/
SetDatabaseEncoding(dbform->encoding);
+ /* Reset lc_collate to check encoding, and fall back to C if necessary */
+ SetConfigOption("lc_collate", locale_collate, PGC_POSTMASTER, PGC_S_FILE);
/* Record it as a GUC internal option, too */
SetConfigOption("server_encoding", GetDatabaseEncodingName(),
PGC_INTERNAL, PGC_S_DYNAMIC_DEFAULT);
diff --git a/src/backend/utils/misc/guc_parameters.dat b/src/backend/utils/misc/guc_parameters.dat
index 3b9d8349078..a36c680719f 100644
--- a/src/backend/utils/misc/guc_parameters.dat
+++ b/src/backend/utils/misc/guc_parameters.dat
@@ -1457,6 +1457,15 @@
boot_val => 'PG_KRB_SRVTAB',
},
+{ name => 'lc_collate', type => 'string', context => 'PGC_SUSET', group => 'CLIENT_CONN_LOCALE',
+ short_desc => 'Sets the locale for text ordering in extensions.',
+ long_desc => 'An empty string means use the operating system setting.',
+ variable => 'locale_collate',
+ boot_val => '""',
+ check_hook => 'check_locale_collate',
+ assign_hook => 'assign_locale_collate',
+},
+
{ name => 'lc_messages', type => 'string', context => 'PGC_SUSET', group => 'CLIENT_CONN_LOCALE',
short_desc => 'Sets the language in which messages are displayed.',
long_desc => 'An empty string means use the operating system setting.',
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index dc9e2255f8a..19332e39e82 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -798,6 +798,8 @@
# encoding
# These settings are initialized by initdb, but they can be changed.
+#lc_collate = '' # locale for text ordering (only affects
+ # extensions)
#lc_messages = '' # locale for system error message
# strings
#lc_monetary = 'C' # locale for monetary formatting
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 92fe2f531f7..8b2e7bfab6f 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -1312,6 +1312,9 @@ setup_config(void)
conflines = replace_guc_value(conflines, "shared_buffers",
repltok, false);
+ conflines = replace_guc_value(conflines, "lc_collate",
+ lc_collate, false);
+
conflines = replace_guc_value(conflines, "lc_messages",
lc_messages, false);
diff --git a/src/include/utils/guc_hooks.h b/src/include/utils/guc_hooks.h
index 82ac8646a8d..8a20f76eec8 100644
--- a/src/include/utils/guc_hooks.h
+++ b/src/include/utils/guc_hooks.h
@@ -65,6 +65,8 @@ extern bool check_huge_page_size(int *newval, void **extra, GucSource source);
extern void assign_io_method(int newval, void *extra);
extern bool check_io_max_concurrency(int *newval, void **extra, GucSource source);
extern const char *show_in_hot_standby(void);
+extern bool check_locale_collate(char **newval, void **extra, GucSource source);
+extern void assign_locale_collate(const char *newval, void *extra);
extern bool check_locale_messages(char **newval, void **extra, GucSource source);
extern void assign_locale_messages(const char *newval, void *extra);
extern bool check_locale_monetary(char **newval, void **extra, GucSource source);
diff --git a/src/include/utils/pg_locale.h b/src/include/utils/pg_locale.h
index 53574d2ef85..276be4c1fef 100644
--- a/src/include/utils/pg_locale.h
+++ b/src/include/utils/pg_locale.h
@@ -41,6 +41,7 @@
#define UNICODE_CASEMAP_BUFSZ (UNICODE_CASEMAP_LEN * sizeof(char32_t))
/* GUC settings */
+extern PGDLLIMPORT char *locale_collate;
extern PGDLLIMPORT char *locale_messages;
extern PGDLLIMPORT char *locale_monetary;
extern PGDLLIMPORT char *locale_numeric;
--
2.43.0
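
The encoding test in check_locale_collate() can be read as a small
predicate. The sketch below restates that condition in isolation; the
enum values and the name sketch_locale_encoding_ok() are made up for this
note and do not correspond to the real pg_enc numbering.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical encoding IDs, for illustration only. */
enum sketch_encoding
{
	SKETCH_ENC_UNKNOWN = -1,	/* locale's encoding could not be determined */
	SKETCH_SQL_ASCII = 0,
	SKETCH_UTF8,
	SKETCH_LATIN1
};

/*
 * A locale is acceptable if its encoding matches the database encoding, if
 * either side is SQL_ASCII, or if the locale's encoding is unknown.
 */
static bool
sketch_locale_encoding_ok(int locale_enc, int db_enc)
{
	assert(db_enc >= 0);

	return locale_enc == db_enc ||
		locale_enc == SKETCH_SQL_ASCII ||
		db_enc == SKETCH_SQL_ASCII ||
		locale_enc == SKETCH_ENC_UNKNOWN;
}
```

When this predicate is false, the patch falls back to "C" for
PGC_S_FILE and otherwise, except for PGC_S_TEST, warns and rejects the
value.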