On Thu, 2025-11-20 at 16:58 -0800, Jeff Davis wrote:
> On Wed, 2025-11-12 at 19:59 +0100, Peter Eisentraut wrote:
> > Many of these issues are pre-existing, but I just figured it has
> > reached
> > a point where we need to do something about it.
>
> I tried to simplify things in this patch series, assuming that we
> have
> some tolerance for small behavior changes.
>
> 0001: No behavior change here, same patch as before. Uncontroversial
> simplification, so I plan to commit this soon.
Committed.
New series attached, which I tried to put in an order that would be
reasonable for commit.
0001-0004: Pure refactoring patches. I intend to commit a couple of
these soon.
0005: No behavioral change, and not much change at all. Computes the
"max_chr" for regexes (a performance optimization for low codepoints)
more consistently and simply based on the encoding.
0006: fixes longstanding ltree bug due to inconsistency between the
database locale and the global LC_CTYPE setting when using a non-libc
provider. The end result is also cleaner: use the database locale
consistently, like tsearch. I don't intend to backport this, unless
someone thinks it should be, but it should come with a release note to
reindex ltree indexes if using a non-libc provider.
0007: remove the char_tolower() API completely. We'd lose a pattern
matching optimization for single-byte encodings with libc and a non-C
locale, but it's a significant simplification. We could go even further
and change this to use casefolding rather than lower(), but that seems
like a separate change.
0008: Multibyte-aware extraction of pattern prefixes. The previous code
gave up on any byte that it didn't understand, which made prefixes
unnecessarily short. This patch is also cleaner.
0009: Changes fuzzystrmatch to use pg_ascii_toupper(). Most functions
in the extension are unaffected, but soundex() can be affected, and I'm
not sure what exactly it's supposed to do with non-ASCII.
0010: For downcase_identifier(), use a new provider-specific
pg_strfold_ident() method. The ICU version of this method is a work-in-
progress, because right now it depends on libc. I suppose it should
decode to UTF-32, then go through u_tolower(), then re-encode -- but
can the re-encoding fail? In any case, it would be a behavior change
for identifier casefolding with ICU and a single-byte encoding, which
is probably OK but the risk is non-zero.
0011: POC patch to introduce lc_collate GUC. It would only affect
extensions, PLs, libraries, or other non-core code that happens to call
strcoll() or strxfrm(). This would address Daniel's complaint, but it's
more flexible. And by being a GUC, it's clear that we shouldn't depend
on it for any stored data. We can do something similar for LC_CTYPE
after we eliminate dependencies in core code.
Regards,
Jeff Davis
From 6e434d1f13f50654a89d19528b8f498c6cd10cca Mon Sep 17 00:00:00 2001
From: Jeff Davis <[email protected]>
Date: Thu, 20 Nov 2025 13:09:26 -0800
Subject: [PATCH v9 01/11] Inline pg_ascii_tolower() and pg_ascii_toupper().
---
src/include/port.h | 25 +++++++++++++++++++++++--
src/port/pgstrcasecmp.c | 26 --------------------------
2 files changed, 23 insertions(+), 28 deletions(-)
diff --git a/src/include/port.h b/src/include/port.h
index 3964d3b1293..159c2bcd7e3 100644
--- a/src/include/port.h
+++ b/src/include/port.h
@@ -169,8 +169,29 @@ extern int pg_strcasecmp(const char *s1, const char *s2);
extern int pg_strncasecmp(const char *s1, const char *s2, size_t n);
extern unsigned char pg_toupper(unsigned char ch);
extern unsigned char pg_tolower(unsigned char ch);
-extern unsigned char pg_ascii_toupper(unsigned char ch);
-extern unsigned char pg_ascii_tolower(unsigned char ch);
+
+/*
+ * Fold a character to upper case, following C/POSIX locale rules.
+ */
+static inline unsigned char
+pg_ascii_toupper(unsigned char ch)
+{
+ if (ch >= 'a' && ch <= 'z')
+ ch += 'A' - 'a';
+ return ch;
+}
+
+/*
+ * Fold a character to lower case, following C/POSIX locale rules.
+ */
+static inline unsigned char
+pg_ascii_tolower(unsigned char ch)
+{
+ if (ch >= 'A' && ch <= 'Z')
+ ch += 'a' - 'A';
+ return ch;
+}
+
/*
* Beginning in v12, we always replace snprintf() and friends with our own
diff --git a/src/port/pgstrcasecmp.c b/src/port/pgstrcasecmp.c
index ec2b3a75c3d..17e93180381 100644
--- a/src/port/pgstrcasecmp.c
+++ b/src/port/pgstrcasecmp.c
@@ -13,10 +13,6 @@
*
* NB: this code should match downcase_truncate_identifier() in scansup.c.
*
- * We also provide strict ASCII-only case conversion functions, which can
- * be used to implement C/POSIX case folding semantics no matter what the
- * C library thinks the locale is.
- *
*
* Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
*
@@ -127,25 +123,3 @@ pg_tolower(unsigned char ch)
ch = tolower(ch);
return ch;
}
-
-/*
- * Fold a character to upper case, following C/POSIX locale rules.
- */
-unsigned char
-pg_ascii_toupper(unsigned char ch)
-{
- if (ch >= 'a' && ch <= 'z')
- ch += 'A' - 'a';
- return ch;
-}
-
-/*
- * Fold a character to lower case, following C/POSIX locale rules.
- */
-unsigned char
-pg_ascii_tolower(unsigned char ch)
-{
- if (ch >= 'A' && ch <= 'Z')
- ch += 'a' - 'A';
- return ch;
-}
--
2.43.0
From 5538939cae6210a4ee702253b0287e44993b98b4 Mon Sep 17 00:00:00 2001
From: Jeff Davis <[email protected]>
Date: Wed, 19 Nov 2025 10:11:52 -0800
Subject: [PATCH v9 02/11] Add #define for UNICODE_CASEMAP_BUFSZ.
Useful for mapping a single codepoint at a time into a
statically-allocated buffer.
---
src/include/utils/pg_locale.h | 11 +++++++++++
1 file changed, 11 insertions(+)
diff --git a/src/include/utils/pg_locale.h b/src/include/utils/pg_locale.h
index 683e1a0eef8..49fd22bf8eb 100644
--- a/src/include/utils/pg_locale.h
+++ b/src/include/utils/pg_locale.h
@@ -26,6 +26,17 @@
/* use for libc locale names */
#define LOCALE_NAME_BUFLEN 128
+/*
+ * Maximum number of bytes needed to map a single codepoint. Useful for
+ * mapping and processing a single input codepoint at a time with a
+ * statically-allocated buffer.
+ *
+ * With full case mapping, an input codepoint may be mapped to as many as
+ * three output codepoints. See Unicode 5.18.2, "Change in Length".
+ */
+#define UNICODE_CASEMAP_LEN 3
+#define UNICODE_CASEMAP_BUFSZ (UNICODE_CASEMAP_LEN * sizeof(char32_t))
+
/* GUC settings */
extern PGDLLIMPORT char *locale_messages;
extern PGDLLIMPORT char *locale_monetary;
--
2.43.0
From 1bd58b5aaf092257fa09a456fdb859328548afeb Mon Sep 17 00:00:00 2001
From: Jeff Davis <[email protected]>
Date: Mon, 24 Nov 2025 09:09:06 -0800
Subject: [PATCH v9 03/11] Change some callers to use pg_ascii_toupper().
The input is ASCII anyway, so it's better to be clear that it's not
locale-dependent.
---
src/backend/access/transam/xlogfuncs.c | 2 +-
src/backend/utils/adt/cash.c | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/src/backend/access/transam/xlogfuncs.c b/src/backend/access/transam/xlogfuncs.c
index 3e45fce43ed..a50345f9bf7 100644
--- a/src/backend/access/transam/xlogfuncs.c
+++ b/src/backend/access/transam/xlogfuncs.c
@@ -479,7 +479,7 @@ pg_split_walfile_name(PG_FUNCTION_ARGS)
/* Capitalize WAL file name. */
for (p = fname_upper; *p; p++)
- *p = pg_toupper((unsigned char) *p);
+ *p = pg_ascii_toupper((unsigned char) *p);
if (!IsXLogFileName(fname_upper))
ereport(ERROR,
diff --git a/src/backend/utils/adt/cash.c b/src/backend/utils/adt/cash.c
index 611d23f3cb0..623f6eec056 100644
--- a/src/backend/utils/adt/cash.c
+++ b/src/backend/utils/adt/cash.c
@@ -1035,7 +1035,7 @@ cash_words(PG_FUNCTION_ARGS)
appendStringInfoString(&buf, m0 == 1 ? " cent" : " cents");
/* capitalize output */
- buf.data[0] = pg_toupper((unsigned char) buf.data[0]);
+ buf.data[0] = pg_ascii_toupper((unsigned char) buf.data[0]);
/* return as text datum */
res = cstring_to_text_with_len(buf.data, buf.len);
--
2.43.0
From 22419241e495c163d40e893d818741db7b1f3c78 Mon Sep 17 00:00:00 2001
From: Jeff Davis <[email protected]>
Date: Fri, 7 Nov 2025 12:11:34 -0800
Subject: [PATCH v9 04/11] Allow pg_locale_t APIs to work when ctype_is_c.
Previously, the caller needed to check ctype_is_c first for some
routines and not others. Now, the APIs consistently work, and the
caller can just check ctype_is_c for optimization purposes.
---
src/backend/utils/adt/like_support.c | 34 ++++----------
src/backend/utils/adt/pg_locale.c | 63 ++++++++++++++++++++++++--
src/backend/utils/adt/pg_locale_libc.c | 3 ++
3 files changed, 72 insertions(+), 28 deletions(-)
diff --git a/src/backend/utils/adt/like_support.c b/src/backend/utils/adt/like_support.c
index 999f23f86d5..0debccfa67b 100644
--- a/src/backend/utils/adt/like_support.c
+++ b/src/backend/utils/adt/like_support.c
@@ -99,8 +99,6 @@ static Selectivity like_selectivity(const char *patt, int pattlen,
static Selectivity regex_selectivity(const char *patt, int pattlen,
bool case_insensitive,
int fixed_prefix_len);
-static int pattern_char_isalpha(char c, bool is_multibyte,
- pg_locale_t locale);
static Const *make_greater_string(const Const *str_const, FmgrInfo *ltproc,
Oid collation);
static Datum string_to_datum(const char *str, Oid datatype);
@@ -995,7 +993,6 @@ like_fixed_prefix(Const *patt_const, bool case_insensitive, Oid collation,
Oid typeid = patt_const->consttype;
int pos,
match_pos;
- bool is_multibyte = (pg_database_encoding_max_length() > 1);
pg_locale_t locale = 0;
/* the right-hand const is type text or bytea */
@@ -1055,9 +1052,16 @@ like_fixed_prefix(Const *patt_const, bool case_insensitive, Oid collation,
break;
}
- /* Stop if case-varying character (it's sort of a wildcard) */
- if (case_insensitive &&
- pattern_char_isalpha(patt[pos], is_multibyte, locale))
+ /*
+ * Stop if case-varying character (it's sort of a wildcard).
+ *
+ * In multibyte character sets or with non-libc providers, we can't
+ * use isalpha, and it does not seem worth trying to convert to
+ * wchar_t or char32_t. Instead, just pass the single byte to the
+ * provider, which will assume any non-ASCII char is potentially
+ * case-varying.
+ */
+ if (case_insensitive && char_is_cased(patt[pos], locale))
break;
match[match_pos++] = patt[pos];
@@ -1481,24 +1485,6 @@ regex_selectivity(const char *patt, int pattlen, bool case_insensitive,
return sel;
}
-/*
- * Check whether char is a letter (and, hence, subject to case-folding)
- *
- * In multibyte character sets or with ICU, we can't use isalpha, and it does
- * not seem worth trying to convert to wchar_t to use iswalpha or u_isalpha.
- * Instead, just assume any non-ASCII char is potentially case-varying, and
- * hard-wire knowledge of which ASCII chars are letters.
- */
-static int
-pattern_char_isalpha(char c, bool is_multibyte,
- pg_locale_t locale)
-{
- if (locale->ctype_is_c)
- return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
- else
- return char_is_cased(c, locale);
-}
-
/*
* For bytea, the increment function need only increment the current byte
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index b14c7837938..9319fb633b6 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -1261,6 +1261,17 @@ size_t
pg_strlower(char *dst, size_t dstsize, const char *src, ssize_t srclen,
pg_locale_t locale)
{
+ if (locale->ctype == NULL)
+ {
+ int i;
+
+ srclen = (srclen >= 0) ? srclen : strlen(src);
+ for (i = 0; i < srclen && i < dstsize; i++)
+ dst[i] = pg_ascii_tolower(src[i]);
+ if (i < dstsize)
+ dst[i] = '\0';
+ return srclen;
+ }
return locale->ctype->strlower(dst, dstsize, src, srclen, locale);
}
@@ -1268,6 +1279,29 @@ size_t
pg_strtitle(char *dst, size_t dstsize, const char *src, ssize_t srclen,
pg_locale_t locale)
{
+ if (locale->ctype == NULL)
+ {
+ bool wasalnum = false;
+ int i;
+
+ srclen = (srclen >= 0) ? srclen : strlen(src);
+ for (i = 0; i < Min(srclen, dstsize); i++)
+ {
+ char c = src[i];
+
+ if (wasalnum)
+ dst[i] = pg_ascii_tolower(c);
+ else
+ dst[i] = pg_ascii_toupper(c);
+
+ wasalnum = ((c >= '0' && c <= '9') ||
+ (c >= 'A' && c <= 'Z') ||
+ (c >= 'a' && c <= 'z'));
+ }
+ if (i < dstsize)
+ dst[i] = '\0';
+ return srclen;
+ }
return locale->ctype->strtitle(dst, dstsize, src, srclen, locale);
}
@@ -1275,6 +1309,17 @@ size_t
pg_strupper(char *dst, size_t dstsize, const char *src, ssize_t srclen,
pg_locale_t locale)
{
+ if (locale->ctype == NULL)
+ {
+ int i;
+
+ srclen = (srclen >= 0) ? srclen : strlen(src);
+ for (i = 0; i < srclen && i < dstsize; i++)
+ dst[i] = pg_ascii_toupper(src[i]);
+ if (i < dstsize)
+ dst[i] = '\0';
+ return srclen;
+ }
return locale->ctype->strupper(dst, dstsize, src, srclen, locale);
}
@@ -1282,10 +1327,18 @@ size_t
pg_strfold(char *dst, size_t dstsize, const char *src, ssize_t srclen,
pg_locale_t locale)
{
- if (locale->ctype->strfold)
- return locale->ctype->strfold(dst, dstsize, src, srclen, locale);
- else
- return locale->ctype->strlower(dst, dstsize, src, srclen, locale);
+ if (locale->ctype == NULL)
+ {
+ int i;
+
+ srclen = (srclen >= 0) ? srclen : strlen(src);
+ for (i = 0; i < srclen && i < dstsize; i++)
+ dst[i] = pg_ascii_tolower(src[i]);
+ if (i < dstsize)
+ dst[i] = '\0';
+ return srclen;
+ }
+ return locale->ctype->strfold(dst, dstsize, src, srclen, locale);
}
/*
@@ -1560,6 +1613,8 @@ pg_towlower(pg_wchar wc, pg_locale_t locale)
bool
char_is_cased(char ch, pg_locale_t locale)
{
+ if (locale->ctype == NULL)
+ return (ch >= 'A' && ch <= 'Z') || (ch >= 'a' && ch <= 'z');
return locale->ctype->char_is_cased(ch, locale);
}
diff --git a/src/backend/utils/adt/pg_locale_libc.c b/src/backend/utils/adt/pg_locale_libc.c
index 716f005066a..942454de4ed 100644
--- a/src/backend/utils/adt/pg_locale_libc.c
+++ b/src/backend/utils/adt/pg_locale_libc.c
@@ -326,6 +326,7 @@ static const struct ctype_methods ctype_methods_libc_sb = {
.strlower = strlower_libc_sb,
.strtitle = strtitle_libc_sb,
.strupper = strupper_libc_sb,
+ .strfold = strlower_libc_sb,
.wc_isdigit = wc_isdigit_libc_sb,
.wc_isalpha = wc_isalpha_libc_sb,
.wc_isalnum = wc_isalnum_libc_sb,
@@ -351,6 +352,7 @@ static const struct ctype_methods ctype_methods_libc_other_mb = {
.strlower = strlower_libc_mb,
.strtitle = strtitle_libc_mb,
.strupper = strupper_libc_mb,
+ .strfold = strlower_libc_mb,
.wc_isdigit = wc_isdigit_libc_sb,
.wc_isalpha = wc_isalpha_libc_sb,
.wc_isalnum = wc_isalnum_libc_sb,
@@ -372,6 +374,7 @@ static const struct ctype_methods ctype_methods_libc_utf8 = {
.strlower = strlower_libc_mb,
.strtitle = strtitle_libc_mb,
.strupper = strupper_libc_mb,
+ .strfold = strlower_libc_mb,
.wc_isdigit = wc_isdigit_libc_mb,
.wc_isalpha = wc_isalpha_libc_mb,
.wc_isalnum = wc_isalnum_libc_mb,
--
2.43.0
From b1add0b2b4c9785b56e4dc222a89ec8f43b9c586 Mon Sep 17 00:00:00 2001
From: Jeff Davis <[email protected]>
Date: Fri, 21 Nov 2025 12:41:47 -0800
Subject: [PATCH v9 05/11] Make regex "max_chr" depend on encoding, not
provider.
The previous per-provider "max_chr" field was there as a hack to
preserve the exact prior behavior, which depended on the
provider. Change to depend on the encoding, which makes more sense,
and remove the per-provider logic.
The only difference is for ICU: previously it always used
MAX_SIMPLE_CHR (0x7FF) regardless of the encoding; whereas now it will
match libc and use MAX_SIMPLE_CHR for UTF-8, and MAX_UCHAR for other
encodings. That's possibly a loss for non-UTF8 multibyte encodings,
but a win for single-byte encodings. Regardless, this distinction was
not worth the complexity.
---
src/backend/regex/regc_pg_locale.c | 18 ++++++++++--------
src/backend/utils/adt/pg_locale_libc.c | 2 --
src/include/utils/pg_locale.h | 6 ------
3 files changed, 10 insertions(+), 16 deletions(-)
diff --git a/src/backend/regex/regc_pg_locale.c b/src/backend/regex/regc_pg_locale.c
index 4698f110a0c..bb0e3f1d139 100644
--- a/src/backend/regex/regc_pg_locale.c
+++ b/src/backend/regex/regc_pg_locale.c
@@ -320,16 +320,18 @@ regc_ctype_get_cache(regc_wc_probefunc probefunc, int cclasscode)
max_chr = (pg_wchar) MAX_SIMPLE_CHR;
#endif
}
+ else if (GetDatabaseEncoding() == PG_UTF8)
+ {
+ max_chr = (pg_wchar) MAX_SIMPLE_CHR;
+ }
else
{
- if (pg_regex_locale->ctype->max_chr != 0 &&
- pg_regex_locale->ctype->max_chr <= MAX_SIMPLE_CHR)
- {
- max_chr = pg_regex_locale->ctype->max_chr;
- pcc->cv.cclasscode = -1;
- }
- else
- max_chr = (pg_wchar) MAX_SIMPLE_CHR;
+#if MAX_SIMPLE_CHR >= UCHAR_MAX
+ max_chr = (pg_wchar) UCHAR_MAX;
+ pcc->cv.cclasscode = -1;
+#else
+ max_chr = (pg_wchar) MAX_SIMPLE_CHR;
+#endif
}
/*
diff --git a/src/backend/utils/adt/pg_locale_libc.c b/src/backend/utils/adt/pg_locale_libc.c
index 942454de4ed..a55167b0697 100644
--- a/src/backend/utils/adt/pg_locale_libc.c
+++ b/src/backend/utils/adt/pg_locale_libc.c
@@ -341,7 +341,6 @@ static const struct ctype_methods ctype_methods_libc_sb = {
.char_tolower = char_tolower_libc,
.wc_toupper = toupper_libc_sb,
.wc_tolower = tolower_libc_sb,
- .max_chr = UCHAR_MAX,
};
/*
@@ -367,7 +366,6 @@ static const struct ctype_methods ctype_methods_libc_other_mb = {
.char_tolower = char_tolower_libc,
.wc_toupper = toupper_libc_sb,
.wc_tolower = tolower_libc_sb,
- .max_chr = UCHAR_MAX,
};
static const struct ctype_methods ctype_methods_libc_utf8 = {
diff --git a/src/include/utils/pg_locale.h b/src/include/utils/pg_locale.h
index 49fd22bf8eb..40e58cc52b8 100644
--- a/src/include/utils/pg_locale.h
+++ b/src/include/utils/pg_locale.h
@@ -131,12 +131,6 @@ struct ctype_methods
* pg_strlower().
*/
char (*char_tolower) (unsigned char ch, pg_locale_t locale);
-
- /*
- * For regex and pattern matching efficiency, the maximum char value
- * supported by the above methods. If zero, limit is set by regex code.
- */
- pg_wchar max_chr;
};
/*
--
2.43.0
From ce19c7193a3b94a8afa0890f99338d5f4bc1aebe Mon Sep 17 00:00:00 2001
From: Jeff Davis <[email protected]>
Date: Wed, 19 Nov 2025 10:20:36 -0800
Subject: [PATCH v9 06/11] Fix inconsistency between ltree_strncasecmp() and
ltree_crc32_sz().
Previously, ltree_strncasecmp() used lowercasing with the default
collation; while ltree_crc32_sz used tolower() directly. These were
equivalent only if the default collation provider was libc and the
encoding is single-byte.
Change both to use casefolding with the default collation.
---
contrib/ltree/crc32.c | 46 ++++++++++++++++++++++++++++++++-------
contrib/ltree/lquery_op.c | 31 ++++++++++++++++++++++++--
2 files changed, 67 insertions(+), 10 deletions(-)
diff --git a/contrib/ltree/crc32.c b/contrib/ltree/crc32.c
index 134f46a805e..3918d4a0ec2 100644
--- a/contrib/ltree/crc32.c
+++ b/contrib/ltree/crc32.c
@@ -10,31 +10,61 @@
#include "postgres.h"
#include "ltree.h"
+#include "crc32.h"
+#include "utils/pg_crc.h"
#ifdef LOWER_NODE
-#include <ctype.h>
-#define TOLOWER(x) tolower((unsigned char) (x))
-#else
-#define TOLOWER(x) (x)
+#include "utils/pg_locale.h"
#endif
-#include "crc32.h"
-#include "utils/pg_crc.h"
+#ifdef LOWER_NODE
unsigned int
ltree_crc32_sz(const char *buf, int size)
{
pg_crc32 crc;
const char *p = buf;
+ static pg_locale_t locale = NULL;
+
+ if (!locale)
+ locale = pg_database_locale();
INIT_TRADITIONAL_CRC32(crc);
while (size > 0)
{
- char c = (char) TOLOWER(*p);
+ char foldstr[UNICODE_CASEMAP_BUFSZ];
+ int srclen = pg_mblen(p);
+ size_t foldlen;
+
+ /* fold one codepoint at a time */
+ foldlen = pg_strfold(foldstr, UNICODE_CASEMAP_BUFSZ, p, srclen,
+ locale);
+
+ COMP_TRADITIONAL_CRC32(crc, foldstr, foldlen);
+
+ size -= srclen;
+ p += srclen;
+ }
+ FIN_TRADITIONAL_CRC32(crc);
+ return (unsigned int) crc;
+}
+
+#else
- COMP_TRADITIONAL_CRC32(crc, &c, 1);
+unsigned int
+ltree_crc32_sz(const char *buf, int size)
+{
+ pg_crc32 crc;
+ const char *p = buf;
+
+ INIT_TRADITIONAL_CRC32(crc);
+ while (size > 0)
+ {
+ COMP_TRADITIONAL_CRC32(crc, p, 1);
size--;
p++;
}
FIN_TRADITIONAL_CRC32(crc);
return (unsigned int) crc;
}
+
+#endif /* !LOWER_NODE */
diff --git a/contrib/ltree/lquery_op.c b/contrib/ltree/lquery_op.c
index a6466f575fd..d6754eb613f 100644
--- a/contrib/ltree/lquery_op.c
+++ b/contrib/ltree/lquery_op.c
@@ -77,10 +77,37 @@ compare_subnode(ltree_level *t, char *qn, int len, int (*cmpptr) (const char *,
int
ltree_strncasecmp(const char *a, const char *b, size_t s)
{
- char *al = str_tolower(a, s, DEFAULT_COLLATION_OID);
- char *bl = str_tolower(b, s, DEFAULT_COLLATION_OID);
+ static pg_locale_t locale = NULL;
+ size_t al_sz = s + 1;
+ char *al = palloc(al_sz);
+ size_t bl_sz = s + 1;
+ char *bl = palloc(bl_sz);
+ size_t needed;
int res;
+ if (!locale)
+ locale = pg_database_locale();
+
+ needed = pg_strfold(al, al_sz, a, s, locale);
+ if (needed + 1 > al_sz)
+ {
+ /* grow buffer if needed and retry */
+ al_sz = needed + 1;
+ al = repalloc(al, al_sz);
+ needed = pg_strfold(al, al_sz, a, s, locale);
+ Assert(needed + 1 <= al_sz);
+ }
+
+ needed = pg_strfold(bl, bl_sz, b, s, locale);
+ if (needed + 1 > bl_sz)
+ {
+ /* grow buffer if needed and retry */
+ bl_sz = needed + 1;
+ bl = repalloc(bl, bl_sz);
+ needed = pg_strfold(bl, bl_sz, b, s, locale);
+ Assert(needed + 1 <= bl_sz);
+ }
+
res = strncmp(al, bl, s);
pfree(al);
--
2.43.0
From 4403ac9fdaa0a66ca905d8376313db9c7250d98a Mon Sep 17 00:00:00 2001
From: Jeff Davis <[email protected]>
Date: Wed, 19 Nov 2025 18:16:41 -0800
Subject: [PATCH v9 07/11] Remove char_tolower() API.
It's only useful for an ILIKE optimization for the libc provider using
a single-byte encoding and a non-C locale, but it creates significant
internal complexity.
---
src/backend/utils/adt/like.c | 42 +++++++++-----------------
src/backend/utils/adt/like_match.c | 18 ++++++-----
src/backend/utils/adt/pg_locale.c | 22 --------------
src/backend/utils/adt/pg_locale_libc.c | 10 ------
src/include/utils/pg_locale.h | 9 ------
5 files changed, 25 insertions(+), 76 deletions(-)
diff --git a/src/backend/utils/adt/like.c b/src/backend/utils/adt/like.c
index 4216ac17f43..4a7fc583c71 100644
--- a/src/backend/utils/adt/like.c
+++ b/src/backend/utils/adt/like.c
@@ -43,8 +43,8 @@ static text *MB_do_like_escape(text *pat, text *esc);
static int UTF8_MatchText(const char *t, int tlen, const char *p, int plen,
pg_locale_t locale);
-static int SB_IMatchText(const char *t, int tlen, const char *p, int plen,
- pg_locale_t locale);
+static int C_IMatchText(const char *t, int tlen, const char *p, int plen,
+ pg_locale_t locale);
static int GenericMatchText(const char *s, int slen, const char *p, int plen, Oid collation);
static int Generic_Text_IC_like(text *str, text *pat, Oid collation);
@@ -84,22 +84,10 @@ wchareq(const char *p1, const char *p2)
* of getting a single character transformed to the system's wchar_t format.
* So now, we just downcase the strings using lower() and apply regular LIKE
* comparison. This should be revisited when we install better locale support.
- */
-
-/*
- * We do handle case-insensitive matching for single-byte encodings using
+ *
+ * We do handle case-insensitive matching for the C locale using
* fold-on-the-fly processing, however.
*/
-static char
-SB_lower_char(unsigned char c, pg_locale_t locale)
-{
- if (locale->ctype_is_c)
- return pg_ascii_tolower(c);
- else if (locale->is_default)
- return pg_tolower(c);
- else
- return char_tolower(c, locale);
-}
#define NextByte(p, plen) ((p)++, (plen)--)
@@ -131,9 +119,9 @@ SB_lower_char(unsigned char c, pg_locale_t locale)
#include "like_match.c"
/* setup to compile like_match.c for single byte case insensitive matches */
-#define MATCH_LOWER(t, locale) SB_lower_char((unsigned char) (t), locale)
+#define MATCH_LOWER
#define NextChar(p, plen) NextByte((p), (plen))
-#define MatchText SB_IMatchText
+#define MatchText C_IMatchText
#include "like_match.c"
@@ -202,22 +190,17 @@ Generic_Text_IC_like(text *str, text *pat, Oid collation)
errmsg("nondeterministic collations are not supported for ILIKE")));
/*
- * For efficiency reasons, in the single byte case we don't call lower()
- * on the pattern and text, but instead call SB_lower_char on each
- * character. In the multi-byte case we don't have much choice :-(. Also,
- * ICU does not support single-character case folding, so we go the long
- * way.
+ * For efficiency reasons, in the C locale we don't call lower() on the
+ * pattern and text, but instead call SB_lower_char on each character.
*/
- if (locale->ctype_is_c ||
- (char_tolower_enabled(locale) &&
- pg_database_encoding_max_length() == 1))
+ if (locale->ctype_is_c)
{
p = VARDATA_ANY(pat);
plen = VARSIZE_ANY_EXHDR(pat);
s = VARDATA_ANY(str);
slen = VARSIZE_ANY_EXHDR(str);
- return SB_IMatchText(s, slen, p, plen, locale);
+ return C_IMatchText(s, slen, p, plen, locale);
}
else
{
@@ -229,10 +212,13 @@ Generic_Text_IC_like(text *str, text *pat, Oid collation)
PointerGetDatum(str)));
s = VARDATA_ANY(str);
slen = VARSIZE_ANY_EXHDR(str);
+
if (GetDatabaseEncoding() == PG_UTF8)
return UTF8_MatchText(s, slen, p, plen, 0);
- else
+ else if (pg_database_encoding_max_length() > 1)
return MB_MatchText(s, slen, p, plen, 0);
+ else
+ return SB_MatchText(s, slen, p, plen, 0);
}
}
diff --git a/src/backend/utils/adt/like_match.c b/src/backend/utils/adt/like_match.c
index 892f8a745ea..54846c9541d 100644
--- a/src/backend/utils/adt/like_match.c
+++ b/src/backend/utils/adt/like_match.c
@@ -70,10 +70,14 @@
*--------------------
*/
+/*
+ * MATCH_LOWER is defined for ILIKE in the C locale as an optimization. Other
+ * locales must casefold the inputs before matching.
+ */
#ifdef MATCH_LOWER
-#define GETCHAR(t, locale) MATCH_LOWER(t, locale)
+#define GETCHAR(t) pg_ascii_tolower(t)
#else
-#define GETCHAR(t, locale) (t)
+#define GETCHAR(t) (t)
#endif
static int
@@ -105,7 +109,7 @@ MatchText(const char *t, int tlen, const char *p, int plen, pg_locale_t locale)
ereport(ERROR,
(errcode(ERRCODE_INVALID_ESCAPE_SEQUENCE),
errmsg("LIKE pattern must not end with escape character")));
- if (GETCHAR(*p, locale) != GETCHAR(*t, locale))
+ if (GETCHAR(*p) != GETCHAR(*t))
return LIKE_FALSE;
}
else if (*p == '%')
@@ -167,14 +171,14 @@ MatchText(const char *t, int tlen, const char *p, int plen, pg_locale_t locale)
ereport(ERROR,
(errcode(ERRCODE_INVALID_ESCAPE_SEQUENCE),
errmsg("LIKE pattern must not end with escape character")));
- firstpat = GETCHAR(p[1], locale);
+ firstpat = GETCHAR(p[1]);
}
else
- firstpat = GETCHAR(*p, locale);
+ firstpat = GETCHAR(*p);
while (tlen > 0)
{
- if (GETCHAR(*t, locale) == firstpat || (locale && !locale->deterministic))
+ if (GETCHAR(*t) == firstpat || (locale && !locale->deterministic))
{
int matched = MatchText(t, tlen, p, plen, locale);
@@ -342,7 +346,7 @@ MatchText(const char *t, int tlen, const char *p, int plen, pg_locale_t locale)
NextChar(t1, t1len);
}
}
- else if (GETCHAR(*p, locale) != GETCHAR(*t, locale))
+ else if (GETCHAR(*p) != GETCHAR(*t))
{
/* non-wildcard pattern char fails to match text char */
return LIKE_FALSE;
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 9319fb633b6..b3afa6cad6c 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -1618,28 +1618,6 @@ char_is_cased(char ch, pg_locale_t locale)
return locale->ctype->char_is_cased(ch, locale);
}
-/*
- * char_tolower_enabled()
- *
- * Does the provider support char_tolower()?
- */
-bool
-char_tolower_enabled(pg_locale_t locale)
-{
- return (locale->ctype->char_tolower != NULL);
-}
-
-/*
- * char_tolower()
- *
- * Convert char (single-byte encoding) to lowercase.
- */
-char
-char_tolower(unsigned char ch, pg_locale_t locale)
-{
- return locale->ctype->char_tolower(ch, locale);
-}
-
/*
* Return required encoding ID for the given locale, or -1 if any encoding is
* valid for the locale.
diff --git a/src/backend/utils/adt/pg_locale_libc.c b/src/backend/utils/adt/pg_locale_libc.c
index a55167b0697..feb63bbdad1 100644
--- a/src/backend/utils/adt/pg_locale_libc.c
+++ b/src/backend/utils/adt/pg_locale_libc.c
@@ -248,13 +248,6 @@ wc_isxdigit_libc_mb(pg_wchar wc, pg_locale_t locale)
#endif
}
-static char
-char_tolower_libc(unsigned char ch, pg_locale_t locale)
-{
- Assert(pg_database_encoding_max_length() == 1);
- return tolower_l(ch, locale->lt);
-}
-
static bool
char_is_cased_libc(char ch, pg_locale_t locale)
{
@@ -338,7 +331,6 @@ static const struct ctype_methods ctype_methods_libc_sb = {
.wc_isspace = wc_isspace_libc_sb,
.wc_isxdigit = wc_isxdigit_libc_sb,
.char_is_cased = char_is_cased_libc,
- .char_tolower = char_tolower_libc,
.wc_toupper = toupper_libc_sb,
.wc_tolower = tolower_libc_sb,
};
@@ -363,7 +355,6 @@ static const struct ctype_methods ctype_methods_libc_other_mb = {
.wc_isspace = wc_isspace_libc_sb,
.wc_isxdigit = wc_isxdigit_libc_sb,
.char_is_cased = char_is_cased_libc,
- .char_tolower = char_tolower_libc,
.wc_toupper = toupper_libc_sb,
.wc_tolower = tolower_libc_sb,
};
@@ -384,7 +375,6 @@ static const struct ctype_methods ctype_methods_libc_utf8 = {
.wc_isspace = wc_isspace_libc_mb,
.wc_isxdigit = wc_isxdigit_libc_mb,
.char_is_cased = char_is_cased_libc,
- .char_tolower = char_tolower_libc,
.wc_toupper = toupper_libc_mb,
.wc_tolower = tolower_libc_mb,
};
diff --git a/src/include/utils/pg_locale.h b/src/include/utils/pg_locale.h
index 40e58cc52b8..e5aaf6422e8 100644
--- a/src/include/utils/pg_locale.h
+++ b/src/include/utils/pg_locale.h
@@ -124,13 +124,6 @@ struct ctype_methods
/* required */
bool (*char_is_cased) (char ch, pg_locale_t locale);
-
- /*
- * Optional. If defined, will only be called for single-byte encodings. If
- * not defined, or if the encoding is multibyte, will fall back to
- * pg_strlower().
- */
- char (*char_tolower) (unsigned char ch, pg_locale_t locale);
};
/*
@@ -182,8 +175,6 @@ extern pg_locale_t pg_newlocale_from_collation(Oid collid);
extern char *get_collation_actual_version(char collprovider, const char *collcollate);
extern bool char_is_cased(char ch, pg_locale_t locale);
-extern bool char_tolower_enabled(pg_locale_t locale);
-extern char char_tolower(unsigned char ch, pg_locale_t locale);
extern size_t pg_strlower(char *dst, size_t dstsize,
const char *src, ssize_t srclen,
pg_locale_t locale);
--
2.43.0
From f8cf19f4764de42851f7b98ce652e8e2ece6af40 Mon Sep 17 00:00:00 2001
From: Jeff Davis <[email protected]>
Date: Fri, 21 Nov 2025 12:14:21 -0800
Subject: [PATCH v9 08/11] Use multibyte-aware extraction of pattern prefixes.
Previously, like_fixed_prefix() used char-at-a-time logic, which
forced it to be too conservative for case-insensitive matching.
Now, use pg_wchar-at-a-time loop for text types, along with proper
detection of cased characters; and preserve and char-at-a-time logic
for bytea.
Removes the pg_locale_t char_is_cased() single-byte method and
replaces it with a proper multibyte pg_iswcased() method.
---
src/backend/utils/adt/like_support.c | 111 +++++++++++++---------
src/backend/utils/adt/pg_locale.c | 26 +++--
src/backend/utils/adt/pg_locale_builtin.c | 7 +-
src/backend/utils/adt/pg_locale_icu.c | 15 ++-
src/backend/utils/adt/pg_locale_libc.c | 23 +++--
src/include/utils/pg_locale.h | 5 +-
6 files changed, 103 insertions(+), 84 deletions(-)
diff --git a/src/backend/utils/adt/like_support.c b/src/backend/utils/adt/like_support.c
index 0debccfa67b..e7255fa652a 100644
--- a/src/backend/utils/adt/like_support.c
+++ b/src/backend/utils/adt/like_support.c
@@ -987,12 +987,11 @@ static Pattern_Prefix_Status
like_fixed_prefix(Const *patt_const, bool case_insensitive, Oid collation,
Const **prefix_const, Selectivity *rest_selec)
{
- char *match;
char *patt;
int pattlen;
Oid typeid = patt_const->consttype;
- int pos,
- match_pos;
+ int pos;
+ int match_pos = 0;
pg_locale_t locale = 0;
/* the right-hand const is type text or bytea */
@@ -1020,67 +1019,91 @@ like_fixed_prefix(Const *patt_const, bool case_insensitive, Oid collation,
locale = pg_newlocale_from_collation(collation);
}
+ /* for text types, use pg_wchar; for BYTEA, use char */
if (typeid != BYTEAOID)
{
- patt = TextDatumGetCString(patt_const->constvalue);
- pattlen = strlen(patt);
+ text *val = DatumGetTextPP(patt_const->constvalue);
+ pg_wchar *wpatt;
+ pg_wchar *wmatch;
+ char *match;
+
+ patt = VARDATA_ANY(val);
+ pattlen = VARSIZE_ANY_EXHDR(val);
+ wpatt = palloc((pattlen + 1) * sizeof(pg_wchar));
+ wmatch = palloc((pattlen + 1) * sizeof(pg_wchar));
+ pg_mb2wchar_with_len(patt, wpatt, pattlen);
+
+ match = palloc(pattlen + 1);
+ for (pos = 0; pos < pattlen; pos++)
+ {
+ /* % and _ are wildcard characters in LIKE */
+ if (wpatt[pos] == '%' ||
+ wpatt[pos] == '_')
+ break;
+
+ /* Backslash escapes the next character */
+ if (wpatt[pos] == '\\')
+ {
+ pos++;
+ if (pos >= pattlen)
+ break;
+ }
+
+ /*
+ * For ILIKE, stop if it's a case-varying character (it's sort of
+ * a wildcard).
+ */
+ if (case_insensitive && pg_iswcased(wpatt[pos], locale))
+ break;
+
+ wmatch[match_pos++] = wpatt[pos];
+ }
+
+ wmatch[match_pos] = '\0';
+
+ pg_wchar2mb_with_len(wmatch, match, pattlen);
+
+ pfree(wpatt);
+ pfree(wmatch);
+
+ *prefix_const = string_to_const(match, typeid);
}
else
{
bytea *bstr = DatumGetByteaPP(patt_const->constvalue);
+ char *match;
+ patt = VARDATA_ANY(bstr);
pattlen = VARSIZE_ANY_EXHDR(bstr);
- patt = (char *) palloc(pattlen);
- memcpy(patt, VARDATA_ANY(bstr), pattlen);
- Assert((Pointer) bstr == DatumGetPointer(patt_const->constvalue));
- }
- match = palloc(pattlen + 1);
- match_pos = 0;
- for (pos = 0; pos < pattlen; pos++)
- {
- /* % and _ are wildcard characters in LIKE */
- if (patt[pos] == '%' ||
- patt[pos] == '_')
- break;
-
- /* Backslash escapes the next character */
- if (patt[pos] == '\\')
+ match = palloc(pattlen + 1);
+ for (pos = 0; pos < pattlen; pos++)
{
- pos++;
- if (pos >= pattlen)
+ /* % and _ are wildcard characters in LIKE */
+ if (patt[pos] == '%' ||
+ patt[pos] == '_')
break;
- }
- /*
- * Stop if case-varying character (it's sort of a wildcard).
- *
- * In multibyte character sets or with non-libc providers, we can't
- * use isalpha, and it does not seem worth trying to convert to
- * wchar_t or char32_t. Instead, just pass the single byte to the
- * provider, which will assume any non-ASCII char is potentially
- * case-varying.
- */
- if (case_insensitive && char_is_cased(patt[pos], locale))
- break;
-
- match[match_pos++] = patt[pos];
- }
+ /* Backslash escapes the next character */
+ if (patt[pos] == '\\')
+ {
+ pos++;
+ if (pos >= pattlen)
+ break;
+ }
- match[match_pos] = '\0';
+ match[match_pos++] = pos;
+ }
- if (typeid != BYTEAOID)
- *prefix_const = string_to_const(match, typeid);
- else
*prefix_const = string_to_bytea_const(match, match_pos);
+ pfree(match);
+ }
+
if (rest_selec != NULL)
*rest_selec = like_selectivity(&patt[pos], pattlen - pos,
case_insensitive);
- pfree(patt);
- pfree(match);
-
/* in LIKE, an empty pattern is an exact match! */
if (pos == pattlen)
return Pattern_Prefix_Exact; /* reached end of pattern, so exact */
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index b3afa6cad6c..6ec7a48f4c3 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -1577,6 +1577,17 @@ pg_iswxdigit(pg_wchar wc, pg_locale_t locale)
return locale->ctype->wc_isxdigit(wc, locale);
}
+bool
+pg_iswcased(pg_wchar wc, pg_locale_t locale)
+{
+ /* for the C locale, Cased and Alpha are equivalent */
+ if (locale->ctype == NULL)
+ return (wc <= (pg_wchar) 127 &&
+ (pg_char_properties[wc] & PG_ISALPHA));
+ else
+ return locale->ctype->wc_iscased(wc, locale);
+}
+
pg_wchar
pg_towupper(pg_wchar wc, pg_locale_t locale)
{
@@ -1603,21 +1614,6 @@ pg_towlower(pg_wchar wc, pg_locale_t locale)
return locale->ctype->wc_tolower(wc, locale);
}
-/*
- * char_is_cased()
- *
- * Fuzzy test of whether the given char is case-varying or not. The argument
- * is a single byte, so in a multibyte encoding, just assume any non-ASCII
- * char is case-varying.
- */
-bool
-char_is_cased(char ch, pg_locale_t locale)
-{
- if (locale->ctype == NULL)
- return (ch >= 'A' && ch <= 'Z') || (ch >= 'a' && ch <= 'z');
- return locale->ctype->char_is_cased(ch, locale);
-}
-
/*
* Return required encoding ID for the given locale, or -1 if any encoding is
* valid for the locale.
diff --git a/src/backend/utils/adt/pg_locale_builtin.c b/src/backend/utils/adt/pg_locale_builtin.c
index 1021e0d129b..0c2920112bb 100644
--- a/src/backend/utils/adt/pg_locale_builtin.c
+++ b/src/backend/utils/adt/pg_locale_builtin.c
@@ -186,10 +186,9 @@ wc_isxdigit_builtin(pg_wchar wc, pg_locale_t locale)
}
static bool
-char_is_cased_builtin(char ch, pg_locale_t locale)
+wc_iscased_builtin(pg_wchar wc, pg_locale_t locale)
{
- return IS_HIGHBIT_SET(ch) ||
- (ch >= 'A' && ch <= 'Z') || (ch >= 'a' && ch <= 'z');
+ return pg_u_prop_cased(to_char32(wc));
}
static pg_wchar
@@ -219,7 +218,7 @@ static const struct ctype_methods ctype_methods_builtin = {
.wc_ispunct = wc_ispunct_builtin,
.wc_isspace = wc_isspace_builtin,
.wc_isxdigit = wc_isxdigit_builtin,
- .char_is_cased = char_is_cased_builtin,
+ .wc_iscased = wc_iscased_builtin,
.wc_tolower = wc_tolower_builtin,
.wc_toupper = wc_toupper_builtin,
};
diff --git a/src/backend/utils/adt/pg_locale_icu.c b/src/backend/utils/adt/pg_locale_icu.c
index f5a0cc8fe41..18d026deda8 100644
--- a/src/backend/utils/adt/pg_locale_icu.c
+++ b/src/backend/utils/adt/pg_locale_icu.c
@@ -121,13 +121,6 @@ static int32_t u_strFoldCase_default(UChar *dest, int32_t destCapacity,
const char *locale,
UErrorCode *pErrorCode);
-static bool
-char_is_cased_icu(char ch, pg_locale_t locale)
-{
- return IS_HIGHBIT_SET(ch) ||
- (ch >= 'A' && ch <= 'Z') || (ch >= 'a' && ch <= 'z');
-}
-
/*
* XXX: many of the functions below rely on casts directly from pg_wchar to
* UChar32, which is correct for the UTF-8 encoding, but not in general.
@@ -223,6 +216,12 @@ wc_isxdigit_icu(pg_wchar wc, pg_locale_t locale)
return u_isxdigit(wc);
}
+static bool
+wc_iscased_icu(pg_wchar wc, pg_locale_t locale)
+{
+ return u_hasBinaryProperty(wc, UCHAR_CASED);
+}
+
static const struct ctype_methods ctype_methods_icu = {
.strlower = strlower_icu,
.strtitle = strtitle_icu,
@@ -238,7 +237,7 @@ static const struct ctype_methods ctype_methods_icu = {
.wc_ispunct = wc_ispunct_icu,
.wc_isspace = wc_isspace_icu,
.wc_isxdigit = wc_isxdigit_icu,
- .char_is_cased = char_is_cased_icu,
+ .wc_iscased = wc_iscased_icu,
.wc_toupper = toupper_icu,
.wc_tolower = tolower_icu,
};
diff --git a/src/backend/utils/adt/pg_locale_libc.c b/src/backend/utils/adt/pg_locale_libc.c
index feb63bbdad1..4c20797ad5c 100644
--- a/src/backend/utils/adt/pg_locale_libc.c
+++ b/src/backend/utils/adt/pg_locale_libc.c
@@ -184,6 +184,13 @@ wc_isxdigit_libc_sb(pg_wchar wc, pg_locale_t locale)
#endif
}
+static bool
+wc_iscased_libc_sb(pg_wchar wc, pg_locale_t locale)
+{
+ return isupper_l((unsigned char) wc, locale->lt) ||
+ islower_l((unsigned char) wc, locale->lt);
+}
+
static bool
wc_isdigit_libc_mb(pg_wchar wc, pg_locale_t locale)
{
@@ -249,14 +256,10 @@ wc_isxdigit_libc_mb(pg_wchar wc, pg_locale_t locale)
}
static bool
-char_is_cased_libc(char ch, pg_locale_t locale)
+wc_iscased_libc_mb(pg_wchar wc, pg_locale_t locale)
{
- bool is_multibyte = pg_database_encoding_max_length() > 1;
-
- if (is_multibyte && IS_HIGHBIT_SET(ch))
- return true;
- else
- return isalpha_l((unsigned char) ch, locale->lt);
+ return iswupper_l((wint_t) wc, locale->lt) ||
+ iswlower_l((wint_t) wc, locale->lt);
}
static pg_wchar
@@ -330,7 +333,7 @@ static const struct ctype_methods ctype_methods_libc_sb = {
.wc_ispunct = wc_ispunct_libc_sb,
.wc_isspace = wc_isspace_libc_sb,
.wc_isxdigit = wc_isxdigit_libc_sb,
- .char_is_cased = char_is_cased_libc,
+ .wc_iscased = wc_iscased_libc_sb,
.wc_toupper = toupper_libc_sb,
.wc_tolower = tolower_libc_sb,
};
@@ -354,7 +357,7 @@ static const struct ctype_methods ctype_methods_libc_other_mb = {
.wc_ispunct = wc_ispunct_libc_sb,
.wc_isspace = wc_isspace_libc_sb,
.wc_isxdigit = wc_isxdigit_libc_sb,
- .char_is_cased = char_is_cased_libc,
+ .wc_iscased = wc_iscased_libc_sb,
.wc_toupper = toupper_libc_sb,
.wc_tolower = tolower_libc_sb,
};
@@ -374,7 +377,7 @@ static const struct ctype_methods ctype_methods_libc_utf8 = {
.wc_ispunct = wc_ispunct_libc_mb,
.wc_isspace = wc_isspace_libc_mb,
.wc_isxdigit = wc_isxdigit_libc_mb,
- .char_is_cased = char_is_cased_libc,
+ .wc_iscased = wc_iscased_libc_mb,
.wc_toupper = toupper_libc_mb,
.wc_tolower = tolower_libc_mb,
};
diff --git a/src/include/utils/pg_locale.h b/src/include/utils/pg_locale.h
index e5aaf6422e8..6dda56d1c3c 100644
--- a/src/include/utils/pg_locale.h
+++ b/src/include/utils/pg_locale.h
@@ -119,11 +119,9 @@ struct ctype_methods
bool (*wc_ispunct) (pg_wchar wc, pg_locale_t locale);
bool (*wc_isspace) (pg_wchar wc, pg_locale_t locale);
bool (*wc_isxdigit) (pg_wchar wc, pg_locale_t locale);
+ bool (*wc_iscased) (pg_wchar wc, pg_locale_t locale);
pg_wchar (*wc_toupper) (pg_wchar wc, pg_locale_t locale);
pg_wchar (*wc_tolower) (pg_wchar wc, pg_locale_t locale);
-
- /* required */
- bool (*char_is_cased) (char ch, pg_locale_t locale);
};
/*
@@ -211,6 +209,7 @@ extern bool pg_iswprint(pg_wchar wc, pg_locale_t locale);
extern bool pg_iswpunct(pg_wchar wc, pg_locale_t locale);
extern bool pg_iswspace(pg_wchar wc, pg_locale_t locale);
extern bool pg_iswxdigit(pg_wchar wc, pg_locale_t locale);
+extern bool pg_iswcased(pg_wchar wc, pg_locale_t locale);
extern pg_wchar pg_towupper(pg_wchar wc, pg_locale_t locale);
extern pg_wchar pg_towlower(pg_wchar wc, pg_locale_t locale);
--
2.43.0
From 6fe24276682edd24c4c9a20fb11747c095bd7744 Mon Sep 17 00:00:00 2001
From: Jeff Davis <[email protected]>
Date: Wed, 19 Nov 2025 13:24:38 -0800
Subject: [PATCH v9 09/11] fuzzystrmatch: use pg_ascii_toupper().
fuzzystrmatch is designed for ASCII, so no need to rely on the global
LC_CTYPE setting.
---
contrib/fuzzystrmatch/dmetaphone.c | 2 +-
contrib/fuzzystrmatch/fuzzystrmatch.c | 16 ++++++++--------
2 files changed, 9 insertions(+), 9 deletions(-)
diff --git a/contrib/fuzzystrmatch/dmetaphone.c b/contrib/fuzzystrmatch/dmetaphone.c
index 6627b2b8943..bb5d3e90756 100644
--- a/contrib/fuzzystrmatch/dmetaphone.c
+++ b/contrib/fuzzystrmatch/dmetaphone.c
@@ -284,7 +284,7 @@ MakeUpper(metastring *s)
char *i;
for (i = s->str; *i; i++)
- *i = toupper((unsigned char) *i);
+ *i = pg_ascii_toupper((unsigned char) *i);
}
diff --git a/contrib/fuzzystrmatch/fuzzystrmatch.c b/contrib/fuzzystrmatch/fuzzystrmatch.c
index e7cc314b763..7f07efc2c35 100644
--- a/contrib/fuzzystrmatch/fuzzystrmatch.c
+++ b/contrib/fuzzystrmatch/fuzzystrmatch.c
@@ -62,7 +62,7 @@ static const char *const soundex_table = "01230120022455012623010202";
static char
soundex_code(char letter)
{
- letter = toupper((unsigned char) letter);
+ letter = pg_ascii_toupper((unsigned char) letter);
/* Defend against non-ASCII letters */
if (letter >= 'A' && letter <= 'Z')
return soundex_table[letter - 'A'];
@@ -124,7 +124,7 @@ getcode(char c)
{
if (isalpha((unsigned char) c))
{
- c = toupper((unsigned char) c);
+ c = pg_ascii_toupper((unsigned char) c);
/* Defend against non-ASCII letters */
if (c >= 'A' && c <= 'Z')
return _codes[c - 'A'];
@@ -301,18 +301,18 @@ metaphone(PG_FUNCTION_ARGS)
* accessing the array directly... */
/* Look at the next letter in the word */
-#define Next_Letter (toupper((unsigned char) word[w_idx+1]))
+#define Next_Letter (pg_ascii_toupper((unsigned char) word[w_idx+1]))
/* Look at the current letter in the word */
-#define Curr_Letter (toupper((unsigned char) word[w_idx]))
+#define Curr_Letter (pg_ascii_toupper((unsigned char) word[w_idx]))
/* Go N letters back. */
#define Look_Back_Letter(n) \
- (w_idx >= (n) ? toupper((unsigned char) word[w_idx-(n)]) : '\0')
+ (w_idx >= (n) ? pg_ascii_toupper((unsigned char) word[w_idx-(n)]) : '\0')
/* Previous letter. I dunno, should this return null on failure? */
#define Prev_Letter (Look_Back_Letter(1))
/* Look two letters down. It makes sure you don't walk off the string. */
#define After_Next_Letter \
- (Next_Letter != '\0' ? toupper((unsigned char) word[w_idx+2]) : '\0')
-#define Look_Ahead_Letter(n) toupper((unsigned char) Lookahead(word+w_idx, n))
+ (Next_Letter != '\0' ? pg_ascii_toupper((unsigned char) word[w_idx+2]) : '\0')
+#define Look_Ahead_Letter(n) pg_ascii_toupper((unsigned char) Lookahead(word+w_idx, n))
/* Allows us to safely look ahead an arbitrary # of letters */
@@ -742,7 +742,7 @@ _soundex(const char *instr, char *outstr)
}
/* Take the first letter as is */
- *outstr++ = (char) toupper((unsigned char) *instr++);
+ *outstr++ = (char) pg_ascii_toupper((unsigned char) *instr++);
count = 1;
while (*instr && count < SOUNDEX_LEN)
--
2.43.0
From 3ab850c32c4bfef9102117385ede98080d8cf4b6 Mon Sep 17 00:00:00 2001
From: Jeff Davis <[email protected]>
Date: Mon, 20 Oct 2025 16:32:18 -0700
Subject: [PATCH v9 10/11] downcase_identifier(): use method table from locale
provider.
Previously, libc's tolower() was always used for identifier case
folding, regardless of the database locale (though only characters
beyond 127 in single-byte encodings were affected). Refactor to allow
each provider to supply its own implementation of identifier
casefolding.
For historical compatibility, when using a single-byte encoding, ICU
still relies on tolower().
One minor behavior change is that, before the database default locale
is initialized, it uses ASCII semantics to fold the
identifiers. Previously, it would use the postmaster's LC_CTYPE
setting from the environment. While that could have some effect during
GUC processing, for example, it would have been fragile to rely on the
environment setting anyway. (Also, it only matters when the encoding
is single-byte.)
---
src/backend/parser/scansup.c | 39 +++++++---------
src/backend/utils/adt/pg_locale.c | 32 +++++++++++++
src/backend/utils/adt/pg_locale_builtin.c | 24 ++++++++++
src/backend/utils/adt/pg_locale_icu.c | 36 ++++++++++++++-
src/backend/utils/adt/pg_locale_libc.c | 55 +++++++++++++++++++++++
src/include/utils/pg_locale.h | 5 +++
6 files changed, 166 insertions(+), 25 deletions(-)
diff --git a/src/backend/parser/scansup.c b/src/backend/parser/scansup.c
index 2feb2b6cf5a..0bd049643d1 100644
--- a/src/backend/parser/scansup.c
+++ b/src/backend/parser/scansup.c
@@ -18,6 +18,7 @@
#include "mb/pg_wchar.h"
#include "parser/scansup.h"
+#include "utils/pg_locale.h"
/*
@@ -46,35 +47,25 @@ char *
downcase_identifier(const char *ident, int len, bool warn, bool truncate)
{
char *result;
- int i;
- bool enc_is_single_byte;
-
- result = palloc(len + 1);
- enc_is_single_byte = pg_database_encoding_max_length() == 1;
+ size_t dstsize;
+ size_t needed pg_attribute_unused();
/*
- * SQL99 specifies Unicode-aware case normalization, which we don't yet
- * have the infrastructure for. Instead we use tolower() to provide a
- * locale-aware translation. However, there are some locales where this
- * is not right either (eg, Turkish may do strange things with 'i' and
- * 'I'). Our current compromise is to use tolower() for characters with
- * the high bit set, as long as they aren't part of a multi-byte
- * character, and use an ASCII-only downcasing for 7-bit characters.
+ * Preserves string length.
+ *
+ * NB: if we decide to support Unicode-aware identifier case folding, then
+ * we need to account for a change in string length.
*/
- for (i = 0; i < len; i++)
- {
- unsigned char ch = (unsigned char) ident[i];
+ dstsize = len + 1;
+ result = palloc(dstsize);
- if (ch >= 'A' && ch <= 'Z')
- ch += 'a' - 'A';
- else if (enc_is_single_byte && IS_HIGHBIT_SET(ch) && isupper(ch))
- ch = tolower(ch);
- result[i] = (char) ch;
- }
- result[i] = '\0';
+ needed = pg_strfold_ident(result, dstsize, ident, len);
+ Assert(needed + 1 == dstsize);
+ Assert(needed == len);
+ Assert(result[len] == '\0');
- if (i >= NAMEDATALEN && truncate)
- truncate_identifier(result, i, warn);
+ if (len >= NAMEDATALEN && truncate)
+ truncate_identifier(result, len, warn);
return result;
}
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 6ec7a48f4c3..68227367339 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -1341,6 +1341,38 @@ pg_strfold(char *dst, size_t dstsize, const char *src, ssize_t srclen,
return locale->ctype->strfold(dst, dstsize, src, srclen, locale);
}
+/*
+ * Fold an identifier using the database default locale.
+ *
+ * For historical reasons, does not use ordinary locale behavior. Should only
+ * be used for identifier folding. XXX: can we make this equivalent to
+ * pg_strfold(..., default_locale)?
+ */
+size_t
+pg_strfold_ident(char *dest, size_t destsize, const char *src, ssize_t srclen)
+{
+ if (default_locale == NULL || default_locale->ctype == NULL)
+ {
+ int i;
+
+ for (i = 0; i < srclen && i < destsize; i++)
+ {
+ unsigned char ch = (unsigned char) src[i];
+
+ if (ch >= 'A' && ch <= 'Z')
+ ch += 'a' - 'A';
+ dest[i] = (char) ch;
+ }
+
+ if (i < destsize)
+ dest[i] = '\0';
+
+ return srclen;
+ }
+ return default_locale->ctype->strfold_ident(dest, destsize, src, srclen,
+ default_locale);
+}
+
/*
* pg_strcoll
*
diff --git a/src/backend/utils/adt/pg_locale_builtin.c b/src/backend/utils/adt/pg_locale_builtin.c
index 0c2920112bb..659e588d513 100644
--- a/src/backend/utils/adt/pg_locale_builtin.c
+++ b/src/backend/utils/adt/pg_locale_builtin.c
@@ -125,6 +125,29 @@ strfold_builtin(char *dest, size_t destsize, const char *src, ssize_t srclen,
locale->builtin.casemap_full);
}
+static size_t
+strfold_ident_builtin(char *dst, size_t dstsize, const char *src,
+ ssize_t srclen, pg_locale_t locale)
+{
+ int i;
+
+ Assert(GetDatabaseEncoding() == PG_UTF8);
+
+ for (i = 0; i < srclen && i < dstsize; i++)
+ {
+ unsigned char ch = (unsigned char) src[i];
+
+ if (ch >= 'A' && ch <= 'Z')
+ ch += 'a' - 'A';
+ dst[i] = (char) ch;
+ }
+
+ if (i < dstsize)
+ dst[i] = '\0';
+
+ return srclen;
+}
+
static bool
wc_isdigit_builtin(pg_wchar wc, pg_locale_t locale)
{
@@ -208,6 +231,7 @@ static const struct ctype_methods ctype_methods_builtin = {
.strtitle = strtitle_builtin,
.strupper = strupper_builtin,
.strfold = strfold_builtin,
+ .strfold_ident = strfold_ident_builtin,
.wc_isdigit = wc_isdigit_builtin,
.wc_isalpha = wc_isalpha_builtin,
.wc_isalnum = wc_isalnum_builtin,
diff --git a/src/backend/utils/adt/pg_locale_icu.c b/src/backend/utils/adt/pg_locale_icu.c
index 18d026deda8..39b153a4262 100644
--- a/src/backend/utils/adt/pg_locale_icu.c
+++ b/src/backend/utils/adt/pg_locale_icu.c
@@ -61,6 +61,8 @@ static size_t strupper_icu(char *dest, size_t destsize, const char *src,
ssize_t srclen, pg_locale_t locale);
static size_t strfold_icu(char *dest, size_t destsize, const char *src,
ssize_t srclen, pg_locale_t locale);
+static size_t strfold_ident_icu(char *dst, size_t dstsize, const char *src,
+ ssize_t srclen, pg_locale_t locale);
static int strncoll_icu(const char *arg1, ssize_t len1,
const char *arg2, ssize_t len2,
pg_locale_t locale);
@@ -123,7 +125,7 @@ static int32_t u_strFoldCase_default(UChar *dest, int32_t destCapacity,
/*
* XXX: many of the functions below rely on casts directly from pg_wchar to
- * UChar32, which is correct for the UTF-8 encoding, but not in general.
+ * UChar32, which is correct for UTF-8 and LATIN1, but not in general.
*/
static pg_wchar
@@ -227,6 +229,7 @@ static const struct ctype_methods ctype_methods_icu = {
.strtitle = strtitle_icu,
.strupper = strupper_icu,
.strfold = strfold_icu,
+ .strfold_ident = strfold_ident_icu,
.wc_isdigit = wc_isdigit_icu,
.wc_isalpha = wc_isalpha_icu,
.wc_isalnum = wc_isalnum_icu,
@@ -564,6 +567,37 @@ strfold_icu(char *dest, size_t destsize, const char *src, ssize_t srclen,
return result_len;
}
+/*
+ * For historical compatibility, behavior is not multibyte-aware.
+ *
+ * NB: uses libc tolower() for single-byte encodings (also for historical
+ * compatibility), and therefore relies on the global LC_CTYPE setting.
+ */
+static size_t
+strfold_ident_icu(char *dst, size_t dstsize, const char *src,
+ ssize_t srclen, pg_locale_t locale)
+{
+ int i;
+ bool enc_is_single_byte;
+
+ enc_is_single_byte = pg_database_encoding_max_length() == 1;
+ for (i = 0; i < srclen && i < dstsize; i++)
+ {
+ unsigned char ch = (unsigned char) src[i];
+
+ if (ch >= 'A' && ch <= 'Z')
+ ch += 'a' - 'A';
+ else if (enc_is_single_byte && IS_HIGHBIT_SET(ch) && isupper(ch))
+ ch = tolower(ch);
+ dst[i] = (char) ch;
+ }
+
+ if (i < dstsize)
+ dst[i] = '\0';
+
+ return srclen;
+}
+
/*
* strncoll_icu_utf8
*
diff --git a/src/backend/utils/adt/pg_locale_libc.c b/src/backend/utils/adt/pg_locale_libc.c
index 4c20797ad5c..10d332888df 100644
--- a/src/backend/utils/adt/pg_locale_libc.c
+++ b/src/backend/utils/adt/pg_locale_libc.c
@@ -318,11 +318,64 @@ tolower_libc_mb(pg_wchar wc, pg_locale_t locale)
return wc;
}
+/*
+ * Characters A..Z always fold to a..z, even in the Turkish locale. Characters
+ * beyond 127 use tolower().
+ */
+static size_t
+strfold_ident_libc_sb(char *dst, size_t dstsize, const char *src,
+ ssize_t srclen, pg_locale_t locale)
+{
+ locale_t loc = locale->lt;
+ int i;
+
+ for (i = 0; i < srclen && i < dstsize; i++)
+ {
+ unsigned char ch = (unsigned char) src[i];
+
+ if (ch >= 'A' && ch <= 'Z')
+ ch += 'a' - 'A';
+ else if (IS_HIGHBIT_SET(ch) && isupper_l(ch, loc))
+ ch = tolower_l(ch, loc);
+ dst[i] = (char) ch;
+ }
+
+ if (i < dstsize)
+ dst[i] = '\0';
+
+ return srclen;
+}
+
+/*
+ * For historical reasons, not multibyte-aware; uses plain ASCII semantics.
+ */
+static size_t
+strfold_ident_libc_mb(char *dst, size_t dstsize, const char *src,
+ ssize_t srclen, pg_locale_t locale)
+{
+ int i;
+
+ for (i = 0; i < srclen && i < dstsize; i++)
+ {
+ unsigned char ch = (unsigned char) src[i];
+
+ if (ch >= 'A' && ch <= 'Z')
+ ch += 'a' - 'A';
+ dst[i] = (char) ch;
+ }
+
+ if (i < dstsize)
+ dst[i] = '\0';
+
+ return srclen;
+}
+
static const struct ctype_methods ctype_methods_libc_sb = {
.strlower = strlower_libc_sb,
.strtitle = strtitle_libc_sb,
.strupper = strupper_libc_sb,
.strfold = strlower_libc_sb,
+ .strfold_ident = strfold_ident_libc_sb,
.wc_isdigit = wc_isdigit_libc_sb,
.wc_isalpha = wc_isalpha_libc_sb,
.wc_isalnum = wc_isalnum_libc_sb,
@@ -347,6 +400,7 @@ static const struct ctype_methods ctype_methods_libc_other_mb = {
.strtitle = strtitle_libc_mb,
.strupper = strupper_libc_mb,
.strfold = strlower_libc_mb,
+ .strfold_ident = strfold_ident_libc_mb,
.wc_isdigit = wc_isdigit_libc_sb,
.wc_isalpha = wc_isalpha_libc_sb,
.wc_isalnum = wc_isalnum_libc_sb,
@@ -367,6 +421,7 @@ static const struct ctype_methods ctype_methods_libc_utf8 = {
.strtitle = strtitle_libc_mb,
.strupper = strupper_libc_mb,
.strfold = strlower_libc_mb,
+ .strfold_ident = strfold_ident_libc_mb,
.wc_isdigit = wc_isdigit_libc_mb,
.wc_isalpha = wc_isalpha_libc_mb,
.wc_isalnum = wc_isalnum_libc_mb,
diff --git a/src/include/utils/pg_locale.h b/src/include/utils/pg_locale.h
index 6dda56d1c3c..b5251d175a9 100644
--- a/src/include/utils/pg_locale.h
+++ b/src/include/utils/pg_locale.h
@@ -107,6 +107,9 @@ struct ctype_methods
size_t (*strfold) (char *dest, size_t destsize,
const char *src, ssize_t srclen,
pg_locale_t locale);
+ size_t (*strfold_ident) (char *dest, size_t destsize,
+ const char *src, ssize_t srclen,
+ pg_locale_t locale);
/* required */
bool (*wc_isdigit) (pg_wchar wc, pg_locale_t locale);
@@ -185,6 +188,8 @@ extern size_t pg_strupper(char *dst, size_t dstsize,
extern size_t pg_strfold(char *dst, size_t dstsize,
const char *src, ssize_t srclen,
pg_locale_t locale);
+extern size_t pg_strfold_ident(char *dst, size_t dstsize,
+ const char *src, ssize_t srclen);
extern int pg_strcoll(const char *arg1, const char *arg2, pg_locale_t locale);
extern int pg_strncoll(const char *arg1, ssize_t len1,
const char *arg2, ssize_t len2, pg_locale_t locale);
--
2.43.0
From c994f82e10a8910712683dd2d21679002d3697ab Mon Sep 17 00:00:00 2001
From: Jeff Davis <[email protected]>
Date: Mon, 24 Nov 2025 14:00:52 -0800
Subject: [PATCH v9 11/11] Control LC_COLLATE with GUC.
Now that the global LC_COLLATE setting is not used for any in-core
purpose at all (see commit 5e6e42e44f), allow it to be set with a
GUC. This may be useful for extensions or procedural languages that
still depend on the global LC_COLLATE setting.
---
src/backend/utils/adt/pg_locale.c | 59 +++++++++++++++++++
src/backend/utils/init/postinit.c | 2 +
src/backend/utils/misc/guc_parameters.dat | 9 +++
src/backend/utils/misc/postgresql.conf.sample | 2 +
src/bin/initdb/initdb.c | 3 +
src/include/utils/guc_hooks.h | 2 +
src/include/utils/pg_locale.h | 1 +
7 files changed, 78 insertions(+)
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 68227367339..143202abbad 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -81,6 +81,7 @@ extern pg_locale_t create_pg_locale_libc(Oid collid, MemoryContext context);
extern char *get_collation_actual_version_libc(const char *collcollate);
/* GUC settings */
+char *locale_collate;
char *locale_messages;
char *locale_monetary;
char *locale_numeric;
@@ -369,6 +370,64 @@ assign_locale_time(const char *newval, void *extra)
CurrentLCTimeValid = false;
}
+/*
+ * We allow LC_COLLATE to actually be set globally.
+ *
+ * Note: we normally disallow value = "" because it wouldn't have consistent
+ * semantics (it'd effectively just use the previous value). However, this
+ * is the value passed for PGC_S_DEFAULT, so don't complain in that case,
+ * not even if the attempted setting fails due to invalid environment value.
+ * The idea there is just to accept the environment setting *if possible*
+ * during startup, until we can read the proper value from postgresql.conf.
+ */
+bool
+check_locale_collate(char **newval, void **extra, GucSource source)
+{
+ int locale_enc;
+ int db_enc;
+
+ if (**newval == '\0')
+ {
+ if (source == PGC_S_DEFAULT)
+ return true;
+ else
+ return false;
+ }
+
+ locale_enc = pg_get_encoding_from_locale(*newval, true);
+ db_enc = GetDatabaseEncoding();
+
+ if (!(locale_enc == db_enc ||
+ locale_enc == PG_SQL_ASCII ||
+ db_enc == PG_SQL_ASCII ||
+ locale_enc == -1))
+ {
+ if (source == PGC_S_FILE)
+ {
+ guc_free(*newval);
+ *newval = guc_strdup(LOG, "C");
+ if (!*newval)
+ return false;
+ }
+ else if (source != PGC_S_TEST)
+ {
+ ereport(WARNING,
+ (errmsg("encoding mismatch"),
+ errdetail("Locale \"%s\" uses encoding \"%s\", which does not match database encoding \"%s\".",
+ *newval, pg_encoding_to_char(locale_enc), pg_encoding_to_char(db_enc))));
+ return false;
+ }
+ }
+
+ return check_locale(LC_COLLATE, *newval, NULL);
+}
+
+void
+assign_locale_collate(const char *newval, void *extra)
+{
+ (void) pg_perm_setlocale(LC_COLLATE, newval);
+}
+
/*
* We allow LC_MESSAGES to actually be set globally.
*
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 98f9598cd78..c99d57eba48 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -404,6 +404,8 @@ CheckMyDatabase(const char *name, bool am_superuser, bool override_allow_connect
* the pg_database tuple.
*/
SetDatabaseEncoding(dbform->encoding);
+ /* Reset lc_collate to check encoding, and fall back to C if necessary */
+ SetConfigOption("lc_collate", locale_collate, PGC_POSTMASTER, PGC_S_FILE);
/* Record it as a GUC internal option, too */
SetConfigOption("server_encoding", GetDatabaseEncodingName(),
PGC_INTERNAL, PGC_S_DYNAMIC_DEFAULT);
diff --git a/src/backend/utils/misc/guc_parameters.dat b/src/backend/utils/misc/guc_parameters.dat
index 1128167c025..a3da16eadb1 100644
--- a/src/backend/utils/misc/guc_parameters.dat
+++ b/src/backend/utils/misc/guc_parameters.dat
@@ -1450,6 +1450,15 @@
boot_val => 'PG_KRB_SRVTAB',
},
+{ name => 'lc_collate', type => 'string', context => 'PGC_SUSET', group => 'CLIENT_CONN_LOCALE',
+ short_desc => 'Sets the locale for text ordering in extensions.',
+ long_desc => 'An empty string means use the operating system setting.',
+ variable => 'locale_collate',
+ boot_val => '""',
+ check_hook => 'check_locale_collate',
+ assign_hook => 'assign_locale_collate',
+},
+
{ name => 'lc_messages', type => 'string', context => 'PGC_SUSET', group => 'CLIENT_CONN_LOCALE',
short_desc => 'Sets the language in which messages are displayed.',
long_desc => 'An empty string means use the operating system setting.',
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index dc9e2255f8a..19332e39e82 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -798,6 +798,8 @@
# encoding
# These settings are initialized by initdb, but they can be changed.
+#lc_collate = '' # locale for text ordering (only affects
+ # extensions)
#lc_messages = '' # locale for system error message
# strings
#lc_monetary = 'C' # locale for monetary formatting
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 92fe2f531f7..8b2e7bfab6f 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -1312,6 +1312,9 @@ setup_config(void)
conflines = replace_guc_value(conflines, "shared_buffers",
repltok, false);
+ conflines = replace_guc_value(conflines, "lc_collate",
+ lc_collate, false);
+
conflines = replace_guc_value(conflines, "lc_messages",
lc_messages, false);
diff --git a/src/include/utils/guc_hooks.h b/src/include/utils/guc_hooks.h
index 82ac8646a8d..8a20f76eec8 100644
--- a/src/include/utils/guc_hooks.h
+++ b/src/include/utils/guc_hooks.h
@@ -65,6 +65,8 @@ extern bool check_huge_page_size(int *newval, void **extra, GucSource source);
extern void assign_io_method(int newval, void *extra);
extern bool check_io_max_concurrency(int *newval, void **extra, GucSource source);
extern const char *show_in_hot_standby(void);
+extern bool check_locale_collate(char **newval, void **extra, GucSource source);
+extern void assign_locale_collate(const char *newval, void *extra);
extern bool check_locale_messages(char **newval, void **extra, GucSource source);
extern void assign_locale_messages(const char *newval, void *extra);
extern bool check_locale_monetary(char **newval, void **extra, GucSource source);
diff --git a/src/include/utils/pg_locale.h b/src/include/utils/pg_locale.h
index b5251d175a9..9c7371476c1 100644
--- a/src/include/utils/pg_locale.h
+++ b/src/include/utils/pg_locale.h
@@ -38,6 +38,7 @@
#define UNICODE_CASEMAP_BUFSZ (UNICODE_CASEMAP_LEN * sizeof(char32_t))
/* GUC settings */
+extern PGDLLIMPORT char *locale_collate;
extern PGDLLIMPORT char *locale_messages;
extern PGDLLIMPORT char *locale_monetary;
extern PGDLLIMPORT char *locale_numeric;
--
2.43.0