[HACKERS] ICU locales and text/char(n) SortSupport on Windows

Peter Geoghegan Sat, 16 Sep 2017 15:35:42 -0700

varstr_sortsupport() only allows Windows to use SortSupport with a
non-C-locale (when the server encoding happens to be UTF-8, which I
assume is the common case). This is because we (quite reasonably)
don't want to have to duplicate the ugly UTF-8 to UTF-16 conversion
hack from varstr_cmp() for the SortSupport authoritative comparator
(varstrfastcmp_locale() shouldn't get its own copy of this kludge,
because it's supposed to be "fast"). This broad restriction made sense
when Windows + UTF-8 + non-C-locale necessarily required the
aforementioned UTF-16 conversion kludge. However, iff an ICU locale is
in use on Windows (or any other platform), then we can always use
SortSupport, regardless of anything else (we should not have the core
code install a fmgr comparison shim that just calls varstr_cmp(),
though we still do). We don't actually need the UTF-16 kludge at all,
so we can use SortSupport without any special care.


The current state of affairs doesn't make any sense, AFAICT, and so
the restriction should be removed on general principle: we *already*
expect ICU to have no restrictions that are peculiar to Windows, as we
see in varstr_cmp() and str_tolower(). It's just arbitrary to hold on
to this restriction. This restriction also seems worth fixing because
Windows users are generally more likely to want to use ICU locales;
most of them would otherwise end up actually paying the overhead for
the UTF-16 kludge. (Presumably the UTF-16 conversion makes text
sorting *even slower* than it would be if we merely didn't do
SortSupport, which is to say: very slow indeed.)

In summary, we're currently attaching the use of SortSupport to the
wrong thing. We're treating this UTF-16 business as something that
implies a broad OS/platform restriction, when in fact it should be
treated as implying a restriction for one particular collation
provider only (a collation provider that happens to be built into
Windows, but isn't really special to us).

Attached patch shows what I'm getting at. This is untested, since I
don't use Windows. Proceed with caution.

On a related note, am I the only one that finds it questionable that
str_tolower() has an "#ifdef USE_WIDE_UPPER_LOWER" block that itself
contains an "#ifdef USE_ICU" block? It seems like those two things
might get conflated on some platforms. We don't want lower() to ever
not use the ICU infrastructure when an ICU collation is used, and yet
it's not obvious that that's impossible. I understand that the code in
regc_pg_locale.c kind of insists on using USE_WIDE_UPPER_LOWER
facilities, and that that was always accepted as legacy that ICU had
to live with. Maybe a static assertion is all that we need here (ICU
builds must also be USE_WIDE_UPPER_LOWER builds).

-- 
Peter Geoghegan

From fca07aca51fb7979f19180712707233ff0e6f4b4 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <p...@bowt.ie>
Date: Sat, 16 Sep 2017 13:36:25 -0700
Subject: [PATCH] Allow ICU to use SortSupport on Windows with UTF-8.

There is no reason to ever prevent the use of SortSupport on Windows
when ICU locales are used.  We previously avoided SortSupport on Windows
with UTF-8 server encoding and a non C-locale due to restrictions in
Windows' libc functionality.

This is now considered to be a restriction in one platform's libc
collation provider, and not a more general platform restriction.
---
 src/backend/utils/adt/varlena.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index ebfb823..ad8ebed 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -1828,7 +1828,8 @@ varstr_sortsupport(SortSupport ssup, Oid collid, bool bpchar)
 	 * UTF-8 and we are not using the C collation, complex hacks are required.
 	 * We don't currently have a comparator that handles that case, so we fall
 	 * back on the slow method of having the sort code invoke bttextcmp() (in
-	 * the case of text) via the fmgr trampoline.
+	 * the case of text) via the fmgr trampoline.  (ICU locales work just the
+	 * same on Windows, however.)
 	 */
 	if (lc_collate_is_c(collid))
 	{
@@ -1840,7 +1841,8 @@ varstr_sortsupport(SortSupport ssup, Oid collid, bool bpchar)
 		collate_c = true;
 	}
 #ifdef WIN32
-	else if (GetDatabaseEncoding() == PG_UTF8)
+	else if (GetDatabaseEncoding() == PG_UTF8 &&
+			 !(locale && locale->provider == COLLPROVIDER_ICU))
 		return;
 #endif
 	else
-- 
2.7.4

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] ICU locales and text/char(n) SortSupport on Windows

Reply via email to