Hi, Tom! Thanks for your feedback. After looking into it further, it seems the performance issue is indeed related to the default collation settings, particularly when handling certain special characters like < in the glibc strcoll_l function. This was confirmed during my testing on Debian 12 with glibc version 2.36 (this OS and glibc are being used in our office's Docker image: https://hub.docker.com/_/postgres).
My test database settings: test_db=# \d+ test Table "public.test" Column | Type | Collation | Nullable | Default | Storage | Compression | Stats target | Description --------+-----------------------+-----------+----------+---------+----------+-------------+--------------+------------- value | character varying(10) | | not null | | extended | | | Indexes: "idx_test" btree (value) Access method: heap test_db=# \l test_db List of databases Name | Owner | Encoding | Locale Provider | Collate | Ctype | Locale | ICU Rules | Access privileges ---------+----------+----------+-----------------+------------+------------+--------+-----------+------------------- test_db | postgres | UTF8 | libc | en_US.utf8 | en_US.utf8 | | | (1 row) strcoll_l tests: root@715b19170a89:~# cat /etc/os-release PRETTY_NAME="Debian GNU/Linux 12 (bookworm)" NAME="Debian GNU/Linux" VERSION_ID="12" VERSION="12 (bookworm)" VERSION_CODENAME=bookworm ID=debian HOME_URL="https://www.debian.org/" SUPPORT_URL="https://www.debian.org/support" BUG_REPORT_URL="https://bugs.debian.org/" root@715b19170a89:~# ldd --version ldd (Debian GLIBC 2.36-9+deb12u8) 2.36 Copyright (C) 2022 Free Software Foundation, Inc. This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. Written by Roland McGrath and Ulrich Drepper. root@715b19170a89:~# cat test.c #include <stdio.h> #include <stdlib.h> #include <locale.h> #include <string.h> int main() { char str1[] = "<"; char *str2 = malloc(65536); // 65535 characters + 1 for null terminator if (!str2) return 1; memset(str2, '<', 65535); str2[65535] = '\0'; locale_t locale = newlocale(LC_COLLATE_MASK, "en_US.UTF-8", NULL); int result = strcoll_l(str1, str2, locale); printf("Comparison result: %d\n", result); freelocale(locale); free(str2); return 0; } root@715b19170a89:~# time ./test Comparison result: -1 real 0m4.487s user 0m4.483s sys 0m0.003s I'm considering switching to ICU collations, as they might handle this more efficiently. However, as I know ICU isn’t the default collation provider in PostgreSQL, and switching to it in a live environment isn’t a straightforward process. The main concern is that glibc’s default collation (en_US.UTF-8) is widely used, and this opens up the potential for a Denial of Service (DoS) attack. For instance, if user input includes long strings of repeated characters like <, it can severely degrade performance due to the extended processing time for string comparisons, especially in high-traffic scenarios. On Sun, 6 Oct 2024 at 19:39, Tom Lane <t...@sss.pgh.pa.us> wrote: > Andrey Stikheev <andrey.stikh...@gmail.com> writes: > > - Changing the collation to 'C' in the query significantly improves > > performance. > > What collation are you using, pray tell? (And what database encoding?) > > - Is this performance degradation expected due to collation handling > of > > certain special characters in PostgreSQL? > > It seems like a performance bug indeed, but not ours --- I'm thinking > it must be in strcoll() or ICU. > Trying it here (on a RHEL8 machine) with en_US.utf8 locale, I see > something similar but not quite as bad: > > u8=# SELECT 1 FROM test WHERE value = repeat('<', 65536); > ?column? > ---------- > (0 rows) > > Time: 1383.033 ms (00:01.383) > > Poking into it with gdb says that the time is indeed spent inside > strcoll: > > #0 get_next_seq (pass=1, indirect=0x7fabb4988ea8, extra=0x7fabb4984900 > "", > table=0x7fabb490e2b0, weights=<optimized out>, > rulesets=0x7fabb490e2a8 "\001\002\001\005\001\001\001\005", nrules=4, > seq=<synthetic pointer>) at strcoll_l.c:111 > #1 __GI___strcoll_l (s1=0x1785878 "<", > s2=0x178587a '<' <repeats 200 times>..., l=<optimized out>) > at strcoll_l.c:338 > #2 0x00000000009527a6 in strncoll_libc (arg1=<optimized out>, len1=1, > arg2=<optimized out>, len2=65536, locale=<optimized out>, > locale=<optimized out>) at pg_locale.c:1964 > #3 0x00000000009ac760 in varstr_cmp (arg1=0x7fabc2dcbfe9 "<", len1=1, > arg2=0x17958cc '<' <repeats 200 times>..., len2=65536, > collid=<optimized out>) at varlena.c:1567 > #4 0x00000000009acfe3 in bttextcmp (fcinfo=0x7ffddd3b0590) at > varlena.c:1820 > #5 0x00000000009d75fa in FunctionCall2Coll ( > flinfo=flinfo@entry=0x7ffddd3b10e8, collation=<optimized out>, > arg1=<optimized out>, arg2=<optimized out>) at fmgr.c:1161 > #6 0x0000000000594948 in _bt_compare (rel=0x7fabcde7eed0, > key=0x7ffddd3b10c0, > page=<optimized out>, offnum=<optimized out>) at nbtsearch.c:762 > #7 0x0000000000594e32 in _bt_binsrch (rel=rel@entry=0x7fabcde7eed0, > key=key@entry=0x7ffddd3b10c0, buf=<optimized out>) at nbtsearch.c:394 > > It's not the fault of the index machinery, because a single comparison > takes the same amount of time: > > u8=# select '<' <= repeat('<', 65536); > ?column? > ---------- > t > (1 row) > > Time: 1391.550 ms (00:01.392) > > I didn't try it with ICU. > > regards, tom lane > -- Best regards, Andrey Stikheev