Hi, Tom!

Thanks for your feedback. After looking into it further, it seems the
performance issue is indeed related to the default collation settings,
particularly when handling certain special characters like < in the glibc
strcoll_l function. This was confirmed during my testing on Debian 12  with
glibc version 2.36 (this OS and glibc are being used in our office's Docker
image: https://hub.docker.com/_/postgres).

My test database settings:

test_db=# \d+ test

                                                  Table "public.test"

 Column |         Type          | Collation | Nullable | Default | Storage  |
Compression | Stats target | Description

--------+-----------------------+-----------+----------+---------+----------+-------------+--------------+-------------

 value  | character varying(10) |           | not null |         | extended
|             |              |

Indexes:

    "idx_test" btree (value)

Access method: heap


test_db=# \l test_db

                                                 List of databases

  Name   |  Owner   | Encoding | Locale Provider |  Collate   |   Ctype    |
Locale | ICU Rules | Access privileges

---------+----------+----------+-----------------+------------+------------+--------+-----------+-------------------

 test_db | postgres | UTF8     | libc            | en_US.utf8 | en_US.utf8 |
      |           |

(1 row)


strcoll_l tests:

root@715b19170a89:~# cat /etc/os-release

PRETTY_NAME="Debian GNU/Linux 12 (bookworm)"

NAME="Debian GNU/Linux"

VERSION_ID="12"

VERSION="12 (bookworm)"

VERSION_CODENAME=bookworm

ID=debian

HOME_URL="https://www.debian.org/";

SUPPORT_URL="https://www.debian.org/support";

BUG_REPORT_URL="https://bugs.debian.org/";

root@715b19170a89:~# ldd --version

ldd (Debian GLIBC 2.36-9+deb12u8) 2.36

Copyright (C) 2022 Free Software Foundation, Inc.

This is free software; see the source for copying conditions.  There is NO

warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Written by Roland McGrath and Ulrich Drepper.

root@715b19170a89:~# cat test.c

#include <stdio.h>

#include <stdlib.h>

#include <locale.h>

#include <string.h>


int main() {

    char str1[] = "<";

    char *str2 = malloc(65536); // 65535 characters + 1 for null terminator


    if (!str2) return 1;


    memset(str2, '<', 65535);

    str2[65535] = '\0';


    locale_t locale = newlocale(LC_COLLATE_MASK, "en_US.UTF-8", NULL);

    int result = strcoll_l(str1, str2, locale);


    printf("Comparison result: %d\n", result);


    freelocale(locale);

    free(str2);


    return 0;

}


root@715b19170a89:~# time ./test

Comparison result: -1


real 0m4.487s

user 0m4.483s

sys 0m0.003s

I'm considering switching to ICU collations, as they might handle this more
efficiently. However, as I know ICU isn’t the default collation provider in
PostgreSQL, and switching to it in a live environment isn’t a
straightforward process. The main concern is that glibc’s default collation
(en_US.UTF-8) is widely used, and this opens up the potential for a Denial
of Service (DoS) attack. For instance, if user input includes long strings
of repeated characters like <, it can severely degrade performance due to
the extended processing time for string comparisons, especially in
high-traffic scenarios.

On Sun, 6 Oct 2024 at 19:39, Tom Lane <t...@sss.pgh.pa.us> wrote:

> Andrey Stikheev <andrey.stikh...@gmail.com> writes:
> >    - Changing the collation to 'C' in the query significantly improves
> >    performance.
>
> What collation are you using, pray tell?  (And what database encoding?)


> >    - Is this performance degradation expected due to collation handling
> of
> >    certain special characters in PostgreSQL?
>
> It seems like a performance bug indeed, but not ours --- I'm thinking
> it must be in strcoll() or ICU.


> Trying it here (on a RHEL8 machine) with en_US.utf8 locale, I see
> something similar but not quite as bad:
>
> u8=# SELECT 1 FROM test WHERE value = repeat('<', 65536);
>  ?column?
> ----------
> (0 rows)
>
> Time: 1383.033 ms (00:01.383)
>
> Poking into it with gdb says that the time is indeed spent inside
> strcoll:
>
> #0  get_next_seq (pass=1, indirect=0x7fabb4988ea8, extra=0x7fabb4984900
> "",
>     table=0x7fabb490e2b0, weights=<optimized out>,
>     rulesets=0x7fabb490e2a8 "\001\002\001\005\001\001\001\005", nrules=4,
>     seq=<synthetic pointer>) at strcoll_l.c:111
> #1  __GI___strcoll_l (s1=0x1785878 "<",
>     s2=0x178587a '<' <repeats 200 times>..., l=<optimized out>)
>     at strcoll_l.c:338
> #2  0x00000000009527a6 in strncoll_libc (arg1=<optimized out>, len1=1,
>     arg2=<optimized out>, len2=65536, locale=<optimized out>,
>     locale=<optimized out>) at pg_locale.c:1964
> #3  0x00000000009ac760 in varstr_cmp (arg1=0x7fabc2dcbfe9 "<", len1=1,
>     arg2=0x17958cc '<' <repeats 200 times>..., len2=65536,
>     collid=<optimized out>) at varlena.c:1567
> #4  0x00000000009acfe3 in bttextcmp (fcinfo=0x7ffddd3b0590) at
> varlena.c:1820
> #5  0x00000000009d75fa in FunctionCall2Coll (
>     flinfo=flinfo@entry=0x7ffddd3b10e8, collation=<optimized out>,
>     arg1=<optimized out>, arg2=<optimized out>) at fmgr.c:1161
> #6  0x0000000000594948 in _bt_compare (rel=0x7fabcde7eed0,
> key=0x7ffddd3b10c0,
>     page=<optimized out>, offnum=<optimized out>) at nbtsearch.c:762
> #7  0x0000000000594e32 in _bt_binsrch (rel=rel@entry=0x7fabcde7eed0,
>     key=key@entry=0x7ffddd3b10c0, buf=<optimized out>) at nbtsearch.c:394
>
> It's not the fault of the index machinery, because a single comparison
> takes the same amount of time:
>
> u8=# select '<' <= repeat('<', 65536);
>  ?column?
> ----------
>  t
> (1 row)
>
> Time: 1391.550 ms (00:01.392)
>
> I didn't try it with ICU.
>
>                         regards, tom lane
>


-- 
Best regards,
Andrey Stikheev

Reply via email to