Re: Performance degradation in Index searches with special characters

Andrey Stikheev Sun, 06 Oct 2024 10:39:13 -0700

Hi, Tom!

Thanks for your feedback. After looking into it further, it seems the
performance issue is indeed related to the default collation settings,
particularly when handling certain special characters like < in the glibc
strcoll_l function. This was confirmed during my testing on Debian 12  with
glibc version 2.36 (this OS and glibc are being used in our office's Docker
image: https://hub.docker.com/_/postgres).


My test database settings:

test_db=# \d+ test

                                                  Table "public.test"

 Column |         Type          | Collation | Nullable | Default | Storage  |
Compression | Stats target | Description

--------+-----------------------+-----------+----------+---------+----------+-------------+--------------+-------------

 value  | character varying(10) |           | not null |         | extended
|             |              |

Indexes:

    "idx_test" btree (value)

Access method: heap


test_db=# \l test_db

                                                 List of databases

  Name   |  Owner   | Encoding | Locale Provider |  Collate   |   Ctype    |
Locale | ICU Rules | Access privileges

---------+----------+----------+-----------------+------------+------------+--------+-----------+-------------------

 test_db | postgres | UTF8     | libc            | en_US.utf8 | en_US.utf8 |
      |           |

(1 row)


strcoll_l tests:

root@715b19170a89:~# cat /etc/os-release

PRETTY_NAME="Debian GNU/Linux 12 (bookworm)"

NAME="Debian GNU/Linux"

VERSION_ID="12"

VERSION="12 (bookworm)"

VERSION_CODENAME=bookworm

ID=debian

HOME_URL="https://www.debian.org/";

SUPPORT_URL="https://www.debian.org/support";

BUG_REPORT_URL="https://bugs.debian.org/";

root@715b19170a89:~# ldd --version

ldd (Debian GLIBC 2.36-9+deb12u8) 2.36

Copyright (C) 2022 Free Software Foundation, Inc.

This is free software; see the source for copying conditions.  There is NO

warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Written by Roland McGrath and Ulrich Drepper.

root@715b19170a89:~# cat test.c

#include <stdio.h>

#include <stdlib.h>

#include <locale.h>

#include <string.h>


int main() {

    char str1[] = "<";

    char *str2 = malloc(65536); // 65535 characters + 1 for null terminator


    if (!str2) return 1;


    memset(str2, '<', 65535);

    str2[65535] = '\0';


    locale_t locale = newlocale(LC_COLLATE_MASK, "en_US.UTF-8", NULL);

    int result = strcoll_l(str1, str2, locale);


    printf("Comparison result: %d\n", result);


    freelocale(locale);

    free(str2);


    return 0;

}


root@715b19170a89:~# time ./test

Comparison result: -1


real 0m4.487s

user 0m4.483s

sys 0m0.003s

I'm considering switching to ICU collations, as they might handle this more
efficiently. However, as I know ICU isn’t the default collation provider in
PostgreSQL, and switching to it in a live environment isn’t a
straightforward process. The main concern is that glibc’s default collation
(en_US.UTF-8) is widely used, and this opens up the potential for a Denial
of Service (DoS) attack. For instance, if user input includes long strings
of repeated characters like <, it can severely degrade performance due to
the extended processing time for string comparisons, especially in
high-traffic scenarios.

On Sun, 6 Oct 2024 at 19:39, Tom Lane <[email protected]> wrote:

> Andrey Stikheev <[email protected]> writes:
> >    - Changing the collation to 'C' in the query significantly improves
> >    performance.
>
> What collation are you using, pray tell?  (And what database encoding?)


> >    - Is this performance degradation expected due to collation handling
> of
> >    certain special characters in PostgreSQL?
>
> It seems like a performance bug indeed, but not ours --- I'm thinking
> it must be in strcoll() or ICU.


> Trying it here (on a RHEL8 machine) with en_US.utf8 locale, I see
> something similar but not quite as bad:
>
> u8=# SELECT 1 FROM test WHERE value = repeat('<', 65536);
>  ?column?
> ----------
> (0 rows)
>
> Time: 1383.033 ms (00:01.383)
>
> Poking into it with gdb says that the time is indeed spent inside
> strcoll:
>
> #0  get_next_seq (pass=1, indirect=0x7fabb4988ea8, extra=0x7fabb4984900
> "",
>     table=0x7fabb490e2b0, weights=<optimized out>,
>     rulesets=0x7fabb490e2a8 "\001\002\001\005\001\001\001\005", nrules=4,
>     seq=<synthetic pointer>) at strcoll_l.c:111
> #1  __GI___strcoll_l (s1=0x1785878 "<",
>     s2=0x178587a '<' <repeats 200 times>..., l=<optimized out>)
>     at strcoll_l.c:338
> #2  0x00000000009527a6 in strncoll_libc (arg1=<optimized out>, len1=1,
>     arg2=<optimized out>, len2=65536, locale=<optimized out>,
>     locale=<optimized out>) at pg_locale.c:1964
> #3  0x00000000009ac760 in varstr_cmp (arg1=0x7fabc2dcbfe9 "<", len1=1,
>     arg2=0x17958cc '<' <repeats 200 times>..., len2=65536,
>     collid=<optimized out>) at varlena.c:1567
> #4  0x00000000009acfe3 in bttextcmp (fcinfo=0x7ffddd3b0590) at
> varlena.c:1820
> #5  0x00000000009d75fa in FunctionCall2Coll (
>     flinfo=flinfo@entry=0x7ffddd3b10e8, collation=<optimized out>,
>     arg1=<optimized out>, arg2=<optimized out>) at fmgr.c:1161
> #6  0x0000000000594948 in _bt_compare (rel=0x7fabcde7eed0,
> key=0x7ffddd3b10c0,
>     page=<optimized out>, offnum=<optimized out>) at nbtsearch.c:762
> #7  0x0000000000594e32 in _bt_binsrch (rel=rel@entry=0x7fabcde7eed0,
>     key=key@entry=0x7ffddd3b10c0, buf=<optimized out>) at nbtsearch.c:394
>
> It's not the fault of the index machinery, because a single comparison
> takes the same amount of time:
>
> u8=# select '<' <= repeat('<', 65536);
>  ?column?
> ----------
>  t
> (1 row)
>
> Time: 1391.550 ms (00:01.392)
>
> I didn't try it with ICU.
>
>                         regards, tom lane
>


-- 
Best regards,
Andrey Stikheev

Re: Performance degradation in Index searches with special characters

Reply via email to