Hi Andrey, I have tried my best to answer your queries below:
### Performance Degradation with Special Characters in PostgreSQL

#### **Explanation**:

The performance degradation you're experiencing with special characters like `<`, `@`, `#`, etc. is most likely due to how PostgreSQL handles **collations**. By default, PostgreSQL compares strings using a locale-aware collation (a libc or ICU locale such as `en_US.UTF-8`), which involves complex, multi-level character weighting rather than a plain byte comparison. Many locales treat punctuation as "variable" characters with little or no weight at the primary sort level, so strings consisting entirely of such characters force the comparison routine into extra work, and that cost is paid on every comparison the index scan performs. This matches your symptoms: long strings of special characters are slow under the default collation but fast under `COLLATE "C"`.

#### **Key Observations**:

1. **Alphabetic vs. special characters**:
   - Queries comparing against strings of alphabetic characters complete in milliseconds, while the same queries with strings of special characters take seconds.
   - The difference comes from how the collation weights the two classes of characters: punctuation is considerably more expensive to compare under a locale-aware collation.

2. **Impact of collation**:
   - When you switched the collation to **"C"**, performance improved substantially. This is because the `"C"` collation uses **byte-wise** comparisons rather than locale-aware sorting, which removes the expensive comparison logic entirely.

3. **String length impact**:
   - Your own timings grow roughly quadratically with string length: 77 ms at 8192 characters, 325 ms at 16384, 1155 ms at 32768, and 4490 ms at 65536, i.e. each doubling of the length roughly quadruples the execution time. The per-comparison cost under the default collation is clearly superlinear in string length.
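Before reaching for any of the workarounds below, it is worth confirming which collation is actually in effect at the database and column level. Here is a quick check using standard catalog and `information_schema` queries (the table name `test` is taken from your example):

```sql
-- Database-level default collation:
SELECT datcollate, datctype
FROM pg_database
WHERE datname = current_database();

-- Per-column overrides (a NULL collation_name means the database default applies):
SELECT column_name, collation_name
FROM information_schema.columns
WHERE table_name = 'test';
```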
#### **Recommendations**:

1. **Use the "C" collation**:
   - The most effective improvement you observed came from the `"C"` collation, which performs simple byte-by-byte comparisons. You can apply it at the query level or alter the column definition:
   ```sql
   SELECT 1 FROM test WHERE value = repeat('<', 65536) COLLATE "C";
   ```
   or
   ```sql
   ALTER TABLE test ALTER COLUMN value SET DATA TYPE VARCHAR(10) COLLATE "C";
   ```

2. **Index with an explicit collation**:
   - Alternatively, keep the column's collation and create an additional index built under `"C"`. PostgreSQL will consider this index whenever the query's comparison is itself performed under `COLLATE "C"`, as in the query above:
   ```sql
   CREATE INDEX idx_test_special_chars ON test (value COLLATE "C");
   ```

3. **Reduce string length**:
   - If possible, shorten the strings involved in comparisons; the cost grows superlinearly with length under a locale-aware collation. In your case the column is `VARCHAR(10)`, so an equality test against a 65536-character literal can never match anyway; validating or truncating the parameter on the application side sidesteps the comparison cost entirely.

4. **Use functional indexes**:
   - An index on an expression lets you normalize values (e.g., trimming special characters or converting to ASCII) before comparison. The query must then compare on exactly the same expression for the index to be used; `lower()` below is only illustrative:
   ```sql
   CREATE INDEX idx_test_transformed ON test (lower(value));
   ```

5. **Optimize repeated queries**:
   - For repeated queries involving long strings with special characters, consider caching results or pre-computing a byte-wise-comparable copy of the data, e.g. in a materialized view, so the collation overhead is paid once (see the materialized-view sketch after this list).

6. **Custom collations**:
   - If you need locale-aware sorting but can compromise on how punctuation is treated, consider creating a **custom collation** that simplifies special-character handling (see the `CREATE COLLATION` sketch after this list).
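To make item 5 concrete, here is a minimal sketch of the pre-computing idea: materialize the column once under the `"C"` collation, index it, and point the hot equality lookups at the copy. The object names `test_c` and `idx_test_c` are hypothetical, and the view must be refreshed whenever `test` changes:

```sql
-- One-time pre-computation: store the values as a "C"-collated column.
CREATE MATERIALIZED VIEW test_c AS
SELECT value COLLATE "C" AS value
FROM test;

-- The index is built under the column's byte-wise "C" collation.
CREATE INDEX idx_test_c ON test_c (value);

-- Equality lookups against the copy use cheap byte-wise comparisons.
SELECT 1 FROM test_c WHERE value = repeat('<', 65536) COLLATE "C";

-- Keep the copy in sync after the base table changes.
REFRESH MATERIALIZED VIEW test_c;
```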
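For item 6, if your build includes ICU (the Debian-based postgres:17 image does), you can define a collation with ICU's "alternate = shifted" setting so that punctuation is shifted to the lowest comparison level. Be aware this is a nondeterministic collation and genuinely changes semantics: strings that differ only in punctuation compare equal, so it only fits workloads where that is acceptable. The name `ignore_punctuation` is arbitrary:

```sql
-- Requires PostgreSQL 12+ built with ICU support.
-- 'und-u-ka-shifted': root locale with punctuation treated as
-- ignorable ("shifted") at the primary comparison levels.
CREATE COLLATION ignore_punctuation (
    provider = icu,
    locale = 'und-u-ka-shifted',
    deterministic = false
);
```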
#### **Answers to Your Questions**:

1. **Is the performance degradation expected due to collation handling of certain special characters?**
   - Yes, this behavior is expected. Locale-aware collations do substantially more work for special characters, leading to longer comparison times, and the cost grows quickly with string length.

2. **Are there any recommendations to improve performance without changing the column or database collation?**
   - Without changing the collation globally, you can use:
   - **Query-level collation adjustments** (adding `COLLATE "C"` to specific comparisons).
   - **Indexes built with the `"C"` collation**.
   - **Functional indexes**, or optimizations such as materialized views.

On Sun, Oct 6, 2024 at 3:53 PM Andrey Stikheev <andrey.stikh...@gmail.com> wrote:

> Dear PostgreSQL Community,
>
> I am facing significant performance issues when executing queries that
> involve string comparisons with special characters, such as <, #, !, @,
> etc., especially when dealing with long strings. The query execution time
> increases drastically when these characters are used, whereas queries with
> alphabetic characters do not show such degradation. This behavior is
> observed both on macOS (using the official postgres:17 image via Docker)
> and on an Ubuntu 20.04 server running PostgreSQL in an LXC container.
>
> Here is a minimal example:
>
> testdb=# SELECT version();
>                                                            version
> ---------------------------------------------------------------------------------------------------------------------------
>  PostgreSQL 17.0 (Debian 17.0-1.pgdg120+1) on aarch64-unknown-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit
> (1 row)
>
> testdb=# CREATE TABLE test (value VARCHAR(10) NOT NULL);
> CREATE TABLE
> Time: 3.562 ms
>
> testdb=# CREATE INDEX idx_test ON test (value);
> CREATE INDEX
> Time: 3.080 ms
>
> testdb=# INSERT INTO test (value) VALUES ('<');
> INSERT 0 1
> Time: 3.365 ms
>
> testdb=# SELECT 1 FROM test WHERE value = repeat('<', 65536);
>  ?column?
> ----------
> (0 rows)
> Time: 4454.535 ms (00:04.455)
>
> testdb=# SELECT 1 FROM test WHERE value = repeat('a', 65536);
>  ?column?
> ----------
> (0 rows)
> Time: 3.772 ms
>
> testdb=# SELECT 1 FROM test WHERE value = repeat('<', 65536) || 'a';
>  ?column?
> ----------
> (0 rows)
> Time: 4.352 ms
>
> Time: 9.503 ms
>
> testdb=# SELECT 1 FROM test WHERE value = repeat('<', 65536) COLLATE "C";
>  ?column?
> ----------
> (0 rows)
> Time: 3.299 ms
>
> testdb=# SELECT 1 FROM test WHERE value = repeat('@', 8192);
>  ?column?
> ----------
> (0 rows)
> Time: 77.171 ms
>
> testdb=# SELECT 1 FROM test WHERE value = repeat('@', 16384);
>  ?column?
> ----------
> (0 rows)
> Time: 325.190 ms
>
> testdb=# SELECT 1 FROM test WHERE value = repeat('@', 32768);
>  ?column?
> ----------
> (0 rows)
> Time: 1154.850 ms (00:01.155)
>
> testdb=# SELECT 1 FROM test WHERE value = repeat('@', 65536);
>  ?column?
> ----------
> (0 rows)
> Time: 4490.206 ms (00:04.490)
>
> testdb=# explain (analyze, verbose, buffers, costs, settings, timing, wal)
> SELECT 1 FROM test WHERE value = repeat('<', 65000);
> ---------------------------------------------------------------------------------------------
>  Bitmap Heap Scan on public.test  (cost=4.20..13.67 rows=6 width=4) (actual time=4425.459..4425.459 rows=0 loops=1)
>    Output: 1
>    Recheck Cond: ((test.value)::text = '<<<<<<<<<<<<<<<<<<<<<<<<<<<< ... a lot symbols ... <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<'::text)
>    Buffers: shared hit=1
>    ->  Bitmap Index Scan on idx_test  (cost=0.00..4.20 rows=6 width=0) (actual time=4425.432..4425.432 rows=0 loops=1)
>          Index Cond: ((test.value)::text = '<<<<<<<<<<<<<<<<<<<<<<<<<<<< ... a lot symbols ...<<<<<<<<<<<<<<<<<<<<<<<<<<'::text)
>          Buffers: shared hit=1
>  Planning Time: 1.082 ms
>  Execution Time: 4425.602 ms
> (9 rows)
>
> Time: 4433.001 ms (00:04.433)
>
> Observations:
>
>    - The performance degradation occurs with certain special characters
>    like <, !, >, @, #, ... .
>    - Queries using alphabetic characters or appending/prepending
>    characters execute much faster.
>    - The execution time increases exponentially with the length of the
>    string composed of special characters.
>    - Changing the collation to 'C' in the query significantly improves
>    performance.
>
> Questions:
>
>    - Is this performance degradation expected due to collation handling
>    of certain special characters in PostgreSQL?
>    - Are there any recommendations to improve performance without
>    changing the column or database collation?
>
> --
> Best regards,
> Andrey Stikheev

--
Best Regards,
Shiv Iyer