Hi Andrey, I have tried my best to answer your queries below:
### Performance Degradation with Special Characters in PostgreSQL

#### **Explanation**:

The performance degradation you're experiencing with special characters like `<`, `@`, `#`, etc. is most likely due to how PostgreSQL handles **collations**. By default, PostgreSQL compares strings using a locale-aware collation (a libc or ICU locale such as `en_US.UTF-8`), which involves complex, multi-level character weighting rather than a plain byte comparison. Many locales treat punctuation as "variable" characters with little or no weight at the primary sort level, so strings consisting entirely of such characters force the comparison routine into extra work, and that cost is paid on every comparison the index scan performs. This matches your symptoms: long strings of special characters are slow under the default collation but fast under `COLLATE "C"`.

#### **Key Observations**:

1. **Alphabetic vs. special characters**:
   - Queries comparing against strings of alphabetic characters complete in milliseconds, while the same queries with strings of special characters take seconds.
   - The difference comes from how the collation weights the two classes of characters: punctuation is considerably more expensive to compare under a locale-aware collation.

2. **Impact of collation**:
   - When you switched the collation to **"C"**, performance improved substantially. This is because the `"C"` collation uses **byte-wise** comparisons rather than locale-aware sorting, which removes the expensive comparison logic entirely.

3. **String length impact**:
   - Your own timings grow roughly quadratically with string length: 77 ms at 8192 characters, 325 ms at 16384, 1155 ms at 32768, and 4490 ms at 65536, i.e. each doubling of the length roughly quadruples the execution time. The per-comparison cost under the default collation is clearly superlinear in string length.
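Before reaching for any of the workarounds below, it is worth confirming which collation is actually in effect at the database and column level. Here is a quick check using standard catalog and `information_schema` queries (the table name `test` is taken from your example):

```sql
-- Database-level default collation:
SELECT datcollate, datctype
FROM pg_database
WHERE datname = current_database();

-- Per-column overrides (a NULL collation_name means the database default applies):
SELECT column_name, collation_name
FROM information_schema.columns
WHERE table_name = 'test';
```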
#### **Recommendations**:

1. **Use the "C" collation**:
   - The most effective improvement you observed came from the `"C"` collation, which performs simple byte-by-byte comparisons. You can apply it at the query level or alter the column definition:
   ```sql
   SELECT 1 FROM test WHERE value = repeat('<', 65536) COLLATE "C";
   ```
   or
   ```sql
   ALTER TABLE test ALTER COLUMN value SET DATA TYPE VARCHAR(10) COLLATE "C";
   ```

2. **Index with an explicit collation**:
   - Alternatively, keep the column's collation and create an additional index built under `"C"`. PostgreSQL will consider this index whenever the query's comparison is itself performed under `COLLATE "C"`, as in the query above:
   ```sql
   CREATE INDEX idx_test_special_chars ON test (value COLLATE "C");
   ```

3. **Reduce string length**:
   - If possible, shorten the strings involved in comparisons; the cost grows superlinearly with length under a locale-aware collation. In your case the column is `VARCHAR(10)`, so an equality test against a 65536-character literal can never match anyway; validating or truncating the parameter on the application side sidesteps the comparison cost entirely.

4. **Use functional indexes**:
   - An index on an expression lets you normalize values (e.g., trimming special characters or converting to ASCII) before comparison. The query must then compare on exactly the same expression for the index to be used; `lower()` below is only illustrative:
   ```sql
   CREATE INDEX idx_test_transformed ON test (lower(value));
   ```

5. **Optimize repeated queries**:
   - For repeated queries involving long strings with special characters, consider caching results or pre-computing a byte-wise-comparable copy of the data, e.g. in a materialized view, so the collation overhead is paid once (see the materialized-view sketch after this list).

6. **Custom collations**:
   - If you need locale-aware sorting but can compromise on how punctuation is treated, consider creating a **custom collation** that simplifies special-character handling (see the `CREATE COLLATION` sketch after this list).
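To make item 5 concrete, here is a minimal sketch of the pre-computing idea: materialize the column once under the `"C"` collation, index it, and point the hot equality lookups at the copy. The object names `test_c` and `idx_test_c` are hypothetical, and the view must be refreshed whenever `test` changes:

```sql
-- One-time pre-computation: store the values as a "C"-collated column.
CREATE MATERIALIZED VIEW test_c AS
SELECT value COLLATE "C" AS value
FROM test;

-- The index is built under the column's byte-wise "C" collation.
CREATE INDEX idx_test_c ON test_c (value);

-- Equality lookups against the copy use cheap byte-wise comparisons.
SELECT 1 FROM test_c WHERE value = repeat('<', 65536) COLLATE "C";

-- Keep the copy in sync after the base table changes.
REFRESH MATERIALIZED VIEW test_c;
```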
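For item 6, if your build includes ICU (the Debian-based postgres:17 image does), you can define a collation with ICU's "alternate = shifted" setting so that punctuation is shifted to the lowest comparison level. Be aware this is a nondeterministic collation and genuinely changes semantics: strings that differ only in punctuation compare equal, so it only fits workloads where that is acceptable. The name `ignore_punctuation` is arbitrary:

```sql
-- Requires PostgreSQL 12+ built with ICU support.
-- 'und-u-ka-shifted': root locale with punctuation treated as
-- ignorable ("shifted") at the primary comparison levels.
CREATE COLLATION ignore_punctuation (
    provider = icu,
    locale = 'und-u-ka-shifted',
    deterministic = false
);
```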
#### **Answers to Your Questions**:

1. **Is the performance degradation expected due to collation handling of certain special characters?**
   - Yes, this behavior is expected. Locale-aware collations do substantially more work for special characters, leading to longer comparison times, and the cost grows quickly with string length.

2. **Are there any recommendations to improve performance without changing the column or database collation?**
   - Without changing the collation globally, you can use:
   - **Query-level collation adjustments** (adding `COLLATE "C"` to specific comparisons).
   - **Indexes built with the `"C"` collation**.
   - **Functional indexes**, or optimizations such as materialized views.

On Sun, Oct 6, 2024 at 3:53 PM Andrey Stikheev <andrey.stikh...@gmail.com> wrote:

> Dear PostgreSQL Community,
>
> I am facing significant performance issues when executing queries that
> involve string comparisons with special characters, such as <, #, !, @,
> etc., especially when dealing with long strings. The query execution time
> increases drastically when these characters are used, whereas queries with
> alphabetic characters do not show such degradation. This behavior is
> observed both on macOS (using the official postgres:17 image via Docker)
> and on an Ubuntu 20.04 server running PostgreSQL in an LXC container.
>
> Here is a minimal example:
>
> testdb=# SELECT version();
>                                                            version
> ---------------------------------------------------------------------------------------------------------------------------
>  PostgreSQL 17.0 (Debian 17.0-1.pgdg120+1) on aarch64-unknown-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit
> (1 row)
>
> testdb=# CREATE TABLE test (value VARCHAR(10) NOT NULL);
> CREATE TABLE
> Time: 3.562 ms
>
> testdb=# CREATE INDEX idx_test ON test (value);
> CREATE INDEX
> Time: 3.080 ms
>
> testdb=# INSERT INTO test (value) VALUES ('<');
> INSERT 0 1
> Time: 3.365 ms
>
> testdb=# SELECT 1 FROM test WHERE value = repeat('<', 65536);
>  ?column?
> ----------
> (0 rows)
> Time: 4454.535 ms (00:04.455)
>
> testdb=# SELECT 1 FROM test WHERE value = repeat('a', 65536);
>  ?column?
> ----------
> (0 rows)
> Time: 3.772 ms
>
> testdb=# SELECT 1 FROM test WHERE value = repeat('<', 65536) || 'a';
>  ?column?
> ----------
> (0 rows)
> Time: 4.352 ms
>
> Time: 9.503 ms
>
> testdb=# SELECT 1 FROM test WHERE value = repeat('<', 65536) COLLATE "C";
>  ?column?
> ----------
> (0 rows)
> Time: 3.299 ms
>
> testdb=# SELECT 1 FROM test WHERE value = repeat('@', 8192);
>  ?column?
> ----------
> (0 rows)
> Time: 77.171 ms
>
> testdb=# SELECT 1 FROM test WHERE value = repeat('@', 16384);
>  ?column?
> ----------
> (0 rows)
> Time: 325.190 ms
>
> testdb=# SELECT 1 FROM test WHERE value = repeat('@', 32768);
>  ?column?
> ----------
> (0 rows)
> Time: 1154.850 ms (00:01.155)
>
> testdb=# SELECT 1 FROM test WHERE value = repeat('@', 65536);
>  ?column?
> ----------
> (0 rows)
> Time: 4490.206 ms (00:04.490)
>
> testdb=# explain (analyze, verbose, buffers, costs, settings, timing, wal)
> SELECT 1 FROM test WHERE value = repeat('<', 65000);
> ---------------------------------------------------------------------------------------------
>  Bitmap Heap Scan on public.test  (cost=4.20..13.67 rows=6 width=4) (actual time=4425.459..4425.459 rows=0 loops=1)
>    Output: 1
>    Recheck Cond: ((test.value)::text = '<<<<<<<<<<<<<<<<<<<<<<<<<<<< ... a lot symbols ... <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<'::text)
>    Buffers: shared hit=1
>    ->  Bitmap Index Scan on idx_test  (cost=0.00..4.20 rows=6 width=0) (actual time=4425.432..4425.432 rows=0 loops=1)
>          Index Cond: ((test.value)::text = '<<<<<<<<<<<<<<<<<<<<<<<<<<<< ... a lot symbols ...<<<<<<<<<<<<<<<<<<<<<<<<<<'::text)
>          Buffers: shared hit=1
>  Planning Time: 1.082 ms
>  Execution Time: 4425.602 ms
> (9 rows)
>
> Time: 4433.001 ms (00:04.433)
>
> Observations:
>
>    - The performance degradation occurs with certain special characters
>    like <, !, >, @, #, ... .
>    - Queries using alphabetic characters or appending/prepending
>    characters execute much faster.
>    - The execution time increases exponentially with the length of the
>    string composed of special characters.
>    - Changing the collation to 'C' in the query significantly improves
>    performance.
>
> Questions:
>
>    - Is this performance degradation expected due to collation handling
>    of certain special characters in PostgreSQL?
>    - Are there any recommendations to improve performance without
>    changing the column or database collation?
>
> --
> Best regards,
> Andrey Stikheev

--
Best Regards,
Shiv Iyer