Hi,
After the release of the PostgreSQL 18 version, I am trying non
determisinstic collation with LIKE pattern matching support. I am mostly
searching with "LIKE %search_term%" on about 10 text columns. As I use
wildcard prefix and suffix, I can't use btree index anyways. So I decided
to try non deterministic collation support so I can simplify application
code. I am testing this on a table with ~60K rows. With this row count and
search pattern, non deterministic collation seems at least 10 times slower
than LOWER LIKE and ILIKE. Tested collations are the same ICU tr-TR
collation with standard and custom based on same ICU tr-TR:
40960 | test3e | 2200 | 10 | i |
f | -1 | | |
tr-TR-u-ks-level2 | | 153.128.46
Version:
etukimlik_admin=# select version();
version
--------------------------------------------------------------------------------------------------------------------
PostgreSQL 18.1 (Debian 18.1-1.pgdg13+2) on x86_64-pc-linux-gnu, compiled
by gcc (Debian 14.2.0-19) 14.2.0, 64-bit
(1 row)
Table structure with ICU deterministic collation:
etukimlik_admin=# \d "AccountActivityLogEntries"
Table
"public.AccountActivityLogEntries"
Column | Type | Collation |
Nullable | Default
--------------------------+--------------------------+-------------+----------+--------------------------------------------------------------------------
Id | integer | | not
null | generated by default as identity
Email | character varying(128) | tr-TR-x-icu | not
null |
LdapUid | character varying(512) | tr-TR-x-icu |
|
Identity | character varying(24) | tr-TR-x-icu |
|
StudentId | text[] | |
|
EmployeeId | character varying(24) | tr-TR-x-icu |
|
Name | character varying(1024) | tr-TR-x-icu |
|
Surname | character varying(1024) | tr-TR-x-icu |
|
AccountType | text[] | | not
null |
Description | character varying(4096) | tr-TR-x-icu | not
null |
ActivityId | character varying(128) | tr-TR-x-icu | not
null |
LogTypeCode | character varying(10) | tr-TR-x-icu | not
null |
LogType | character varying(64) | tr-TR-x-icu | not
null |
LogSubType | character varying(64) | tr-TR-x-icu | not
null |
ActivityTime | timestamp with time zone | |
|
ActivitySource | character varying(128) | tr-TR-x-icu | not
null |
ChangedAttribute | character varying(128) | tr-TR-x-icu |
|
ChangedAttributeOldValue | character varying(4096) | tr-TR-x-icu |
|
ChangedAttributeNewValue | character varying(4096) | tr-TR-x-icu |
|
DisplayName | character varying(2049) | tr-TR-x-icu |
| generated always as (((("Name"::text || ' '::text) ||
"Surname"::text)))
Indexes:
"PK_AccountActivityLogEntries" PRIMARY KEY, btree ("Id")
LOWER LIKE query test:
etukimlik_admin=# explain analyze select * FROM "AccountActivityLogEntries"
where lower("Email") like '%duran%' or lower("LdapUid") like '%duran%' or
lower("Identity") like '%duran%' or lower("Name") like '%duran%' or
lower("Surname") like '%duran%' or lower("DisplayName") like '%duran%' or
lower("Description") like '%duran%' or lower("EmployeeId") like '%duran%'
or exists ( select from unnest("StudentId") AS value WHERE lower(value)
LIKE '%duran%') or lower("ActivityId") like '%duran%';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Seq Scan on "AccountActivityLogEntries" (cost=0.00..16387.18 rows=31210
width=5251) (actual time=1.952..282.090 rows=102.00 loops=1)
Filter: ((lower(("Email")::text) ~~ '%duran%'::text) OR
(lower(("LdapUid")::text) ~~ '%duran%'::text) OR (lower(("Identity")::text)
~~ '%duran%'::text) OR (lower(("Name")::text) ~~ '%duran%'::text) OR
(lower(("Surname")::text) ~~ '%duran%'::text) OR (lower((((("Name")::text
|| ' '::text) || ("Surname")::text))::text) ~~ '%duran%'::text) OR
(lower(("Description")::text) ~~ '%duran%'::text) OR
(lower(("EmployeeId")::text) ~~ '%duran%'::text) OR EXISTS(SubPlan 1) OR
(lower(("ActivityId")::text) ~~ '%duran%'::text))
Rows Removed by Filter: 61438
Buffers: shared hit=2922
SubPlan 1
-> Function Scan on unnest value (cost=0.00..0.15 rows=1 width=0)
(actual time=0.001..0.001 rows=0.00 loops=61438)
Filter: (lower(value) ~~ '%duran%'::text)
Rows Removed by Filter: 1
Planning Time: 0.165 ms
Execution Time: 282.143 ms
(10 rows)
ILIKE query test:
explain analyze select * FROM "AccountActivityLogEntries" where "Email"
ilike '%duran%' or "LdapUid" ilike '%duran%' or "Identity" ilike '%duran%'
or "Name" ilike '%duran%' or "Surname" ilike '%duran%' or "DisplayName"
ilike '%duran%' or "Description" ilike '%duran%' or "EmployeeId" ilike
'%duran%' or exists ( select from unnest("StudentId") AS value WHERE value
iLIKE '%duran%') or "ActivityId" ilike '%duran%';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Seq Scan on "AccountActivityLogEntries" (cost=0.00..13464.03 rows=31210
width=5251) (actual time=3.782..417.573 rows=102.00 loops=1)
Filter: ((("Email")::text ~~* '%duran%'::text) OR (("LdapUid")::text ~~*
'%duran%'::text) OR (("Identity")::text ~~* '%duran%'::text) OR
(("Name")::text ~~* '%duran%'::text) OR (("Surname")::text ~~*
'%duran%'::text) OR ((((("Name")::text || ' '::text) ||
("Surname")::text))::text ~~* '%duran%'::text) OR (("Description")::text
~~* '%duran%'::text) OR (("EmployeeId")::text ~~* '%duran%'::text) OR
EXISTS(SubPlan 1) OR (("ActivityId")::text ~~* '%duran%'::text))
Rows Removed by Filter: 61438
Buffers: shared hit=2922
SubPlan 1
-> Function Scan on unnest value (cost=0.00..0.13 rows=1 width=0)
(actual time=0.001..0.001 rows=0.00 loops=61438)
Filter: (value ~~* '%duran%'::text)
Rows Removed by Filter: 1
Planning Time: 0.160 ms
Execution Time: 417.652 ms
Table structure with ICU non deterministic collation:
etukimlik_admin=# \d "AccountActivityLogEntries"
Table
"public.AccountActivityLogEntries"
Column | Type | Collation | Nullable
| Default
--------------------------+--------------------------+-----------+----------+--------------------------------------------------------------------------
Id | integer | | not null
| generated by default as identity
Email | character varying(128) | test3e | not null
|
LdapUid | character varying(512) | test3e |
|
Identity | character varying(24) | test3e |
|
StudentId | text[] | |
|
EmployeeId | character varying(24) | test3e |
|
Name | character varying(1024) | test3e |
|
Surname | character varying(1024) | test3e |
|
AccountType | text[] | | not null
|
Description | character varying(4096) | test3e | not null
|
ActivityId | character varying(128) | test3e | not null
|
LogTypeCode | character varying(10) | test3e | not null
|
LogType | character varying(64) | test3e | not null
|
LogSubType | character varying(64) | test3e | not null
|
ActivityTime | timestamp with time zone | |
|
ActivitySource | character varying(128) | test3e | not null
|
ChangedAttribute | character varying(128) | test3e |
|
ChangedAttributeOldValue | character varying(4096) | test3e |
|
ChangedAttributeNewValue | character varying(4096) | test3e |
|
DisplayName | character varying(2049) | test3e |
| generated always as (((("Name"::text || ' '::text) || "Surname"::text)))
Indexes:
"PK_AccountActivityLogEntries" PRIMARY KEY, btree ("Id")
LIKE query test:
etukimlik_admin=# explain analyze select * FROM "AccountActivityLogEntries"
where "Email" like '%duran%' or "LdapUid" like '%duran%' or "Identity" like
'%duran%' or "Name" like '%duran%' or "Surname" like '%duran%' or
"DisplayName" like '%duran%' or "Description" like '%duran%' or
"EmployeeId" like '%duran%' or exists ( select from unnest("StudentId") AS
value WHERE value LIKE '%duran%') or "ActivityId" like '%duran%';
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Seq Scan on "AccountActivityLogEntries" (cost=0.00..13464.03 rows=31210
width=5251) (actual time=18.466..2338.605 rows=102.00 loops=1)
Filter: ((("Email")::text ~~ '%duran%'::text) OR (("LdapUid")::text ~~
'%duran%'::text) OR (("Identity")::text ~~ '%duran%'::text) OR
(("Name")::text ~~ '%duran%'::text) OR (("Surname")::text ~~
'%duran%'::text) OR ((((("Name")::text || ' '::text) ||
("Surname")::text))::text ~~ '%duran%'::text) OR (("Description")::text ~~
'%duran%'::text) OR (("EmployeeId")::text ~~ '%duran%'::text) OR
EXISTS(SubPlan 1) OR (("ActivityId")::text ~~ '%duran%'::text))
Rows Removed by Filter: 61438
Buffers: shared hit=2922
SubPlan 1
-> Function Scan on unnest value (cost=0.00..0.13 rows=1 width=0)
(actual time=0.001..0.001 rows=0.00 loops=61438)
Filter: (value ~~ '%duran%'::text)
Rows Removed by Filter: 1
Planning:
Buffers: shared hit=4
Planning Time: 0.176 ms
Execution Time: 2338.670 ms
As you can see, setting deterministic false to the same ICU collation and
querying with "LIKE %search_term%" is way slower than using lower()
function. So why is it this slow? Is there any technical documentation that
explains this behaviour?
Regards,
Rahman Duran