#32803: Not obvious interaction between Postgresql trgm index and icontains 
lookup
-------------------------------------+-------------------------------------
               Reporter:  Dmitri     |          Owner:  (none)
  Emelianov                          |
                   Type:             |         Status:  new
  Uncategorized                      |
              Component:             |        Version:  dev
  contrib.postgres                   |       Keywords:  postgresql
               Severity:  Normal     |  icontains lookup search
           Triage Stage:             |      Has patch:  0
  Unreviewed                         |
    Needs documentation:  0          |    Needs tests:  0
Patch needs improvement:  0          |  Easy pickings:  0
                  UI/UX:  0          |
-------------------------------------+-------------------------------------
 Hi! Long post ahead.

 tldr; standard lookup is not working with standard index as expected

 Using LIKE UPPER in icontains is not an accident, it is even stated that
 it's
 
[https://github.com/django/django/blob/ca9872905559026af82000e46cde6f7dedc897b6/django/db/backends/postgresql/operations.py#L102
 faster]

 {{{
 # Use UPPER(x) for case-insensitive lookups; it's faster.
 if lookup_type in ('iexact', 'icontains', 'istartswith', 'iendswith'):
     lookup = 'UPPER(%s)' % lookup
 }}}

 However, the main reason I've created the ticket is not the speed, but
 usage icontains together with trgm index.

 Let's create a simple test model:

 {{{
 class Book(models.Model):
     name = models.TextField()
 }}}

 and prefil it with kind of random data.

 Random function (idea taken from
 https://stackoverflow.com/a/5478750/1185696):
 {{{
 CREATE OR REPLACE FUNCTION get_random_name() RETURNS text as $$
 SELECT array_to_string(array
        (
               select
 substr('abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789',
 trunc(random() * 62)::integer + 1, 1)
               FROM   generate_series(1, (random() * 15)::integer)), '') ||
 ' ' ||  array_to_string(array
        (
               select
 substr('abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789',
 trunc(random() * 62)::integer + 1, 1)
               from generate_series(1, (random() * 15)::integer)), '');
 $$
 VOLATILE
 LANGUAGE SQL;

 -- Example output: M2Fib gJr4CHLnZHcr
 -- I want data to be kind of "real" if it is could be said about random
 string
 }}}

 ->
 {{{
 INSERT INTO book_book(name) SELECT get_random_name() FROM
 generate_series(1, 10000);
 }}}

 Now let's do some search.
 {{{
 Book.objects.filter(name__icontains='orc')

 # SELECT "book_book"."id", "book_book"."name" FROM "book_book" WHERE
 UPPER("book_book"."name"::text) LIKE UPPER('%orc%') ORDER BY
 "book_book"."id" ASC
 # ->  Seq Scan on book_book  (cost=0.00..220.00 rows=80 width=24) (actual
 time=5.755..5.755 rows=0 loops=1)
 #        Filter: (upper(name) ~~ '%ORC%'::text)
 #        Rows Removed by Filter: 9996
 }}}

 Seq scan, totally understandable, no complains here. Now let's do 1m
 entries:
 {{{
 ->  Parallel Seq Scan on book_book  (cost=0.00..8865.51 rows=3098
 width=24) (actual time=2.024..341.389 rows=143 loops=3)
                Filter: (upper(name) ~~ '%ORC%'::text)
                Rows Removed by Filter: 369857
  Planning Time: 0.214 ms
  Execution Time: 374.926 ms
 }}}


 Still OK, but let's do better by adding indices. Since pg_trgm is very
 popular solution for text search, let's use it (I am not sure I know the
 alternatives aside from tsvector for such kind of queries):
 {{{
 indexes = [
             GinIndex(name='test_index', fields=['name'],
 opclasses=['gin_trgm_ops']),
         ]
 }}}
 Repeat query:
 {{{
 ->  Parallel Seq Scan on book_book  (cost=0.00..14641.50 rows=46 width=24)
 (actual time=331.265..331.266 rows=0 loops=3)
                Filter: (upper(name) ~~ '%ORC%'::text)
                Rows Removed by Filter: 370000
  Planning Time: 1.524 ms
  Execution Time: 368.272 ms
 }}}
 Looks like index is not working.

 Change query a bit:
 {{{
 EXPLAIN ANALYZE SELECT "book_book"."id", "book_book"."name" FROM
 "book_book" WHERE "book_book"."name"::text ILIKE '%orc%' ORDER BY
 "book_book"."id" ASC;
 }}}
 ->
 {{{
   ->  Bitmap Heap Scan on book_book  (cost=20.86..426.28 rows=111
 width=24) (actual time=0.158..1.441 rows=430 loops=1)
          Recheck Cond: (name ~~* '%orc%'::text)
          Heap Blocks: exact=414
          ->  Bitmap Index Scan on test_index  (cost=0.00..20.83 rows=111
 width=0) (actual time=0.075..0.076 rows=430 loops=1)
                Index Cond: (name ~~* '%orc%'::text)
  Planning Time: 0.178 ms
  Execution Time: 1.553 ms
 }}}


 I call "UPPPER" function usage as "implementation detail", because
 postgres have standard operator. And the main issue is that implementation
 detail adds additional complexity to index and you basically have to
 remember to use "upper" in all trgm indices. icontains is standard lookup.
 trgm index is kind of standard these days. I believe, one would expect
 pretty straightforward interaction between lookup and index (lookup should
 become faster). Current behavior, when default index is basically not
 working for default lookup is not obvious for me. This can be easily
 avoided just by using standard, out of the box operator.

 Now let's do some measures to check if it is faster or not. I will use
 same data generator, I will use "orc" query always, I will always pick
 average of 3 measurements. No indices applied.
 {{{
 PostgreSQL 12.6 (Ubuntu 12.6-0ubuntu0.20.04.1) on x86_64-pc-linux-gnu,
 compiled by gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0, 64-bit
 +
 en_US.UTF-8 collation
 }}}

 Results:
 {{{
                          10k                     100k
 1m                      10m
 ILIKE                     9.0                     111
 629                     4247
 UPPER                     6.6                     72
 448                     3175
 LOWER                     7.1                     71
 462                     2962
 }}}

 https://stackoverflow.com/questions/20336665/lower-like-vs-ilike question
 shows similar difference, but notes that everything could be more
 complicated.

 Indead, in my environment, ILIKE shows lower performance and I have no
 explanation about that.

 However, I would call this results implementation detail as well, because
 the main question is using "standard" operators and predictable
 interactions rather than performance. If one is concerned about
 performance, it could use an index. I also understand, that starting from
 3.2 it could be solved by custom operator in Index, but how many people
 would expect that default won't work? And what about indices created
 before 3.2? Especially when docs clearly tells (with shady SQL equivalent
 term) that "icontains" will use ILIKE. I guess this either could be
 changed to ILIKE (with performance penalty and broken existing custom
 UPPER indices), either heavily documented - icontains lookup should
 clearly tells which query it will use and which index should be used. Also
 GinIndex/GistIndex could use a note about TEXT fields (that you probably
 want to use UPPER function for text).

-- 
Ticket URL: <https://code.djangoproject.com/ticket/32803>
Django <https://code.djangoproject.com/>
The Web framework for perfectionists with deadlines.

-- 
You received this message because you are subscribed to the Google Groups 
"Django updates" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to django-updates+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/django-updates/048.e0072997df951b89da4feb57cb6a3ab6%40djangoproject.com.

Reply via email to