Re: [PERFORM] substring index what is better way to query

Tieson Molly Tue, 20 Jun 2017 07:42:29 -0700

Tom,
is there a different construct than the Similar To that would work?

I know for certain that the first few characters could be different due to
the nature of geohashes.  So I may not be able to optimize the prefix
aspect in some cases.


Best regards,

Ty

On Jun 20, 2017 10:19 AM, "Tom Lane" <t...@sss.pgh.pa.us> wrote:

> Tieson Molly <tvm...@gmail.com> writes:
> > I have an index I created on the prefix of a column:
>
> > create index location_geo_idx ON locations( substring(geohash, 0, 5));
>
> > I currently use a similar to query, but I wanted to know if there is a
> > faster way to query multiple value using this index than this?
>
> > select l.geohash from locations l where l.geohash similar to
> '(dr7g|dr7e)%';
>
> Well, you've got a couple of problems there.  The most basic one is that
> that index doesn't match that query at all.  You need to arrange things
> so that the lefthand side of the SIMILAR TO operator is exactly the
> indexed value, not something that's related to it.  (Yes, in principle
> that index could be used to answer this query, but it would require a
> great deal more intimate knowledge than the planner has about the
> semantics of both substring() and SIMILAR TO.)  IOW, you need to write
>
> select l.geohash from locations l
>   where substring(l.geohash, 0, 5) similar to '(dr7g|dr7e)%';
>
> The other possible solution would be to just index the geohash strings
> verbatim; unless they are quite long, that's what I'd recommend, usually.
>
> Secondly, if you're using a non-C locale, you're likely not getting an
> indexscan plan anyway; check it with EXPLAIN.  To get an indexed prefix
> search out of a pattern match, the index has to use C sorting rules,
> which you can force with a COLLATE or text_pattern_ops option if the
> database's prevailing locale isn't C.
>
> Thirdly, if you experiment with EXPLAIN a little bit, you'll soon realize
> that the planner is not great at extracting common prefix strings out of
> OR'd pattern branches:
>
> regression=# create table loc (f1 text unique);
> CREATE TABLE
> regression=# explain select * from loc where f1 similar to '(dr7g|dr7e)%';
>                                QUERY PLAN
> -------------------------------------------------------------------------
>  Bitmap Heap Scan on loc  (cost=4.22..14.37 rows=1 width=32)
>    Filter: (f1 ~ '^(?:(?:dr7g|dr7e).*)$'::text)
>    ->  Bitmap Index Scan on loc_f1_key  (cost=0.00..4.22 rows=7 width=0)
>          Index Cond: ((f1 >= 'd'::text) AND (f1 < 'e'::text))
> (4 rows)
>
> The useful part of this for speed purposes is the "Index Cond", and
> you can see that it's only enforcing that the first character be "d".
> I don't remember that code very well at the moment, but I'm a bit
> surprised that it's even figured out that the "d" is common to both
> branches.  You can get a lot more traction if you factor the common
> prefix manually:
>
> regression=# explain select * from loc where f1 similar to 'dr7(g|e)%';
>                                QUERY PLAN
> -------------------------------------------------------------------------
>  Bitmap Heap Scan on loc  (cost=4.22..14.37 rows=1 width=32)
>    Filter: (f1 ~ '^(?:dr7(?:g|e).*)$'::text)
>    ->  Bitmap Index Scan on loc_f1_key  (cost=0.00..4.22 rows=7 width=0)
>          Index Cond: ((f1 >= 'dr7'::text) AND (f1 < 'dr8'::text))
> (4 rows)
>
> or maybe even
>
> regression=# explain select * from loc where f1 similar to 'dr7g%' or f1
> similar to 'dr7e%';
>                                   QUERY PLAN
> ------------------------------------------------------------
> -------------------
>  Bitmap Heap Scan on loc  (cost=8.45..19.04 rows=2 width=32)
>    Recheck Cond: ((f1 ~ '^(?:dr7g.*)$'::text) OR (f1 ~
> '^(?:dr7e.*)$'::text))
>    Filter: ((f1 ~ '^(?:dr7g.*)$'::text) OR (f1 ~ '^(?:dr7e.*)$'::text))
>    ->  BitmapOr  (cost=8.45..8.45 rows=14 width=0)
>          ->  Bitmap Index Scan on loc_f1_key  (cost=0.00..4.22 rows=7
> width=0)
>                Index Cond: ((f1 >= 'dr7g'::text) AND (f1 < 'dr7h'::text))
>          ->  Bitmap Index Scan on loc_f1_key  (cost=0.00..4.22 rows=7
> width=0)
>                Index Cond: ((f1 >= 'dr7e'::text) AND (f1 < 'dr7f'::text))
> (8 rows)
>
> Whether this is worth the trouble depends a lot on your data distribution,
> but any of them are probably better than the seqscan you're no doubt
> getting right now.
>
>                         regards, tom lane
>

Re: [PERFORM] substring index what is better way to query

Reply via email to