Re: [HACKERS] WIP: index support for regexp search

Heikki Linnakangas Thu, 29 Nov 2012 05:37:52 -0800

One thing that bothers me with this algorithm is that the overflow mechanism is all-or-nothing. In many cases, even when there is a huge number of states in the diagram, you could still extract at least a few trigrams that must be present in any matching string, with little effort. At least, it seems like that to a human :-).


For example, consider this:

explain analyze select count(*) from azjunk4 where txt ~('^aabaacaadaaeaafaagaahaaiaajaakaalaamaanaaoaapaaqaaraasaataauaavaawaaxaayaazabaabbabcabdabeabfabgabhabiabjabkablabmabnaboabpabqabrabsabtabuabvabwabxabyabzacaacbaccacdaceacfacgachaciacjackaclacmacnacoacpacqacracsactacuacvacwacxacyaczadaadbadcaddadeadfadgadhadiadjadkadladmadnadoadpadqadradsadtaduadvadwadxadyadzaeaaebaecaedaeeaefaegaehaeiaejaekaelaemaenaeoaepaeqaeraesaetaeuaevaewaexaeyaezafaafbafcafdafeaffafgafhafiafjafkaflafmafnafoafpafqafrafsaftafuafvafwafxafyafzagaagbagcagdageagfaggaghagiagjagkaglagmagnagoagpagqagragsagtaguagvagwagxagyagzahaahbahcahdaheahfahgahhahiahjahkahlahmahnahoahpahqahrahs$');


you get a query plan like this (the long regexp string edited out):

 Aggregate  (cost=228148.02..228148.03 rows=1 width=0) (actual time=131.100..131.101 rows=1 loops=1)
   ->  Bitmap Heap Scan on azjunk4  (cost=228144.01..228148.02 rows=1 width=0) (actual time=131.096..131.096 rows=0 loops=1)
         Recheck Cond: (txt ~ <ridiculously long regexp>)
         Rows Removed by Index Recheck: 10000
         ->  Bitmap Index Scan on azjunk4_trgmrgx_txt_01_idx  (cost=0.00..228144.01 rows=1 width=0) (actual time=82.914..82.914 rows=10000 loops=1)
               Index Cond: (txt ~ <ridiculously long regexp>)
 Total runtime: 131.230 ms
(7 rows)

That ridiculously long string exceeds the number of states (I think, could be number of paths or arcs too), and the algorithm gives up, resorting to scanning the whole index, as can be seen by the "Rows Removed by Index Recheck" line. However, it's easy to see that any matching string must contain *any* of the possible trigrams the algorithm extracts. If it could safely return just a few of them, say "aab" and "abz", and discard the rest, that would already be much better than a full index scan.
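To illustrate why a partial set of trigrams would still be useful, here is a minimal Python sketch. It is not the pg_trgm code: the helper names, the shortened pattern and the sample rows are all made up for illustration, and it ignores the padding and case folding that pg_trgm applies. It only shows that, for a literal pattern, requiring a subset of the pattern's trigrams can let extra rows through to the recheck but never drops a true match.

```python
def trigrams(s):
    """Return the set of all 3-character substrings of s (no padding)."""
    return {s[i:i + 3] for i in range(len(s) - 2)}


def may_match(row, required):
    """Index-style prefilter: keep a row only if it contains every required trigram."""
    return required <= trigrams(row)


# Hypothetical data: a shortened literal pattern and a handful of rows.
pattern = "aabaacaadaae"
rows = ["aabaacaadaae", "xxaabxx", "zzzzzz", "aabaacaadaaf"]

full = trigrams(pattern)   # every trigram the literal pattern implies
subset = {"aab", "aad"}    # a few trigrams kept after an assumed "overflow"

candidates_full = [r for r in rows if may_match(r, full)]
candidates_subset = [r for r in rows if may_match(r, subset)]
candidates_none = list(rows)  # giving up entirely: every row goes to recheck

# The recheck step (here: exact equality, standing in for '^...$') removes
# false positives in all three cases.
matches = [r for r in candidates_subset if r == pattern]
```

In this toy case the subset filter sends two rows to the recheck instead of four, and the one real match survives in every variant.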

Would it be safe to simply stop the depth-first search short on overflow, and proceed with the graph that was constructed up to that point?


- Heikki


--
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
