Hi, I'm using RDkit 2017.09 with Postgres 9.5. I loaded in ChEMBL23 using
the instructions given at http://www.rdkit.org/docs/Cartridge.html and set
up the indexes etc.

However, when I run a substructure query I, EXPLAIN ANALYZE output looks
like

chembl_23=> explain analyze select count(*) from rdk.mols where m@
>'C1CC2CCC3C(CCC4CCCCC34)C2C1';

                                                            QUERY PLAN


----------------------------------------------------------------------------------------------------------------------------------

 Aggregate  (cost=6564.86..6564.87 rows=1 width=0) (actual
time=10546.899..10546.899 rows=1 loops=1)

   ->  Bitmap Heap Scan on mols  (cost=369.80..6560.54 rows=1727 width=0)
(actual time=465.881..10539.663 rows=8170 loops=1)

         Recheck Cond: (m @> 'C1CCC2C(C1)CCC1C3CCCC3CCC21'::mol)

         Rows Removed by Index Recheck: 16131

         Heap Blocks: exact=9458

         ->  Bitmap Index Scan on molidx  (cost=0.00..369.37 rows=1727
width=0) (actual time=460.029..460.029 rows=24301 loops=1)

               Index Cond: (m @> 'C1CCC2C(C1)CCC1C3CCCC3CCC21'::mol)

 Planning time: 1.258 ms

 Execution time: 10548.293 ms

While it's using the GiST index, I note the big difference in the expected
(1727) and actual (24301) row counts for the Bitmap Index Scan node.

This seems to suggest that the index statistics are not accurate. Has
anybody noticed this? Would fiddling with planner settings (such
as default_statistics_target) be useful for this?

Interestingly, what ever query SMILES I put in, the expected row count for
the Bitmap Index Scan is always 1727. Is this by design?

(The other aspect I noted is that in the subsequent Bitmap Heap Scan, a
large number of rows are discarded. Since the heap pages pointed to by the
Bitmap Index Scan node have to be scanned completely, is it feasible for
structurally similar compounds to be colocated in a heap page? Or is this
beyond the scope of the GiST index?)

-- 
Rajarshi Guha | http://blog.rguha.net
NIH Center for Advancing Translational Science
------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss

Reply via email to