Hi Rajarshi,

On Sat, Jan 6, 2018 at 6:32 AM, Rajarshi Guha <rajarshi.g...@gmail.com>

> Hi, I'm using RDkit 2017.09 with Postgres 9.5. I loaded in ChEMBL23 using
> the instructions given at http://www.rdkit.org/docs/Cartridge.html and
> set up the indexes etc.
> However, when I run a substructure query I, EXPLAIN ANALYZE output looks
> like
> chembl_23=> explain analyze select count(*) from rdk.mols where m@
>                                                             QUERY PLAN
> ------------------------------------------------------------
> ----------------------------------------------------------------------
>  Aggregate  (cost=6564.86..6564.87 rows=1 width=0) (actual
> time=10546.899..10546.899 rows=1 loops=1)
>    ->  Bitmap Heap Scan on mols  (cost=369.80..6560.54 rows=1727 width=0)
> (actual time=465.881..10539.663 rows=8170 loops=1)
>          Recheck Cond: (m @> 'C1CCC2C(C1)CCC1C3CCCC3CCC21'::mol)
>          Rows Removed by Index Recheck: 16131
>          Heap Blocks: exact=9458
>          ->  Bitmap Index Scan on molidx  (cost=0.00..369.37 rows=1727
> width=0) (actual time=460.029..460.029 rows=24301 loops=1)
>                Index Cond: (m @> 'C1CCC2C(C1)CCC1C3CCCC3CCC21'::mol)
>  Planning time: 1.258 ms
>  Execution time: 10548.293 ms
> While it's using the GiST index, I note the big difference in the expected
> (1727) and actual (24301) row counts for the Bitmap Index Scan node.
> This seems to suggest that the index statistics are not accurate. Has
> anybody noticed this? Would fiddling with planner settings (such
> as default_statistics_target) be useful for this?

Almost certainly

> Interestingly, what ever query SMILES I put in, the expected row count for
> the Bitmap Index Scan is always 1727. Is this by design?

"by design" makes it sound better thought out than it is... this is what
the code does. :-S

The cartridge is not well integrated with the query planner. I would assume
that this is a solvable problem, but it's not something I've ever managed
to set aside a big enough chunk of time to really dig into.

> (The other aspect I noted is that in the subsequent Bitmap Heap Scan, a
> large number of rows are discarded. Since the heap pages pointed to by the
> Bitmap Index Scan node have to be scanned completely, is it feasible for
> structurally similar compounds to be colocated in a heap page? Or is this
> beyond the scope of the GiST index?)

I believe that this is beyond what you can do with GiST, but I'm not an
expert here

Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
Rdkit-discuss mailing list

Reply via email to