Re: [Rdkit-discuss] Postgres indexing question

Greg Landrum Sun, 07 Jan 2018 06:49:17 -0800

Hi Rajarshi,

On Sat, Jan 6, 2018 at 6:32 AM, Rajarshi Guha <[email protected]>
wrote:


> Hi, I'm using RDkit 2017.09 with Postgres 9.5. I loaded in ChEMBL23 using
> the instructions given at http://www.rdkit.org/docs/Cartridge.html and
> set up the indexes etc.
>
> However, when I run a substructure query I, EXPLAIN ANALYZE output looks
> like
>
> chembl_23=> explain analyze select count(*) from rdk.mols where m@
> >'C1CC2CCC3C(CCC4CCCCC34)C2C1';
>
>                                                             QUERY PLAN
>
>
> ------------------------------------------------------------
> ----------------------------------------------------------------------
>
>  Aggregate  (cost=6564.86..6564.87 rows=1 width=0) (actual
> time=10546.899..10546.899 rows=1 loops=1)
>
>    ->  Bitmap Heap Scan on mols  (cost=369.80..6560.54 rows=1727 width=0)
> (actual time=465.881..10539.663 rows=8170 loops=1)
>
>          Recheck Cond: (m @> 'C1CCC2C(C1)CCC1C3CCCC3CCC21'::mol)
>
>          Rows Removed by Index Recheck: 16131
>
>          Heap Blocks: exact=9458
>
>          ->  Bitmap Index Scan on molidx  (cost=0.00..369.37 rows=1727
> width=0) (actual time=460.029..460.029 rows=24301 loops=1)
>
>                Index Cond: (m @> 'C1CCC2C(C1)CCC1C3CCCC3CCC21'::mol)
>
>  Planning time: 1.258 ms
>
>  Execution time: 10548.293 ms
>
> While it's using the GiST index, I note the big difference in the expected
> (1727) and actual (24301) row counts for the Bitmap Index Scan node.
>
> This seems to suggest that the index statistics are not accurate. Has
> anybody noticed this? Would fiddling with planner settings (such
> as default_statistics_target) be useful for this?
>

Almost certainly


> Interestingly, what ever query SMILES I put in, the expected row count for
> the Bitmap Index Scan is always 1727. Is this by design?
>

"by design" makes it sound better thought out than it is... this is what
the code does. :-S

The cartridge is not well integrated with the query planner. I would assume
that this is a solvable problem, but it's not something I've ever managed
to set aside a big enough chunk of time to really dig into.


> (The other aspect I noted is that in the subsequent Bitmap Heap Scan, a
> large number of rows are discarded. Since the heap pages pointed to by the
> Bitmap Index Scan node have to be scanned completely, is it feasible for
> structurally similar compounds to be colocated in a heap page? Or is this
> beyond the scope of the GiST index?)
>

I believe that this is beyond what you can do with GiST, but I'm not an
expert here

-greg

------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot

_______________________________________________
Rdkit-discuss mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss

Re: [Rdkit-discuss] Postgres indexing question

Reply via email to