Hi there,
TL;DR Some queries using the index are slower than queries not using it.
I have been looking (without much success until now) at improving the speed
of a postgresql database with rdkit installed. First some version
preliminaries:
Postgresql: 9.3.4
RDKit postgresql cartridge version: 0.72.0 (via select rdkit_version();)
My database has only one simple table "molecule" (described below), and
this was created following the instructions in
http://www.rdkit.org/docs/Cartridge.html religiously.
moldb-# \d molecule
Table "public.molecule"
Column | Type | Modifiers
--------+---------+-----------
id | integer |
mol | mol |
Indexes:
"molidx" gist (mol)
This tables contains 15,352,756 molecules.
Now, when I make a substructure search which returns a good number of rows
(e.g. aromatic ring) the query takes "forever" (137 minutes, more than what
Germany employed to get the better of Messi & Co. yesterday).
moldb=# select id, mol from molecule where mol@>'c1ccccc1';
Time: 8260085.134 ms
The query plan uses two queries (nested) as can be seen by:
moldb=# EXPLAIN (ANALYZE, BUFFERS) select id, mol from molecule where mol@
>'c1ccccc1';
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on molecule (cost=3439.42..57845.93 rows=15355
width=348) (actual time=67358.636..1408828.792 rows=12555041 loops=1)
Recheck Cond: (mol @> 'c1ccccc1'::mol)
Rows Removed by Index Recheck: 70657
Buffers: shared hit=33551 read=1516077
-> Bitmap Index Scan on molidx (cost=0.00..3435.58 rows=15355 width=0)
(actual time=66891.010..66891.010 rows=12625698 loops=1)
Index Cond: (mol @> 'c1ccccc1'::mol)
Buffers: shared read=825325
>From the postgresql documentation of similar output:
"Here the planner has decided to use a two-step plan: the child plan node
visits an index to find the locations of rows matching the index condition,
and then the upper plan node actually fetches those rows from the table
itself. Fetching rows separately is much more expensive than reading them
sequentially, but because not all the pages of the table have to be
visited, this is still cheaper than a sequential scan. (The reason for
using two plan levels is that the upper plan node sorts the row locations
identified by the index into physical order before reading them, to
minimize the cost of separate fetches. The "bitmap" mentioned in the node
names is the mechanism that does the sorting.)"
But random access is more expensive than sequential IO. So my next thought
was to disable the use of the index in the query. A temp. way to do this
is via these commands (note that this should only be done for debugging
purposes and not on live systems):
moldb=# set enable_hashagg=off;
moldb=# set enable_hashjoin=off;
moldb=# set enable_indexscan=off;
moldb=# set enable_mergejoin=off;
moldb=# set enable_nestloop=off;
moldb=# set enable_tidscan=off;
moldb=# set enable_sort=off;
In fact my explain result now looks like (happily bypassing the index):
moldb=# explain select id, mol from molecule where mol@>'c1ccccc1';
QUERY PLAN
-------------------------------------------------------------------
Seq Scan on molecule (cost=0.00..923708.29 rows=15355 width=348)
Filter: (mol @> 'c1ccccc1'::mol)
(2 rows)
And running this query now takes almost half the time, 81 minutes (still a
draw between the two sides).
moldb=# select id, mol from molecule where mol@>'c1ccccc1';
Time: 4888271.991 ms
Note that a query which returns a few molecules, returns in super quick
time (<1s):
moldb=# select * from molecule where mol@>'c1cccc2c1nncc2'; -- there are
~6,000 of these
Time: 4476.669 ms
But substructures which are found more often, return more slowly.
moldb=# select count(*) from molecule where mol@>'c1ccccc1C(=O)OC';
count
--------
422439
(1 row)
Time: 66134.405 ms
I wonder if there is anyone who has looked at this/has similar experience
and has some answers to the following questions:
- Am I doing anything obviously wrong? / Is there an easy fix to speed
these queries up? (without using limit)
- Is there a way for RDKit to suggest to Postgresql to NOT use the index in
certain cases?
- Which RDKit code should I be looking at to try to optimize this? More
specifically, even looking for molecules with a carbon atom 'C' triggers a
search using the index, even if all my molecules in the database have a
carbon atom (so a sequential search would make more sense here).
- I have tried to play around with the table statistics - alter table
molecule alter mol set statistics 1000; - this should be the thing which
suggests to the query planner whether to use the index or not (based on a
sample of the table) - but I haven't noticed any difference. Does the
RDKit indexing technology support this?
Apologies for the long email and happy Monday to everyone!
JP
-
Jean-Paul Ebejer
Early Stage Researcher
------------------------------------------------------------------------------
Want fast and easy access to all the code in your enterprise? Index and
search up to 200,000 lines of code with a free copy of Black Duck®
Code Sight™ - the same software that powers the world's largest code
search on Ohloh, the Black Duck Open Hub! Try it now.
http://p.sf.net/sfu/bds
_______________________________________________
Rdkit-discuss mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss