Hi there all (and Greg, since he is likely to answer this),

Based on an rdkit post I read over the warm weekend I set myself to
have a look at my rdkit based queries (and ways to speed them up)...

But first some details:

Postgres:
PostgreSQL 9.0.3 on x86_64-unknown-linux-gnu, compiled by GCC gcc
(GCC) 4.5.2, 64-bit

RDKit (DB Cartridge):
v. 0.20.0

# of Molecules
8,432,896

Database table (ligand)
                                  Table "public.ligand"
   Column   |         Type          |                      Modifiers
------------+-----------------------+-----------------------------------------------------
 id         | integer               | not null default
nextval('ligand_id_seq'::regclass)
 supplierid | character varying(50) |
 smiles     | text                  |
 rdkitmol   | mol                   |
 pairbv     | bfp                   |
 torsionbv  | bfp                   |
 morganbv   | bfp                   |
 amw        | real                  |
 mollogp    | real                  |
 hba        | integer               |
 hbd        | integer               |
 atoms      | integer               |
 hvyatoms   | integer               |
Indexes:
    "ligand_pkey" PRIMARY KEY, btree (id)
    "idx_ligand_morganbv" gist (morganbv)
    "idx_ligand_pairbv" gist (pairbv)
    "idx_ligand_rdkitmol" gist (rdkitmol)
    "idx_ligand_torsionbv" gist (torsionbv)


I cannot explain why the following queries:

db=# select count(*) from ligand where rdkitmol@>'c1cccc2c1nncc2' ;
 count
-------
  2942
(1 row)

Time: 193973.253 ms
db=# select count(*) from ligand where morganbv%morganbv_fp('c1cccc2c1nncc2',2);
 count
-------
     8
(1 row)

Time: 400138.989 ms

Take so long... these are orders of magnitude larger than timings
reported in http://code.google.com/p/rdkit/wiki/DatabaseCreation2
And my database in "only" roughly 50% larger (8M instead of the puny
5M in emolecules).

When I do an "explain" on these queries (to make sure the indices are
being used), I get:

db=# explain select count(*) from ligand where rdkitmol@>'c1cccc2c1nncc2' ;
                                          QUERY PLAN
----------------------------------------------------------------------------------------------
 Aggregate  (cost=34850.44..34850.45 rows=1 width=0)
   ->  Bitmap Heap Scan on ligand  (cost=2667.36..34829.36 rows=8433 width=0)
         Recheck Cond: (rdkitmol @> 'c1cc2c(nncc2)cc1'::mol)
         ->  Bitmap Index Scan on idx_ligand_rdkitmol
(cost=0.00..2665.25 rows=8433 width=0)
               Index Cond: (rdkitmol @> 'c1cc2c(nncc2)cc1'::mol)
(5 rows)

db=# explain select count(*) from ligand where
morganbv%morganbv_fp('c1cccc2c1nncc2',2);
                                                     QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=33290.88..33290.89 rows=1 width=0)
   ->  Bitmap Heap Scan on ligand  (cost=918.05..33269.79 rows=8433 width=0)
         Recheck Cond: (morganbv %
'\\xe0ffffff00040000130000007e00108444d20e3c40042af90238d0080a0c3462c2'::bfp)
         ->  Bitmap Index Scan on idx_ligand_morganbv
(cost=0.00..915.94 rows=8433 width=0)
               Index Cond: (morganbv %
'\\xe0ffffff00040000130000007e00108444d20e3c40042af90238d0080a0c3462c2'::bfp)
(5 rows)

Looks good no?
Am I missing something?  Or is this the fastest my search can go at?
Supposedly the fingerprints search is just doing some ~8M binary
operations no?  Why does this take so long?
Ideas, anyone?

Many Thanks
JP

------------------------------------------------------------------------------
WhatsUp Gold - Download Free Network Management Software
The most intuitive, comprehensive, and cost-effective network 
management toolset available today.  Delivers lowest initial 
acquisition cost and overall TCO of any competing solution.
http://p.sf.net/sfu/whatsupgold-sd
_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss

Reply via email to