Re: [Rdkit-discuss] Speeding up database queries...

Adrian Schreyer Wed, 04 May 2011 00:32:22 -0700

On Wed, May 4, 2011 at 04:26, Greg Landrum <greg.land...@gmail.com> wrote:
> Dear JP,
>
> On Tue, May 3, 2011 at 6:29 PM, JP <jeanpaul.ebe...@inhibox.com> wrote:
>>
>> Based on an rdkit post I read over the warm weekend I set myself to
>> have a look at my rdkit based queries (and ways to speed them up)...
>>
>> But first some details:
>>
>> Postgres:
>> PostgreSQL 9.0.3 on x86_64-unknown-linux-gnu, compiled by GCC gcc
>> (GCC) 4.5.2, 64-bit
>>
>> RDKit (DB Cartridge):
>> v. 0.20.0
>>
>
> The other important question is how much memory you have and what
> filesystem postgres is using for the database (local or network).
>
>> # of Molecules
>> 8,432,896
>>
>> Database table (ligand)
>>                                  Table "public.ligand"
>>   Column   |         Type          |                      Modifiers
>> ------------+-----------------------+-----------------------------------------------------
>>  id         | integer               | not null default
>> nextval('ligand_id_seq'::regclass)
>>  supplierid | character varying(50) |
>>  smiles     | text                  |
>>  rdkitmol   | mol                   |
>>  pairbv     | bfp                   |
>>  torsionbv  | bfp                   |
>>  morganbv   | bfp                   |
>>  amw        | real                  |
>>  mollogp    | real                  |
>>  hba        | integer               |
>>  hbd        | integer               |
>>  atoms      | integer               |
>>  hvyatoms   | integer               |
>> Indexes:
>>    "ligand_pkey" PRIMARY KEY, btree (id)
>>    "idx_ligand_morganbv" gist (morganbv)
>>    "idx_ligand_pairbv" gist (pairbv)
>>    "idx_ligand_rdkitmol" gist (rdkitmol)
>>    "idx_ligand_torsionbv" gist (torsionbv)
>>
>>
>> I cannot explain why the following queries:
>>
>> db=# select count(*) from ligand where rdkitmol@>'c1cccc2c1nncc2' ;
>>  count
>> -------
>>  2942
>> (1 row)
>>
>> Time: 193973.253 ms
>> db=# select count(*) from ligand where 
>> morganbv%morganbv_fp('c1cccc2c1nncc2',2);
>>  count
>> -------
>>     8
>> (1 row)
>>
>> Time: 400138.989 ms
>
> The performance is critically dependent on whether or not the indices
> are in memory. If you've just started the database or if there's been
> a lot of non-postgres activity on the machine since the last time you
> used it, it can take a long time to load the index from disk to
> memory. Once it has been loaded, things should go faster.
>
> My emolecules database requires about 1 GB for each of the indices:
>
> emolecules=# \di+ molidx;
>                         List of relations
>  Schema |  Name  | Type  |  Owner   | Table |  Size   | Description
> --------+--------+-------+----------+-------+---------+-------------
>  public | molidx | index | glandrum | mols  | 1049 MB |
> (1 row)
> emolecules=# \di+ mfp2idx;
>                         List of relations
>  Schema |  Name   | Type  |  Owner   | Table |  Size  | Description
> --------+---------+-------+----------+-------+--------+-------------
>  public | mfp2idx | index | glandrum | fps   | 970 MB |
> (1 row)
>
>
>>
>> Take so long... these are orders of magnitude larger than timings
>> reported in http://code.google.com/p/rdkit/wiki/DatabaseCreation2
>> And my database in "only" roughly 50% larger (8M instead of the puny
>> 5M in emolecules).
>>
>> When I do an "explain" on these queries (to make sure the indices are
>> being used), I get:
>>
>> db=# explain select count(*) from ligand where rdkitmol@>'c1cccc2c1nncc2' ;
>>                                          QUERY PLAN
>> ----------------------------------------------------------------------------------------------
>>  Aggregate  (cost=34850.44..34850.45 rows=1 width=0)
>>   ->  Bitmap Heap Scan on ligand  (cost=2667.36..34829.36 rows=8433 width=0)
>>         Recheck Cond: (rdkitmol @> 'c1cc2c(nncc2)cc1'::mol)
>>         ->  Bitmap Index Scan on idx_ligand_rdkitmol
>> (cost=0.00..2665.25 rows=8433 width=0)
>>               Index Cond: (rdkitmol @> 'c1cc2c(nncc2)cc1'::mol)
>> (5 rows)
>>
>> db=# explain select count(*) from ligand where
>> morganbv%morganbv_fp('c1cccc2c1nncc2',2);
>>                                                     QUERY PLAN
>> ---------------------------------------------------------------------------------------------------------------------
>>  Aggregate  (cost=33290.88..33290.89 rows=1 width=0)
>>   ->  Bitmap Heap Scan on ligand  (cost=918.05..33269.79 rows=8433 width=0)
>>         Recheck Cond: (morganbv %
>> '\\xe0ffffff00040000130000007e00108444d20e3c40042af90238d0080a0c3462c2'::bfp)
>>         ->  Bitmap Index Scan on idx_ligand_morganbv
>> (cost=0.00..915.94 rows=8433 width=0)
>>               Index Cond: (morganbv %
>> '\\xe0ffffff00040000130000007e00108444d20e3c40042af90238d0080a0c3462c2'::bfp)
>> (5 rows)
>>
>> Looks good no?
>> Am I missing something?  Or is this the fastest my search can go at?
>> Supposedly the fingerprints search is just doing some ~8M binary
>> operations no?  Why does this take so long?
>> Ideas, anyone?
>
> This all looks fine.
>
> The experiments to try are:
> 1) do a second query and see if that's faster
> 2) try a smaller table (start with 1 or 2 million) and see how that performs


3) Split the table vertically and create a separate table for the fingerprints
4) Use horizontal sharding and partition the fingerprints table

Adrian

> Best Regards,
> -greg
>
> ------------------------------------------------------------------------------
> WhatsUp Gold - Download Free Network Management Software
> The most intuitive, comprehensive, and cost-effective network
> management toolset available today.  Delivers lowest initial
> acquisition cost and overall TCO of any competing solution.
> http://p.sf.net/sfu/whatsupgold-sd
> _______________________________________________
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>

------------------------------------------------------------------------------
WhatsUp Gold - Download Free Network Management Software
The most intuitive, comprehensive, and cost-effective network 
management toolset available today.  Delivers lowest initial 
acquisition cost and overall TCO of any competing solution.
http://p.sf.net/sfu/whatsupgold-sd
_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss

Re: [Rdkit-discuss] Speeding up database queries...

Reply via email to