Re: [Rdkit-discuss] Speeding up database queries...

Greg Landrum Wed, 04 May 2011 21:41:35 -0700

Hi JP,

On Wed, May 4, 2011 at 10:49 AM, JP <jeanpaul.ebe...@inhibox.com> wrote:
> Hi there Adrian,
>
> Why should splitting the table vertically make a difference?
> Am I not correct in thinking that would then require a join, which is
> expensive (especially on 8M rows) ?


You would actually only be doing the join on the rows that match your
SSS/similarity query. As long as you have some kind of primary key and
build indices on that primary key the performance should be fine. This
is the layout I almost always use: molecules in one table, bit vect
fingerprints in a second, count-based fingerprints in a third.

> My FS is local, and my indices are quite large
>
> db=#  \di+ idx_ligand_rdkitmol;
>                               List of relations
>  Schema |        Name         | Type  | Owner | Table  |  Size   | Description
> --------+---------------------+-------+-------+--------+---------+-------------
>  public | idx_ligand_rdkitmol | index | jpebe | ligand | 5030 MB |
> (1 row)
>
> db=#  \di+ idx_ligand_morganbv;
>                               List of relations
>  Schema |        Name         | Type  | Owner | Table  |  Size   | Description
> --------+---------------------+-------+-------+--------+---------+-------------
>  public | idx_ligand_morganbv | index | jpebe | ligand | 1645 MB |
> (1 row)

yeah, those are huge.

> But Greg is right (!) running the query a second time resulted in much
> faster performance (11285.886ms as opposed to the original
> 193973.253ms)
> Of course if you change the smiles string, than nothing is cached  and
> it takes ages again...

I actually wasn't proposing re-running the query to get the cached
results, that's cheating. :-) The idea is that once you have the index
in memory all queries will go faster. This is what the emolecules
example on the wiki shows.

It sounds like you've reached a database size where doing some real
performance tuning on the database machine is going to be required. I
don't have much experience with this, but there does seem to be a fair
amount of information out there on the web. This page, in particular,
looks helpful in explaining what the configuration parameters are and
providing suggestions for tuning them:
http://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server
shared_buffers and effective_cache_size look particularly relevant.

This kind of performance tuning information would be really useful to
collect, so if you don't end up getting totally frustrated, please do
share your findings.

Best,
-greg

------------------------------------------------------------------------------
WhatsUp Gold - Download Free Network Management Software
The most intuitive, comprehensive, and cost-effective network 
management toolset available today.  Delivers lowest initial 
acquisition cost and overall TCO of any competing solution.
http://p.sf.net/sfu/whatsupgold-sd
_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss

Re: [Rdkit-discuss] Speeding up database queries...

Reply via email to