Hi James,
[fair warning before I start: we quickly hit the limits of my postgresql
expertise here]
On Thu, May 8, 2014 at 2:35 PM, James Davidson <[email protected]>wrote:
> Dear All,
>
>
>
> I have recently been spending a bit more time with the RDKit cartridge,
> and have what is probably a very naïve question…
>
> Having built some RDKit fingerprints for ChEMBL_18, I see the following
> behaviour (for clarification – ‘ecfp4_bv’ is the column in my rdk.fps table
> that has been generated using morganbv_fp(mol, 2)):
>
>
>
>
>
> chembl_18=# \timing on
>
> Timing is on.
>
>
>
> chembl_18=# set rdkit.tanimoto_threshold=0.5;
>
> SET
>
> Time: 0.167 ms
>
>
>
> chembl_18=# select chembl_id from rdk.fps where ecfp4_bv %
> morganbv_fp('c1nnccc1'::mol,2);
>
> chembl_id
>
> -------------
>
> CHEMBL15719
>
> (1 row)
>
>
>
> Time: 2033.348 ms
>
>
>
> chembl_18=# select chembl_id from rdk.fps where tanimoto_sml(ecfp4_bv,
> morganbv_fp('c1nnccc1'::mol, 2)) > 0.5;
>
> chembl_id
>
> -------------
>
> CHEMBL15719
>
> (1 row)
>
>
>
> Time: 6843.605 ms
>
>
>
>
>
> I can see that the query plans are different in the two cases, but I don’t
> fully understand why – see below:
>
>
>
> *QUERY 1 (with explain analyze)*
>
> chembl_18=# explain analyze select chembl_id from rdk.fps where ecfp4_bv %
> morganbv_fp('c1nnccc1'::mol,2);
>
>
> QUERY PLAN
>
>
> ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Bitmap Heap Scan on fps (cost=106.91..5298.31 rows=1352 width=13) (actual
> time=1774.986..1774.987 rows=1 loops=1)
>
> Recheck Cond: (ecfp4_bv %
> '\x00000000000000000100000000000000080000000000000000000000000000000000004200000000000482000000000400000000000000000000000000000000'::bfp)
>
> -> Bitmap Index Scan on fps_ecfp4bv_idx (cost=0.00..106.57 rows=1352
> width=0) (actual time=1774.969..1774.969 rows=1 loops=1)
>
> Index Cond: (ecfp4_bv %
> '\x00000000000000000100000000000000080000000000000000000000000000000000004200000000000482000000000400000000000000000000000000000000'::bfp)
>
> Total runtime: 1775.035 ms
>
> (5 rows)
>
>
>
> Time: 1776.133 ms
>
>
>
>
>
> *QUERY 2 (with explain analyze)*
>
> chembl_18=# explain analyze select chembl_id from rdk.fps where
> tanimoto_sml(ecfp4_bv, morganbv_fp('c1nnccc1'::mol, 2)) > 0.5;
>
>
> QUERY PLAN
>
>
> -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Seq Scan on fps (cost=0.00..388808.17 rows=450793 width=13) (actual
> time=1278.115..6953.977 rows=1 loops=1)
>
> Filter: (tanimoto_sml(ecfp4_bv,
> '\x00000000000000000100000000000000080000000000000000000000000000000000004200000000000482000000000400000000000000000000000000000000'::bfp)
> > 0.5::double precision)
>
> Rows Removed by Filter: 1352377
>
> Total runtime: 6954.010 ms
>
> (4 rows)
>
>
>
> Time: 6955.103 ms
>
What these are telling you is that the second query is not using the index:
it's a sequential scan, so it has to test all rows of the database. This
happens because the index is defined for the operator %, but not for the
function tanimoto_sml(). There may be an approach to get the index set up
using that function, but there we reach the limits of my expertise.
> It seems conceptually ‘easier’ to add the similarity value as part of
> the query, rather than setting it as a variable ahead of the query; but
> clearly I should be doing it the latter way for performance reasons. So
> even if I don’t fully understand why at the moment, am I correct in
> thinking that queries of this sort should always be run with the similarity
> operators (%, #)? And if so, is the rdkit.tanimoto_threshold variable set
> at the level of the session, the user, or the database?
>
>
It's set at the session level.
When doing similarity searches, I find it generally helpful to also include
the <%> operator in an "order by" clause so that the results come back in
sorted order.
So instead of this;
chembl_17=# select molregno from rdk.fps where mfp2 %
morganbv_fp('Cc1ccc2nc(N(C)CC(=O)O)sc2c1');
molregno
----------
412312
412302
412310
441378
470082
773946
775269
911501
1015485
1034321
1040255
1040496
1042958
1043871
1044892
1045663
1047691
1049393
(18 rows)
Time: 1042.310 ms
I do this:
chembl_17=# select molregno from rdk.fps where mfp2 %
morganbv_fp('Cc1ccc2nc(N(C)CC(=O)O)sc2c1')
order by
morganbv_fp('Cc1ccc2nc(N(C)CC(=O)O)sc2c1') <%> mfp2;
molregno
----------
412312
470082
1040255
773946
1044892
1049393
1040496
441378
1047691
1042958
412302
1043871
412310
1045663
911501
775269
1015485
1034321
(18 rows)
Time: 1032.266 ms
Notice that this doesn't make things any slower.
It's nice to see the actual similarity values:
chembl_17=# select
molregno,tanimoto_sml(morganbv_fp('Cc1ccc2nc(N(C)CC(=O)O)sc2c1'),mfp2)
from rdk.fps
where mfp2 % morganbv_fp('Cc1ccc2nc(N(C)CC(=O)O)sc2c1')
order by
morganbv_fp('Cc1ccc2nc(N(C)CC(=O)O)sc2c1') <%> mfp2;
molregno | tanimoto_sml
----------+-------------------
412312 | 0.692307692307692
470082 | 0.583333333333333
1040255 | 0.571428571428571
773946 | 0.549019607843137
1044892 | 0.518518518518518
1049393 | 0.517857142857143
1040496 | 0.517857142857143
441378 | 0.510204081632653
1047691 | 0.509090909090909
1042958 | 0.509090909090909
412302 | 0.5
1043871 | 0.5
412310 | 0.5
1045663 | 0.5
911501 | 0.5
775269 | 0.5
1015485 | 0.5
1034321 | 0.5
(18 rows)
Time: 1034.257 ms
This also doesn't slow things down, but it is now pretty repetitive, so I
typically define a function for it:
chembl_17=# create or replace function do_mfp2_search(text)
chembl_17-# returns table(molregno integer, similarity double
precision) as
chembl_17-# $$
chembl_17$# select molregno,tanimoto_sml(morganbv_fp($1::mol),mfp2) as
similarity
chembl_17$# from rdk.fps
chembl_17$# where morganbv_fp($1::mol)%mfp2
chembl_17$# order by morganbv_fp($1::mol)<%>mfp2;
chembl_17$# $$ language sql volatile ;
CREATE FUNCTION
Time: 0.926 ms
chembl_17=# select * from do_mfp2_search('Cc1ccc2nc(N(C)CC(=O)O)sc2c1');
molregno | similarity
----------+-------------------
412312 | 0.692307692307692
470082 | 0.583333333333333
1040255 | 0.571428571428571
773946 | 0.549019607843137
1044892 | 0.518518518518518
1049393 | 0.517857142857143
1040496 | 0.517857142857143
441378 | 0.510204081632653
1047691 | 0.509090909090909
1042958 | 0.509090909090909
412302 | 0.5
1043871 | 0.5
412310 | 0.5
1045663 | 0.5
911501 | 0.5
775269 | 0.5
1015485 | 0.5
1034321 | 0.5
(18 rows)
Time: 1061.676 ms
Another feature that can be quite useful is a nearest-neighbor search;
which can be used to find the N nearest-neighbors while ignoring the
similarity threshold. Here's an example that starts by setting the
threshold higher and then demonstrates what a neighbor search returns:
chembl_17=# set rdkit.tanimoto_threshold = 0.6;
SET
Time: 0.214 ms
chembl_17=# select * from do_mfp2_search('Cc1ccc2nc(N(C)CC(=O)O)sc2c1');
molregno | similarity
----------+-------------------
412312 | 0.692307692307692
(1 row)
Time: 1045.409 ms
chembl_17=# select
molregno,tanimoto_sml(morganbv_fp('Cc1ccc2nc(N(C)CC(=O)O)sc2c1'),mfp2)
from rdk.fps
order by morganbv_fp('Cc1ccc2nc(N(C)CC(=O)O)sc2c1') <%>
mfp2 limit 5;
molregno | tanimoto_sml
----------+-------------------
412312 | 0.692307692307692
470082 | 0.583333333333333
1040255 | 0.571428571428571
773946 | 0.549019607843137
1044892 | 0.518518518518518
(5 rows)
Time: 1278.761 ms
One final advanced topic: if you are planning on making regular use of the
similarity features in the cartridge and are running on a linux system or
Mac I would recommend recompiling the cartridge with some optimizations for
tanimoto similarity. To do this, you need to edit the cartridge Makefile
from:
PG_CPPFLAGS = -I${BOOSTHOME} -I${RDKIT}/Code -DRDKITVER='"007200"'
${INCHIFLAGS} #-DUSE_BUILTIN_POPCOUNT -msse4.2
to:
PG_CPPFLAGS = -I${BOOSTHOME} -I${RDKIT}/Code -DRDKITVER='"007200"'
${INCHIFLAGS} -DUSE_BUILTIN_POPCOUNT -msse4.2
(I just removed a comment character here). This speeds the Tanimoto
calculation up a fair bit (it's still not nearly as fast as Andrew's
chemfp, but it's better than the default behavior).
Hope this helps,
-greg
------------------------------------------------------------------------------
Is your legacy SCM system holding you back? Join Perforce May 7 to find out:
• 3 signs your SCM is hindering your productivity
• Requirements for releasing software faster
• Expert tips and advice for migrating your SCM now
http://p.sf.net/sfu/perforce
_______________________________________________
Rdkit-discuss mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss