Hi Greg,

I am more or less done with the improvements of my GIST code so here
is a description of my approach as promised. I basically started with
the GIST "skeleton" functions shown in the PostgreSQL documentation
and added more things step by step. The first version that I described
in my blog used all the OpenEye graphsim functions to calculate
similarities etc. in order to keep the code as simple as possible. The
biggest difference between my code and the RDKit GIST is that I use
the tanimoto distance (or Soergel distance) instead of the Hamming
distance in the penalty function. The latter performed pretty badly,
that's why I tried different distance metrics (e.g. mean hamming
distance etc.) and the Soergel one turned out to be the best for the
strategies I implemented. Obviously, it works better for tanimoto/dice
and tversky than cosine/euclidean/manhattan but I use the former much
more often. The picksplit function I use is much simpler than the
RDKit one as well; it simply takes the GIST entry vector, finds the
two most dissimilar and splits the entries depending on their
similarity to them. The timings shown on my blog were recorded with
that first version.

Since then I tried different distance measures in my penalty function
but have not found a better one yet. What made a big difference though
was to move the similarity calculation to C thereby avoiding the
recreation of the OEFingerPrint objects. I also made some changes to
the upper bound calculations that occur in case of internal GIST nodes
(I think I had a bug in the first version). Trying to make this as
fast as possible, I looked on Google for some crazy bit hacks and
eventually found a SSE4-based implemention written by Imran Haque
(http://pubs.acs.org/doi/abs/10.1021/ci200235e) and modified it
slightly to fit my needs. I have written a new test script (using
sqlalchemy) to time the queries more exhaustively; the script selects
100 random smiles strings from ChEMBL (see the code on my blog) and
does a KNN-GIST based search against ChEMBL11 for various strategies.
As a result, running this script I get an average performance of
between 4 and 5 seconds to do all 100 random queries (Tanimoto) or
including printing the result tuples. In the worst case, i.e. using
the Cosine metric, I get between 15-20 seconds for 100 random queries.
My testing machine is I7-920 with postgres on virtualbox. The
POPCNT-based versions are 4-5x faster than the graphsim built-in
distance metrics (querying all of ChEMBL without index), which are
already quite fast. This means of course that my cartridge is not very
portable (nehalem upwards) but for me that's not a problem since I
implemented it for practice purposes. You might not have this luxury
though with RDKit.

I still think that I could improve the timings even further with a
better picksplit function - to be honest, mine is not that great - a
linear scan with any metric and smiles string takes less than 0.5s
with the SSE4 functions. Here is an example with a random smiles
string:

cryst=# SELECT  c.molregno, c.smiles, sq.similarity
cryst-# FROM    chembl c,
cryst-#         (
cryst(#           SELECT molregno, fp %
make_circular_fp('COc1ccc2c(c1)CCC3N2CC[NH+](C3)CCC4CCC(CC4)NC(=O)c5cc6ccccc6cn5')
as similarity
cryst(#             FROM fps
cryst(#         ORDER BY 2 DESC
cryst(#         LIMIT 10
cryst(#         ) sq
cryst-# where   c.molregno = sq.molregno;
 molregno |                                smiles
          | similarity
----------+----------------------------------------------------------------------+------------
   335431 | COc1ccc2c(c1)CCC3N2CC[NH+](C3)CCC4CCC(CC4)NC(=O)c5cc6ccccc6cn5
      |          1
   335330 | COc1ccc2c(c1)CCC3N2CC[NH+](C3)CCC4CCC(CC4)NC(=O)c5ccc6ccccc6n5
      |   0.765625
   335394 | COc1ccc2c(c1)CCC3N2CC[NH+](C3)CCC4CCC(CC4)NC(=O)c5cnc6ccccc6n5
      |   0.753846
   321365 | COc1ccc2c(c1)CC[C@H]3N2CC[NH+](C3)CCC4CCC(CC4)NC(=O)c5ccc6ccccc6c5
  |       0.75
   470091 | COc1ccc2c(c1)CC[C@H]3N2CC[NH+](C3)CCC4CCC(CC4)NC(=O)c5ccc6ccccc6c5
  |       0.75
   321561 | COc1ccc2c(c1)CC[C@@H]3N2CC[NH+](C3)CCC4CCC(CC4)NC(=O)c5ccc6ccccc6c5
 |       0.75
   335522 | COc1ccc2c(c1)CCC3N2CC[NH+](C3)CCC4CCC(CC4)NC(=O)c5ccc6ccccc6c5
      |       0.75
   335399 | COc1ccc2c(c1)CCC3N2CC[NH+](C3)CCC4CCC(CC4)NC(=O)c5cc6ccccc6nc5
      |   0.742424
   553126 | COc1ccc2c(c1)CCC3N2CC[NH+](C3)CCC4CCC(CC4)NC(=O)c5ccc6ccccc6c5.[Cl-]
|   0.738462
   335797 | COc1ccc2c(c1)CCC3N2CC[NH+](C3)CCC4CCC(CC4)NC(=O)c5cccc6c5nccc6
      |   0.691176
(10 rows)

Time: 401.118 ms
cryst=#

Or with the molecule that took 2.7s in my first version:

cryst=# SELECT  c.molregno, c.smiles, sq.similarity
cryst-# FROM    chembl c,
cryst-#         (
cryst(#           SELECT molregno, fp %
make_circular_fp('c1ccc(cc1)c2cccc(c2)O') as similarity
cryst(#             FROM fps
cryst(#         ORDER BY 2 DESC
cryst(#         LIMIT 10
cryst(#         ) sq
cryst-# where   c.molregno = sq.molregno;
 molregno |                    smiles                    | similarity
----------+----------------------------------------------+------------
   203542 | c1ccc(cc1)c2cccc(c2)O                        |          1
   513094 | c1cc(cc(c1)O)c2ccc(cc2)c3cccc(c3)O           |     0.8125
   107248 | c1cc(cc(c1)c2cccc(c2)O)c3cccc(c3)O           |       0.75
   513095 | c1cc(cc(c1)O)c2ccc(cc2)c3ccc(cc3)O           |   0.722222
   513093 | c1cc(cc(c1)c2cccc(c2)O)c3ccc(cc3)O           |   0.684211
   625818 | c1ccc(cc1)c2cc(nc(c2)c3cccc(c3)O)c4cccc(c4)O |   0.666667
   625897 | c1ccc(cc1)c2cc(cc(n2)c3ccccc3)c4cccc(c4)O    |   0.666667
   203018 | c1cc(cc(c1)O)c2ccc(cc2)F                     |       0.65
   625712 | c1ccc(cc1)c2cc(nc(c2)c3cccc(c3)O)c4ccccc4    |   0.636364
   512953 | c1ccc(cc1)c2ccc(s2)c3cccc(c3)O               |   0.619048
(10 rows)

Time: 408.728 ms
cryst=#

Hope that helps,

Adrian

------------------------------------------------------------------------------
RSA(R) Conference 2012
Save $700 by Nov 18
Register now
http://p.sf.net/sfu/rsa-sfdev2dev1
_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss

Reply via email to