Hi Greg, I am more or less done with the improvements of my GIST code so here is a description of my approach as promised. I basically started with the GIST "skeleton" functions shown in the PostgreSQL documentation and added more things step by step. The first version that I described in my blog used all the OpenEye graphsim functions to calculate similarities etc. in order to keep the code as simple as possible. The biggest difference between my code and the RDKit GIST is that I use the tanimoto distance (or Soergel distance) instead of the Hamming distance in the penalty function. The latter performed pretty badly, that's why I tried different distance metrics (e.g. mean hamming distance etc.) and the Soergel one turned out to be the best for the strategies I implemented. Obviously, it works better for tanimoto/dice and tversky than cosine/euclidean/manhattan but I use the former much more often. The picksplit function I use is much simpler than the RDKit one as well; it simply takes the GIST entry vector, finds the two most dissimilar and splits the entries depending on their similarity to them. The timings shown on my blog were recorded with that first version.
Since then I tried different distance measures in my penalty function but have not found a better one yet. What made a big difference though was to move the similarity calculation to C thereby avoiding the recreation of the OEFingerPrint objects. I also made some changes to the upper bound calculations that occur in case of internal GIST nodes (I think I had a bug in the first version). Trying to make this as fast as possible, I looked on Google for some crazy bit hacks and eventually found a SSE4-based implemention written by Imran Haque (http://pubs.acs.org/doi/abs/10.1021/ci200235e) and modified it slightly to fit my needs. I have written a new test script (using sqlalchemy) to time the queries more exhaustively; the script selects 100 random smiles strings from ChEMBL (see the code on my blog) and does a KNN-GIST based search against ChEMBL11 for various strategies. As a result, running this script I get an average performance of between 4 and 5 seconds to do all 100 random queries (Tanimoto) or including printing the result tuples. In the worst case, i.e. using the Cosine metric, I get between 15-20 seconds for 100 random queries. My testing machine is I7-920 with postgres on virtualbox. The POPCNT-based versions are 4-5x faster than the graphsim built-in distance metrics (querying all of ChEMBL without index), which are already quite fast. This means of course that my cartridge is not very portable (nehalem upwards) but for me that's not a problem since I implemented it for practice purposes. You might not have this luxury though with RDKit. I still think that I could improve the timings even further with a better picksplit function - to be honest, mine is not that great - a linear scan with any metric and smiles string takes less than 0.5s with the SSE4 functions. Here is an example with a random smiles string: cryst=# SELECT c.molregno, c.smiles, sq.similarity cryst-# FROM chembl c, cryst-# ( cryst(# SELECT molregno, fp % make_circular_fp('COc1ccc2c(c1)CCC3N2CC[NH+](C3)CCC4CCC(CC4)NC(=O)c5cc6ccccc6cn5') as similarity cryst(# FROM fps cryst(# ORDER BY 2 DESC cryst(# LIMIT 10 cryst(# ) sq cryst-# where c.molregno = sq.molregno; molregno | smiles | similarity ----------+----------------------------------------------------------------------+------------ 335431 | COc1ccc2c(c1)CCC3N2CC[NH+](C3)CCC4CCC(CC4)NC(=O)c5cc6ccccc6cn5 | 1 335330 | COc1ccc2c(c1)CCC3N2CC[NH+](C3)CCC4CCC(CC4)NC(=O)c5ccc6ccccc6n5 | 0.765625 335394 | COc1ccc2c(c1)CCC3N2CC[NH+](C3)CCC4CCC(CC4)NC(=O)c5cnc6ccccc6n5 | 0.753846 321365 | COc1ccc2c(c1)CC[C@H]3N2CC[NH+](C3)CCC4CCC(CC4)NC(=O)c5ccc6ccccc6c5 | 0.75 470091 | COc1ccc2c(c1)CC[C@H]3N2CC[NH+](C3)CCC4CCC(CC4)NC(=O)c5ccc6ccccc6c5 | 0.75 321561 | COc1ccc2c(c1)CC[C@@H]3N2CC[NH+](C3)CCC4CCC(CC4)NC(=O)c5ccc6ccccc6c5 | 0.75 335522 | COc1ccc2c(c1)CCC3N2CC[NH+](C3)CCC4CCC(CC4)NC(=O)c5ccc6ccccc6c5 | 0.75 335399 | COc1ccc2c(c1)CCC3N2CC[NH+](C3)CCC4CCC(CC4)NC(=O)c5cc6ccccc6nc5 | 0.742424 553126 | COc1ccc2c(c1)CCC3N2CC[NH+](C3)CCC4CCC(CC4)NC(=O)c5ccc6ccccc6c5.[Cl-] | 0.738462 335797 | COc1ccc2c(c1)CCC3N2CC[NH+](C3)CCC4CCC(CC4)NC(=O)c5cccc6c5nccc6 | 0.691176 (10 rows) Time: 401.118 ms cryst=# Or with the molecule that took 2.7s in my first version: cryst=# SELECT c.molregno, c.smiles, sq.similarity cryst-# FROM chembl c, cryst-# ( cryst(# SELECT molregno, fp % make_circular_fp('c1ccc(cc1)c2cccc(c2)O') as similarity cryst(# FROM fps cryst(# ORDER BY 2 DESC cryst(# LIMIT 10 cryst(# ) sq cryst-# where c.molregno = sq.molregno; molregno | smiles | similarity ----------+----------------------------------------------+------------ 203542 | c1ccc(cc1)c2cccc(c2)O | 1 513094 | c1cc(cc(c1)O)c2ccc(cc2)c3cccc(c3)O | 0.8125 107248 | c1cc(cc(c1)c2cccc(c2)O)c3cccc(c3)O | 0.75 513095 | c1cc(cc(c1)O)c2ccc(cc2)c3ccc(cc3)O | 0.722222 513093 | c1cc(cc(c1)c2cccc(c2)O)c3ccc(cc3)O | 0.684211 625818 | c1ccc(cc1)c2cc(nc(c2)c3cccc(c3)O)c4cccc(c4)O | 0.666667 625897 | c1ccc(cc1)c2cc(cc(n2)c3ccccc3)c4cccc(c4)O | 0.666667 203018 | c1cc(cc(c1)O)c2ccc(cc2)F | 0.65 625712 | c1ccc(cc1)c2cc(nc(c2)c3cccc(c3)O)c4ccccc4 | 0.636364 512953 | c1ccc(cc1)c2ccc(s2)c3cccc(c3)O | 0.619048 (10 rows) Time: 408.728 ms cryst=# Hope that helps, Adrian ------------------------------------------------------------------------------ RSA(R) Conference 2012 Save $700 by Nov 18 Register now http://p.sf.net/sfu/rsa-sfdev2dev1 _______________________________________________ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss