Hi Jules,
Thank you for the detailed explanation. I have some additional questions:
- Aromaticity flag: so you store a boolean value (yes/no)? What is the advantage
of this? Put differently, shouldn't that already be covered by the
fingerprint?
- CDKHueckelAromaticityDetector: does this modify the passed-in IAtomContainer
or does it just return true/false? (E.g. in the smsd.Isomorphism.init method,
could I set the flag for cleanAndConfigureMolecule to false if that was already
done after reading the molecule?)
- About loading from molfiles: is it much slower than loading a serialized
molecule, assuming the molfile is stored in the database?
I'm already working on an approach that puts the molecules in a BlockingQueue,
i.e. one thread reads from the database and one (or more) others do the
subgraph matching. That way I can limit the number of IAtomContainers in
memory. Memory usage is reduced dramatically, but I still have issues (the test
freezes/blocks and does not return).
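
A common reason for a BlockingQueue consumer that never returns is that take()
blocks forever once the producer has finished; whether that is the cause here
is only a guess. A minimal sketch of the reader/matcher split with an
end-of-stream sentinel ("poison pill"), assuming molecules are stood in for by
plain Strings and `matches` is a hypothetical placeholder for the subgraph
test:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Predicate;

class QueuedMatcher {
    // Unique sentinel object; compared by identity, so no real entry can equal it.
    private static final String POISON = new String("<end-of-stream>");

    static List<String> run(List<String> candidates, Predicate<String> matches,
                            int consumers, int capacity) throws InterruptedException {
        // The bounded capacity is what limits the number of molecules in memory.
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(capacity);
        List<String> hits = Collections.synchronizedList(new ArrayList<>());
        ExecutorService pool = Executors.newFixedThreadPool(consumers);

        for (int i = 0; i < consumers; i++) {
            pool.execute(() -> {
                try {
                    while (true) {
                        String mol = queue.take();
                        if (mol == POISON) break;      // no more work: leave the loop
                        if (matches.test(mol)) hits.add(mol);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        // Producer (here simply the calling thread): put() blocks while the queue is full.
        for (String mol : candidates) queue.put(mol);
        // One sentinel per consumer, otherwise some consumers wait in take() forever.
        for (int i = 0; i < consumers; i++) queue.put(POISON);

        pool.shutdown();
        if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
            pool.shutdownNow();
            throw new IllegalStateException("consumers did not finish in time");
        }
        return hits;
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> found = run(Arrays.asList("c1ccccc1O", "CCO", "c1ccccc1N"),
                                 s -> s.contains("c1ccccc1"), 2, 1);
        System.out.println(found.size() + " hits");
    }
}
```

The order of the hits depends on thread scheduling. If the matcher can throw,
the consumer loop would also need to catch that, since a dead consumer leaves
the producer blocked in put().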
I agree that I need some additional steps, but the benzene example I mentioned
remains. If the query fragment is very common, additional screening steps
won't help because there really are that many structures that match. My idea is
to add a table with such fragments and a table with the molecules that contain
each fragment. The fragment can be stored as canonical SMILES or InChIKey (or
another canonical form). The first step could then be to check whether the
query is such a fragment and select all the matching IDs (a SELECT with one
join), so subgraph matching can be avoided. This could be enhanced by
automatically adding query fragments that return too many hits ("the system
learns").
It could also be done the other way round: do the fingerprint screen first and
only do this "fragment" screen if there are too many hits after fingerprinting.
A simpler approach would be to always return at most a defined number of hits,
e.g. 500, and ask the user to define the query structure more precisely.
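
A minimal in-memory sketch of the "learning" fragment lookup combined with the
hit cap, under stated assumptions: the class `FragmentIndex`, its thresholds
and the `fullSearch` callback are hypothetical names, the query is assumed to
arrive already in a canonical form, and a HashMap stands in for the two
database tables.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

class FragmentIndex {
    // canonical fragment (e.g. canonical SMILES) -> IDs of molecules containing it
    private final Map<String, List<Integer>> idsByFragment = new HashMap<>();
    private final int learnThreshold; // remember fragments with more hits than this
    private final int maxHits;        // never return more than this many IDs

    FragmentIndex(int learnThreshold, int maxHits) {
        this.learnThreshold = learnThreshold;
        this.maxHits = maxHits;
    }

    boolean isKnown(String canonicalQuery) {
        return idsByFragment.containsKey(canonicalQuery);
    }

    List<Integer> search(String canonicalQuery,
                         Function<String, List<Integer>> fullSearch) {
        List<Integer> ids = idsByFragment.get(canonicalQuery);
        if (ids == null) {
            // Unknown fragment: run the expensive screen + subgraph match once.
            ids = fullSearch.apply(canonicalQuery);
            if (ids.size() > learnThreshold) {
                // "The system learns": store the result for the next identical query.
                idsByFragment.put(canonicalQuery, new ArrayList<>(ids));
            }
        }
        // Cap the result; the caller can ask the user to refine the query.
        return new ArrayList<>(ids.subList(0, Math.min(maxHits, ids.size())));
    }

    public static void main(String[] args) {
        FragmentIndex index = new FragmentIndex(2, 500);
        Function<String, List<Integer>> slow = q -> Arrays.asList(1, 2, 3);
        index.search("c1ccccc1", slow);
        System.out.println("learned: " + index.isKnown("c1ccccc1"));
    }
}
```

Stored hit lists go stale when molecules are added or removed, so a real
version would have to update or invalidate them on every database change.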
Thanks for your help,
Thomas
From: j.kerssemak...@cmbi.ru.nl
Date: Wed, 1 Dec 2010 12:10:28 +0100
Subject: Re: [Cdk-user] Substructure Searching, Fingerprints and cdk-1.3.7
Isomorphism Class
To: beginn...@hotmail.de
CC: cdk-user@lists.sourceforge.net
Hello All,
We've been using the CDK substructure search for a while now too, in our biometa
database (http://cheminf.cmbi.ru.nl/cgi-bin/biometa/biometa.py?molecules%20jme).
Here is what we do to keep things manageable performance-wise:
* Pre-calculate several statistics for all entries, namely:
- fingerprints
- an aromaticity flag
- number of different elements
- number of rings (either aromatic or non-aromatic)
- total number of atoms
- number of atoms per element (in a separate table of (molecule_id,
element_number, element_count))
* When querying, we calculate the same properties for the search molecule and
then write a pretty long SQL query that limits the results as much as possible
as cheaply as possible:
- first condition is the aromatic/non-aromatic flag (a single flag comparison
--> cheapest you'll ever find)
- next condition is the element count (a simple, cheap numerical '>='
comparison)
- then ring count '>='
- then the fingerprint comparison. We let the database do the logical AND and
==, because postgres has native bit-array operations, whereas our python
binding (pgsql) doesn't understand bit-arrays (it converts them to strings).
- finally a per-element atom-count comparison. This is last because in the
current set-up it has to join the element-count table to the molecule table,
which is probably slower.
* The mol-files for the resultant mol_id's are then concatenated into a large
sdf-file which is fed through a CDK SDFSubstructureFinder from a very old 2006
SVN version (dropped since, but I dare not replace it because everything will
probably implode if I try).
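
The prescreening conditions above could be assembled roughly as follows. This
is only a sketch, not the actual biometa query: the table name `molecule` and
the columns `id`, `is_aromatic`, `element_kinds`, `ring_count` and
`fingerprint` are made-up names (only the element_count table layout comes from
the list above). It also assumes that the aromaticity condition is applied only
when the query itself is aromatic, and it uses EXISTS subqueries where the
original set-up uses a join.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class PrescreenQuery {
    /**
     * Builds a PostgreSQL prescreening query, cheapest conditions first.
     * fingerprintBits is the query fingerprint as a string of 0/1 characters
     * with the same length as the stored bit strings.
     */
    static String build(boolean queryAromatic, int elementKinds, int ringCount,
                        String fingerprintBits, Map<Integer, Integer> atomsPerElement) {
        // The bit string is embedded as a literal, so only 0 and 1 are accepted.
        if (!fingerprintBits.matches("[01]+")) {
            throw new IllegalArgumentException("fingerprint must consist of 0 and 1");
        }
        List<String> where = new ArrayList<>();
        if (queryAromatic) {
            where.add("m.is_aromatic");                       // single flag test
        }
        where.add("m.element_kinds >= " + elementKinds);      // cheap numeric test
        where.add("m.ring_count >= " + ringCount);
        // Every bit set in the query must also be set in the stored fingerprint.
        where.add("(m.fingerprint & B'" + fingerprintBits + "') = B'"
                + fingerprintBits + "'");
        // Per-element atom counts; TreeMap only makes the output order stable.
        for (Map.Entry<Integer, Integer> e : new TreeMap<>(atomsPerElement).entrySet()) {
            where.add("EXISTS (SELECT 1 FROM element_count ec"
                    + " WHERE ec.molecule_id = m.id"
                    + " AND ec.element_number = " + e.getKey()
                    + " AND ec.element_count >= " + e.getValue() + ")");
        }
        return "SELECT m.id FROM molecule m WHERE " + String.join(" AND ", where);
    }

    public static void main(String[] args) {
        Map<Integer, Integer> benzene = new HashMap<>();
        benzene.put(6, 6); // six carbon atoms
        System.out.println(build(true, 1, 1, "0110", benzene));
    }
}
```

The textual order of the conditions does not bind the database to evaluate
them in that order; that remains up to the query planner.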
It's all a bit hack-ish, but it works fairly well since we don't have much
traffic. I never actually tested how the postgres query planner handles this,
but it is fairly smart.
Things that I would change if I rewrote it all today would be:
1) store serialized IAtomContainers in the database, to prevent having to
re-read the molfiles from disk every time
2) Use a faster substructure matcher (SMSD sounds good)
3) get rid of the element-count table, and rather add columns for our
most-prevalent atom-types to the molecule rows (namely: C,N,O,H,P). This avoids
the overhead of the join.
Thomas, as Nina already mentioned, you shouldn't load all the molecules before
the loop, rather load them one-by-one IN the loop. This means you only ever
need one molecule in-memory per query, which saves MASSIVELY on memory
requirements.
// Pseudocode example of what you have:
keyList = doTheSearch(); // get molecule IDs for potential candidates
// This loads ALL your molecules into memory, OUCH!!
map<key, molecule> theMap = manager.getMolecules(keyList);
for (key in theMap.keyset) {
    mol = theMap.get(key);
    if (searchtarget.isSubgraphOf(mol)) {
        results.add(mol)
    }
}
// Pseudocode for a more efficient way to do it (don't pre-load all molecules):
keyList = doTheSearch(); // get molecule IDs for potential candidates
for (key in keyList) {
    // IMPROVEMENT: only load one molecule at a time!
    mol = manager.getMolecule(key);
    if (searchtarget.isSubgraphOf(mol)) {
        results.add(mol)
    }
}
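
A compilable Java sketch of the second variant, kept independent of the CDK:
`loadMolecule` and `containsQuery` are hypothetical stand-ins for the database
read and the subgraph test, and the molecule type is left generic.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Predicate;

class OneAtATime {
    /** Returns the keys of all candidates that contain the query. */
    static <K, M> List<K> search(List<K> candidateKeys,
                                 Function<K, M> loadMolecule,
                                 Predicate<M> containsQuery) {
        List<K> hits = new ArrayList<>();
        for (K key : candidateKeys) {
            // Only this one molecule is referenced at any time; the previous
            // one becomes garbage as soon as the variable is reassigned.
            M mol = loadMolecule.apply(key);
            if (containsQuery.test(mol)) {
                hits.add(key); // keep the key, not the molecule
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Map<Integer, String> db = new HashMap<>();
        db.put(1, "c1ccccc1O");
        db.put(2, "CCO");
        db.put(3, "c1ccccc1N");
        List<Integer> hits = search(Arrays.asList(1, 2, 3), db::get,
                                    m -> m.contains("c1ccccc1"));
        System.out.println(hits);
    }
}
```

Unlike the pseudocode, this collects keys rather than molecules, so a query
with many hits does not keep all matching molecules in memory either.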
Hope this helps,
Best regards,
Jules Kerssemakers
On 30 November 2010 21:19, Nina Jeliazkova <jeliazkova.n...@gmail.com> wrote:
Hi Thomas,
On 30 November 2010 21:58, Thomas Strunz <beginn...@hotmail.de> wrote:
Hi Nina,
I sure have more than one IAtomContainer in memory at a time, so I agree that
might be an issue. But if screening returns, let's say, 1000 hits, 1000
subgraph matches must be done and hence all 1000 molecules must be created
first. So you would suggest reading each one separately from the database after
a subgraph match returns?
What we are doing is getting database structure identifiers from prescreening
and reading structures one by one for subgraph matching. A few thousand
IAtomContainers is fine for a desktop application, but on the server side one
could have multiple queries at the same time, multiplying the thousands to an
unreasonable number.
A second issue: if the query molecule is a common fragment in the database,
let's assume benzene, and something like 80% of the fingerprints match, how do
you handle that and keep performance? Subgraph matches on so many structures
will not perform well. How can you prevent that with very common substructures?
We have several levels of prescreening; fingerprints alone are not sufficient
for reasonable performance. We also use precalculated aromaticity flags, to
avoid calculating that on the fly, and caching of the final results. You can
get an overview from this poster from QSAR2010:
http://www.ideaconsult.net/downloads/rhodes/posters/SMARTS.pdf .
Regards,
Nina
Regards,
Thomas
Just my two cents.
Besides prescreening, keeping the number of IAtomContainer objects in memory to
a minimum is the key to performance. As less than one object doesn't make
sense :) one IAtomContainer at a time is the best. Fingerprints can be
pre-calculated and need not be loaded in memory at all; let SQL do the
prescreening.
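
For reference, the fingerprint prescreen itself is a subset test: a candidate
can only contain the query if every bit set in the query fingerprint is also
set in the candidate's fingerprint. A small java.util.BitSet sketch of that
test (the class and method names are made up for illustration; in the set-up
described here the database would do the equivalent comparison):

```java
import java.util.BitSet;

class FingerprintScreen {
    /** True if all bits of the query are present in the target. */
    static boolean mayContain(BitSet target, BitSet query) {
        BitSet missing = (BitSet) query.clone(); // do not modify the caller's query
        missing.andNot(target);                  // query bits absent from the target
        return missing.isEmpty();
    }

    /** Convenience helper: builds a BitSet from bit positions. */
    static BitSet bits(int... positions) {
        BitSet set = new BitSet();
        for (int p : positions) {
            set.set(p);
        }
        return set;
    }

    public static void main(String[] args) {
        BitSet target = bits(1, 4, 7, 9);
        System.out.println(mayContain(target, bits(4, 9))); // all query bits present
        System.out.println(mayContain(target, bits(4, 5))); // bit 5 is missing
    }
}
```

A passing screen only means the candidate might match; the subgraph match is
still needed to confirm it.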
We've been doing similar things (CDK, relational database, no cartridges) in
ambit (ambit.sourceforge.net) for quite a few years already. There is a
downloadable standalone application and a servlet container application war
file (to run your own service), as well as running OpenTox REST services for
substructure searching, e.g.
https://ambit.uni-plovdiv.bg:8443/ambit2/query/smarts?search=c1ccccc1[Cl,Br,F]
http://apps.ideaconsult.net:8080/ambit2/query/smarts?search=c1ccccc1[Cl,Br,F,I]
Regards,
Nina
Regards,
Thomas
------------------------------------------------------------------------------
Increase Visibility of Your 3D Game App & Earn a Chance To Win $500!
Tap into the largest installed PC base & get more eyes on your game by
optimizing for Intel(R) Graphics Technology. Get started today with the
Intel(R) Software Partner Program. Five $500 cash prizes are up for grabs.
http://p.sf.net/sfu/intelisp-dev2dev
_______________________________________________
Cdk-user mailing list
Cdk-user@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/cdk-user