Hello All,
We've also been using the CDK substructure search for a while now, in our
biometa database (
http://cheminf.cmbi.ru.nl/cgi-bin/biometa/biometa.py?molecules%20jme).
Here is what we do to keep things manageable performance-wise:
* Pre-calculate several properties for all entries, namely:
- fingerprints
- an aromaticity flag
- number of distinct elements
- number of rings (either aromatic or non-aromatic)
- total number of atoms
- number of atoms per element (in a separate table of (molecule_id,
element_number, element_count))
* When querying, we calculate the same properties for the search molecule
and then write a fairly long SQL query that narrows the results as much as
possible, as cheaply as possible:
- first condition is the aromatic/non-aromatic flag (a single flag comparison
--> cheapest you'll ever find)
- next condition is the element count (a simple, cheap numerical
'>=' comparison)
- then the ring count, also '>='
- then the fingerprint comparison. We let the database do the bitwise AND
and the == test, because postgres has native bit-array operations, whereas
our python binding (pgsql) doesn't understand bit-arrays (it converts them
to strings).
- finally a per-element atom-count comparison. This comes last because in the
current set-up it has to join the element-count table to the molecule table,
which is probably slower.
* The mol-files for the resulting mol_ids are then concatenated into one
large sdf-file, which is fed through a CDK SDFSubstructureFinder from a very
old 2006 SVN version (it has since been dropped, but I dare not replace it
because everything will probably implode if I try).
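The prescreening query described above could look roughly like the sketch below. It is only an illustration: it uses Python's built-in sqlite3 module with plain integers as a stand-in for postgres bit-array columns, and the schema, column names, toy fingerprints and the exact aromaticity rule (an aromatic query only matches aromatic candidates) are all assumptions, not our actual set-up.

```python
import sqlite3

# Hypothetical schema; SQLite integers stand in for postgres bit arrays.
SCHEMA = """
CREATE TABLE molecule (
    mol_id      INTEGER PRIMARY KEY,
    is_aromatic INTEGER NOT NULL,
    n_elements  INTEGER NOT NULL,
    n_rings     INTEGER NOT NULL,
    fp          INTEGER NOT NULL
);
CREATE TABLE element_count (
    mol_id         INTEGER NOT NULL,
    element_number INTEGER NOT NULL,
    element_count  INTEGER NOT NULL
);
"""

# Toy rows: (mol_id, is_aromatic, n_elements, n_rings, fp, {atomic number: count}).
# Heavy atoms only; the fingerprints are made up.
DEMO_ROWS = [
    (1, 1, 1, 1, 0b0111, {6: 6}),         # benzene
    (2, 1, 2, 1, 0b1111, {6: 6, 17: 1}),  # chlorobenzene
    (3, 0, 1, 1, 0b0011, {6: 6}),         # cyclohexane
    (4, 1, 1, 2, 0b0111, {6: 10}),        # naphthalene
]


def build_demo_db():
    """Create an in-memory database filled with the toy rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    for mol_id, aromatic, n_el, n_rings, fp, elements in DEMO_ROWS:
        conn.execute("INSERT INTO molecule VALUES (?, ?, ?, ?, ?)",
                     (mol_id, aromatic, n_el, n_rings, fp))
        conn.executemany("INSERT INTO element_count VALUES (?, ?, ?)",
                         [(mol_id, el, n) for el, n in elements.items()])
    return conn


def prescreen(conn, query):
    """Return the mol_ids that pass every cheap filter for the query molecule."""
    sql = """
        SELECT m.mol_id FROM molecule m
        WHERE m.is_aromatic >= :aromatic      -- assumed aromaticity rule
          AND m.n_elements >= :n_elements
          AND m.n_rings >= :n_rings
          AND (m.fp & :fp) = :fp              -- all query bits set in candidate"""
    params = {k: query[k] for k in ("aromatic", "n_elements", "n_rings", "fp")}
    # One semi-join against element_count per element present in the query.
    for i, (element, count) in enumerate(sorted(query["elements"].items())):
        sql += f"""
          AND EXISTS (SELECT 1 FROM element_count e
                      WHERE e.mol_id = m.mol_id
                        AND e.element_number = :el{i}
                        AND e.element_count >= :cnt{i})"""
        params[f"el{i}"] = element
        params[f"cnt{i}"] = count
    sql += "\n        ORDER BY m.mol_id"
    return [row[0] for row in conn.execute(sql, params)]


BENZENE = {"aromatic": 1, "n_elements": 1, "n_rings": 1,
           "fp": 0b0111, "elements": {6: 6}}
CHLOROBENZENE = {"aromatic": 1, "n_elements": 2, "n_rings": 1,
                 "fp": 0b1111, "elements": {6: 6, 17: 1}}
```

Calling prescreen(build_demo_db(), BENZENE) returns the candidate ids that still need a real substructure match. Note that the order of the conditions in the SQL text does not force the order in which the database evaluates them; that is up to the query planner.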
It's all a bit hack-ish, but it works fairly well since we don't have much
traffic. I never actually tested in which order the postgres query planner
evaluates these conditions, but it is fairly smart.
Things that I would change if I rewrote it all today would be:
1) store serialized IAtomContainers in the database, to prevent having to
re-read the molfiles from disk every time
2) Use a faster substructure matcher (SMSD sounds good)
3) get rid of the element-count table, and rather add columns for our
most-prevalent atom-types to the molecule rows (namely: C,N,O,H,P). This
avoids the overhead of the join.
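As a rough illustration of point 3, the per-element conditions could then be generated directly against the molecule row, without a join. The column names (n_c, n_n, ...) and the helper below are hypothetical; elements without a dedicated column are simply not screened at this stage and are left to the fingerprint and the final substructure match.

```python
# Hypothetical per-element columns on the molecule table.
ELEMENT_COLUMNS = {"C": "n_c", "N": "n_n", "O": "n_o", "H": "n_h", "P": "n_p"}


def element_conditions(query_counts):
    """Build (sql_fragment, params) for elements that have their own column.

    query_counts maps element symbols to atom counts in the search molecule.
    """
    clauses, params = [], []
    for symbol, column in ELEMENT_COLUMNS.items():
        count = query_counts.get(symbol, 0)
        if count > 0:
            # A candidate needs at least as many atoms of this element.
            clauses.append(f"{column} >= ?")
            params.append(count)
    return " AND ".join(clauses), params
```

The fragment can be appended to the WHERE clause of the prescreening query, with the parameters passed along as placeholders rather than pasted into the SQL string.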
Thomas, as Nina already mentioned, you shouldn't load all the molecules
before the loop, rather load them one-by-one IN the loop. This means you
only ever need one molecule in-memory per query, which saves MASSIVELY on
memory requirements.
// Pseudocode example of what you have:
keyList = doTheSearch(); // get molecule IDs for potential candidates
// This loads ALL your molecules into memory, OUCH!!
map<key, molecule> theMap = manager.getMolecules(keyList);
for (key in theMap.keyset) {
    mol = theMap.get(key);
    if (searchtarget.isSubgraphOf(mol)) {
        results.add(mol);
    }
}

// Pseudocode for a more efficient way to do it (don't pre-load all molecules):
keyList = doTheSearch(); // get molecule IDs for potential candidates
for (key in keyList) {
    // IMPROVEMENT: only load one molecule at a time!
    mol = manager.getMolecule(key);
    if (searchtarget.isSubgraphOf(mol)) {
        results.add(mol);
    }
}
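The same one-at-a-time idea can be written as a generator; here is a minimal Python sketch. The callables load_molecule and is_substructure are hypothetical stand-ins for manager.getMolecule and searchtarget.isSubgraphOf, and the "molecules" in any demo would just be whatever load_molecule returns.

```python
def matching_molecules(candidate_ids, load_molecule, is_substructure):
    """Yield the candidates that match, loading only one molecule at a time.

    candidate_ids   -- ids that survived prescreening
    load_molecule   -- hypothetical callable: id -> molecule object
    is_substructure -- hypothetical callable: molecule -> bool
    """
    for mol_id in candidate_ids:
        mol = load_molecule(mol_id)  # loaded inside the loop, not up front
        if is_substructure(mol):
            yield mol
        # 'mol' is rebound on the next iteration, so a non-matching
        # molecule can be garbage-collected right away.
```

If the caller collects the hits into a list, memory scales with the number of hits rather than with the number of candidates; if the hits are streamed to the client instead, only one molecule needs to be alive at any time.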
Hope this helps,
Best regards,
Jules Kerssemakers
On 30 November 2010 21:19, Nina Jeliazkova <jeliazkova.n...@gmail.com> wrote:
> Hi Thomas,
>
> On 30 November 2010 21:58, Thomas Strunz <beginn...@hotmail.de> wrote:
>
>> Hi Nina,
>>
>> I certainly have more than one IAtomContainer in memory at a time, so I agree
>> that might be an issue. But if screening returns, let's say, 1000 hits, 1000
>> subgraph matches must be done, and hence all 1000 molecules must be created
>> first. So you would suggest reading each one separately from the database
>> after a subgraph match returns?
>>
>
> What we do is get the database structure identifiers from
> prescreening and read the structures one by one for subgraph matching. A few
> thousand IAtomContainers are fine for a desktop application, but on the server
> side there could be multiple queries at the same time, multiplying the
> thousands to an unreasonable number.
>
>
>> A second issue: if the query molecule is a common fragment in the
>> database, let's assume benzene, and something like 80% of the fingerprints
>> match, how do you handle that and keep performance? Subgraph matches on so
>> many structures will not perform well. How can you prevent that with very
>> common substructures?
>>
>
> We have several levels of prescreening; fingerprints alone are not
> sufficient for reasonable performance. We also precalculate aromaticity
> flags to avoid calculating them on the fly, and cache the final results.
> You could get an overview from this poster from QSAR2010
> http://www.ideaconsult.net/downloads/rhodes/posters/SMARTS.pdf .
>
> Regards,
> Nina
>
>
>>
>> Regards,
>>
>> Thomas
>>
>>
>>
>> Just my two cents.
>>
>> Besides prescreening, keeping the number of IAtomContainer objects in memory
>> to a minimum is the key to performance. As less than one object doesn't make
>> sense :) one IAtomContainer at a time is the best. Fingerprints can be
>> pre-calculated and need not be loaded into memory at all; let SQL do the
>> prescreening.
>>
>> We've been doing similar things (CDK, relational database, no cartridges)
>> in ambit (ambit.sourceforge.net) for quite a few years already. There is a
>> downloadable standalone application and a servlet container application war
>> file (to run your own service), as well as running OpenTox REST services
>> for substructure searching, e.g.
>>
>>
>> https://ambit.uni-plovdiv.bg:8443/ambit2/query/smarts?search=c1ccccc1[Cl,Br,F]
>>
>>
>> http://apps.ideaconsult.net:8080/ambit2/query/smarts?search=c1ccccc1[Cl,Br,F,I]
>>
>> Regards,
>> Nina
>>
>>
>> Regards,
>>
>> Thomas
>>
>>
>> ------------------------------------------------------------------------------
>> Increase Visibility of Your 3D Game App & Earn a Chance To Win $500!
>> Tap into the largest installed PC base & get more eyes on your game by
>> optimizing for Intel(R) Graphics Technology. Get started today with the
>> Intel(R) Software Partner Program. Five $500 cash prizes are up for grabs.
>> http://p.sf.net/sfu/intelisp-dev2dev
>> _______________________________________________
>> Cdk-user mailing list
>> Cdk-user@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/cdk-user