Hi all,
I thought for quite some time about whether I should submit the following
questions and comments, because I'm not an expert and I fear the text could get
so long that no one will actually read it.
What I'm trying to achieve is a simple substructure search using CDK. The
general idea is that the main work is done in this "search add-in", or whatever
you want to call it, and not in the database (as with cartridges), so you
could use it with any typical RDBMS. Performance matters to a certain extent,
but the goal is not to break records, which would not be possible anyway.
I now have a functional version using cdk-1.2.7, of course with severe
limitations (memory usage, stereochemistry, no proof that it finds the correct
results in a real dataset).
It creates fingerprints for all molecules in the configured database and
stores them in a new table for faster reloading. It supplies methods to add,
remove or update a fingerprint. Fingerprints are held in memory, which I
believed should not be an issue even for much larger datasets than intended
(1024 bits * 1 million = 128 MB). But it probably is an issue, since that
calculation is too naive.
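As a rough check of the estimate above, here is a minimal sketch that packs
fixed-length fingerprints into one flat long[] instead of keeping one BitSet
object per molecule. The class name PackedFingerprints and its methods are made
up for illustration and are not part of CDK. The figures cover the raw bits
only and ignore the keys and any JVM object overhead.

```java
import java.util.Arrays;
import java.util.BitSet;

/** Hypothetical store (not a CDK class): all fingerprints packed into one long[]. */
class PackedFingerprints {
    private final int wordsPerFingerprint; // 1024 bits -> 16 longs
    private final long[] words;

    PackedFingerprints(int count, int bitsPerFingerprint) {
        this.wordsPerFingerprint = (bitsPerFingerprint + 63) / 64;
        this.words = new long[count * wordsPerFingerprint];
    }

    /** Raw payload in bytes (count * bits / 8), without per-object overhead. */
    static long rawBytes(long count, int bitsPerFingerprint) {
        return count * ((bitsPerFingerprint + 63) / 64) * 8L;
    }

    /** Copies the bits of a BitSet into slot 'index'. */
    void set(int index, BitSet fingerprint) {
        long[] src = fingerprint.toLongArray(); // trailing zero words are omitted
        int base = index * wordsPerFingerprint;
        Arrays.fill(words, base, base + wordsPerFingerprint, 0L);
        System.arraycopy(src, 0, words, base, Math.min(src.length, wordsPerFingerprint));
    }

    /** Reconstructs the fingerprint stored in slot 'index'. */
    BitSet get(int index) {
        int base = index * wordsPerFingerprint;
        return BitSet.valueOf(Arrays.copyOfRange(words, base, base + wordsPerFingerprint));
    }

    public static void main(String[] args) {
        System.out.println(rawBytes(1_000_000, 1024)); // 128000000 bytes = 128 MB
        System.out.println(rawBytes(65_000, 1024));    // 8320000 bytes, about 8.3 MB
    }
}
```

For the 65K structures in my sample the raw bits come to only about 8.3 MB, so
the fingerprints by themselves may not be what drives the memory consumption.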
The substructure search first screens using these fingerprints, then selects
the candidate molecules from the database (based on a numeric unique key),
tests each one for a subgraph match using UniversalIsomorphismTester, and
returns the keys of the matching molecules. (Maybe it should return the
molecules directly, but then it's unknown whether the structure viewer in use
can deal with CDK molecules.)
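The screening step described above boils down to a bit-subset test: a molecule
can only contain the query as a substructure if every bit set in the query
fingerprint is also set in the molecule's fingerprint. A minimal sketch using
only java.util.BitSet follows. The class FingerprintScreen and the map from key
to fingerprint are assumptions for illustration, not CDK API, and the keys that
survive still need the real subgraph test.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical screening helper (not part of CDK). */
class FingerprintScreen {

    /** True if every bit set in 'query' is also set in 'candidate'. */
    static boolean isCandidate(BitSet query, BitSet candidate) {
        BitSet missing = (BitSet) query.clone(); // do not modify the caller's BitSet
        missing.andNot(candidate);               // bits the query has but the candidate lacks
        return missing.isEmpty();
    }

    /** Returns the database keys whose fingerprints pass the screen. */
    static List<Long> screen(BitSet query, Map<Long, BitSet> fingerprintsByKey) {
        List<Long> keys = new ArrayList<Long>();
        for (Map.Entry<Long, BitSet> e : fingerprintsByKey.entrySet()) {
            if (isCandidate(query, e.getValue())) {
                keys.add(e.getKey());
            }
        }
        return keys;
    }

    public static void main(String[] args) {
        BitSet query = new BitSet(1024);
        query.set(5);
        query.set(42);

        BitSet a = new BitSet(1024); // superset of the query bits
        a.set(5);
        a.set(42);
        a.set(100);
        BitSet b = new BitSet(1024); // lacks bit 42
        b.set(5);

        Map<Long, BitSet> db = new LinkedHashMap<Long, BitSet>();
        db.put(1L, a);
        db.put(2L, b);
        System.out.println(screen(query, db)); // [1]
    }
}
```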
In general this works, but memory consumption is huge in the simple sample I
made. This sample uses HSQLDB and a subset of a subset of ZINC (small molecules
below a molecular weight threshold which I don't recall now, around 300). In
total it's around 65K structures.
With my arbitrary query structure I get 1586 hits using Fingerprinter and 1582
using ExtendedFingerprinter, both with default settings. Not too bad, but I
never actually checked whether these 1582 are all contained in the 1586; I just
assumed it, which is naive of course. The issue is: why is there a difference?
And how do I know the "correct" number of results? For comparison, I ran the
same query in commercial software, which found 1563 hits (much faster, with
much less memory, and while also displaying the structures; it really is pretty
amazing and makes you wonder). Could this be a stereochemistry issue? Do CDK
fingerprints include stereochemistry? Does the UniversalIsomorphismTester
consider it?
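Whether the 1582 hits really are a subset of the 1586 can be checked directly
on the returned keys. A minimal sketch with plain java.util sets follows; the
class HitSetDiff and the sample keys are made up for illustration.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper for comparing two sets of hit keys. */
class HitSetDiff {

    /** Keys present in 'left' but absent from 'right'. */
    static Set<Long> onlyIn(Set<Long> left, Set<Long> right) {
        Set<Long> diff = new TreeSet<Long>(left); // copy, so the inputs stay untouched
        diff.removeAll(right);
        return diff;
    }

    public static void main(String[] args) {
        // Made-up keys standing in for the two result lists.
        Set<Long> fingerprinterHits = new TreeSet<Long>(Arrays.asList(1L, 2L, 3L, 4L, 5L));
        Set<Long> extendedHits = new TreeSet<Long>(Arrays.asList(1L, 2L, 3L, 6L));

        System.out.println(fingerprinterHits.containsAll(extendedHits)); // false, key 6 is extra
        System.out.println(onlyIn(fingerprinterHits, extendedHits));     // [4, 5]
        System.out.println(onlyIn(extendedHits, fingerprinterHits));     // [6]
    }
}
```

The keys that show up in only one list are the molecules worth inspecting by
hand, and the same comparison would work against the keys found by the
commercial software.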
I read that the UniversalIsomorphismTester uses a slow algorithm for subgraph
matching, so I tried cdk-1.3.7 and the org.openscience.cdk.smsd.Isomorphism
class, but this does not seem to be ready yet? Or, more likely, I'm just doing
it wrong.
With the TurboSubstructure algorithm I get an inconsistent number of results:
sometimes NullPointerExceptions are thrown, and sometimes it runs through
without any exception but returns a different number of hits, such as 1010 or
1025 (I assume the logger doesn't get to log the exception):
Exception in thread "Thread-3" java.lang.NullPointerException
    at org.openscience.cdk.smsd.Isomorphism.makeBondMapOfAtomMap(Isomorphism.java:283)
    at org.openscience.cdk.smsd.Isomorphism.makeBondMapsOfAtomMaps(Isomorphism.java:267)
    at org.openscience.cdk.smsd.Isomorphism.mcsBuilder(Isomorphism.java:228)
    at org.openscience.cdk.smsd.Isomorphism.init(Isomorphism.java:590)
    at org.openscience.cdk.smsd.Isomorphism.init(Isomorphism.java:615)
There are similar issues with the other algorithms. Here is an example error
message from the VFLib algorithm:
Exception in thread "Thread-3" java.lang.IndexOutOfBoundsException: Index: 4, Size: 0
    at java.util.ArrayList.add(ArrayList.java:367)
    at org.openscience.cdk.smsd.algorithm.vflib.VFlibMCSHandler.setVFMappings(VFlibMCSHandler.java:341)
    at org.openscience.cdk.smsd.algorithm.vflib.VFlibMCSHandler.searchVFMCSMappings(VFlibMCSHandler.java:250)
    at org.openscience.cdk.smsd.algorithm.vflib.VFlibMCSHandler.searchMCS(VFlibMCSHandler.java:108)
    at org.openscience.cdk.smsd.Isomorphism.vfLibMCS(Isomorphism.java:403)
    at org.openscience.cdk.smsd.Isomorphism.vfLibMCSAlgorithm(Isomorphism.java:552)
    at org.openscience.cdk.smsd.Isomorphism.chooseAlgorithm(Isomorphism.java:325)
    at org.openscience.cdk.smsd.Isomorphism.mcsBuilder(Isomorphism.java:224)
    at org.openscience.cdk.smsd.Isomorphism.init(Isomorphism.java:590)
    at org.openscience.cdk.smsd.Isomorphism.init(Isomorphism.java:615)
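To tell a failed comparison apart from a genuine non-match, the matching call
could be wrapped so that every exception is recorded together with the key of
the molecule that caused it. A minimal sketch with no CDK dependency follows;
the class SafeMatching and its nested Matcher interface are hypothetical
stand-ins for whatever call performs the subgraph test, and the matching rule
in main is made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Runs a matcher over candidate keys, keeping hits and failures separate. */
class SafeMatching {

    /** Hypothetical stand-in for the actual subgraph test (e.g. a call into CDK). */
    interface Matcher {
        boolean matches(long key) throws Exception;
    }

    final List<Long> hits = new ArrayList<Long>();
    final Map<Long, Exception> failures = new LinkedHashMap<Long, Exception>();

    void run(List<Long> candidateKeys, Matcher matcher) {
        for (Long key : candidateKeys) {
            try {
                if (matcher.matches(key)) {
                    hits.add(key);
                }
            } catch (Exception e) { // also covers NullPointerException and IndexOutOfBoundsException
                failures.put(key, e);
            }
        }
    }

    public static void main(String[] args) {
        Matcher demo = new Matcher() {
            public boolean matches(long key) {
                if (key == 3L) {
                    throw new NullPointerException("simulated failure");
                }
                return key % 2 == 0; // made-up rule: even keys "match"
            }
        };
        SafeMatching result = new SafeMatching();
        result.run(Arrays.asList(1L, 2L, 3L, 4L), demo);
        System.out.println(result.hits);              // [2, 4]
        System.out.println(result.failures.keySet()); // [3]
    }
}
```

The class is not thread-safe, so with several worker threads each thread would
need its own instance. Hits plus failures plus non-matches should then always
add up to the number of screened candidates, which makes runs comparable.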
Maybe the structures (molecules) are read incorrectly? The molfiles (stored as
varchar in the DB) are read like this:
    // 'stream' wraps the molfile text taken from the varchar column
    MDLV2000Reader molReader = new MDLV2000Reader(stream);
    Molecule mol = (Molecule) molReader.read(new Molecule());
I've read about aromaticity detection. Does it have to be run here after
reading the molecule from the molfile? And how should implicit vs. explicit
hydrogens be handled when reading from a molfile?
Thanks in advance for any answers and comments.
Regards,
Thomas
_______________________________________________
Cdk-user mailing list
Cdk-user@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/cdk-user