Hi all,

I thought for quite some time about whether I should submit the following
questions and comments, because I'm not an expert and I fear the text could get
so long that no one will actually read it.

What I'm trying to achieve is a simple substructure search using cdk. The
general idea is that the main work is done in this "search add-in", or whatever
you want to call it, and not in the database (as with cartridges). So you
could use it with any typical RDBMS. Performance matters to a certain extent,
but the goal is not to break records, which would not be possible anyway.

I now have a functional version, of course with severe limitations (memory
usage, stereochemistry, no proof of "finding the correct results" in a real
dataset), using cdk-1.2.7.
It creates fingerprints for all molecules in the configured database and stores
them in a new table for faster reloading. It supplies methods to add, remove or
update a fingerprint. Fingerprints are held in memory, which I believed should
not be an issue even for much larger datasets than intended (1024 bits * 1
million = 128 MB). But it probably is, since that calculation is too naive (it
counts only the raw bits).
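
To make the memory estimate concrete, here is a minimal sketch of one way to
keep the fingerprints close to the raw 128 bytes each: pack them all into a
single long[] instead of holding one BitSet object per molecule. The class
PackedFingerprintStore and its method names are hypothetical (they are not part
of cdk), the slot index is assumed to be mapped to the database key elsewhere,
and BitSet.toLongArray() assumes Java 7 or later.

```java
import java.util.Arrays;
import java.util.BitSet;

// Hypothetical helper, not part of cdk: all fingerprints live in one long[].
final class PackedFingerprintStore {
    private final int wordsPerFingerprint;
    private final long[] words;

    PackedFingerprintStore(int fingerprintBits, int capacity) {
        // 1024 bits -> 16 longs per fingerprint
        this.wordsPerFingerprint = (fingerprintBits + 63) / 64;
        this.words = new long[wordsPerFingerprint * capacity];
    }

    // Copies a fingerprint into the given slot; bits beyond the
    // configured fingerprint length are dropped.
    void put(int slot, BitSet fingerprint) {
        long[] src = fingerprint.toLongArray(); // trailing zero words are omitted
        int base = slot * wordsPerFingerprint;
        for (int w = 0; w < wordsPerFingerprint; w++) {
            words[base + w] = w < src.length ? src[w] : 0L;
        }
    }

    // Rebuilds a BitSet for one slot, e.g. for screening.
    BitSet get(int slot) {
        int base = slot * wordsPerFingerprint;
        return BitSet.valueOf(Arrays.copyOfRange(words, base, base + wordsPerFingerprint));
    }

    // Payload size in bytes, excluding the single array header.
    long payloadBytes() {
        return 8L * words.length;
    }
}
```

With 1024-bit fingerprints this comes to 128 bytes per molecule, so roughly
8 MB for about 65K structures. Whether fingerprint storage is really what
dominates the observed memory usage would have to be checked with a profiler.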

The substructure search screens using these fingerprints, then selects the
candidate molecules from the database (by their numeric unique key), tests each
for a subgraph match using UniversalIsomorphismTester, and returns the keys of
the matching molecules. (Maybe it should return the molecules directly, but
then it is unknown whether the structure viewer in use can deal with cdk
molecules.)
In general this works, but memory consumption is huge in the simple sample I
made. This sample uses Hsqldb and a subset of a subset of Zinc (small molecules
below an MW threshold which I don't recall now, around 300). In total it's
around 65K structures.
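
The screening step described above can be sketched as follows, assuming the
fingerprints are available as java.util.BitSet keyed by the numeric database
key. A molecule survives the screen only if every bit set in the query
fingerprint is also set in its own fingerprint; only the survivors then need
the expensive subgraph test. The class and method names are hypothetical and
only illustrate the idea.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the fingerprint pre-screen, not cdk code.
final class FingerprintScreen {

    // True if every bit set in query is also set in candidate.
    static boolean mayContain(BitSet query, BitSet candidate) {
        BitSet missing = (BitSet) query.clone();
        missing.andNot(candidate); // keeps only query bits the candidate lacks
        return missing.isEmpty();
    }

    // Returns the keys of all molecules that pass the screen.
    static List<Integer> screen(BitSet query, Map<Integer, BitSet> fingerprintsByKey) {
        List<Integer> candidates = new ArrayList<Integer>();
        for (Map.Entry<Integer, BitSet> e : fingerprintsByKey.entrySet()) {
            if (mayContain(query, e.getValue())) {
                candidates.add(e.getKey());
            }
        }
        return candidates;
    }
}
```

The screen can only produce false positives, never confirm a match, so the
final hit count should be determined by the subgraph test alone as long as the
fingerprints never rule out a true substructure match.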

With my arbitrary query structure I get 1586 hits using Fingerprinter and 1582
using ExtendedFingerprinter, both with default settings. Not too bad, but I
never actually checked whether these 1582 are all contained in the 1586; I just
assumed it, which is naive of course. The issue is: why is there a difference?
And how do I know the "correct" number of results? For comparison I ran the
same query in commercial software, which found 1563 hits (much faster, with
much less memory, and while also displaying the structures; it really is pretty
amazing and makes you wonder). It could be a stereochemistry issue. Do cdk
fingerprints include stereochemistry? Does the UniversalIsomorphismTester
consider it?
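
Checking whether one hit list is contained in the other is a plain set
difference on the returned keys. A minimal sketch, assuming both searches
return their hits as sets of the numeric database keys (HitSetDiff is a
hypothetical helper name):

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper for comparing two hit lists by database key.
final class HitSetDiff {

    // Keys present in a but not in b, in ascending order.
    static Set<Integer> onlyIn(Set<Integer> a, Set<Integer> b) {
        Set<Integer> diff = new TreeSet<Integer>(a);
        diff.removeAll(b);
        return diff;
    }
}
```

Calling onlyIn(extendedHits, plainHits) returns an empty set exactly when all
ExtendedFingerprinter hits are also among the Fingerprinter hits; the reverse
call lists the keys of the extra structures, which can then be inspected
individually.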

I read that the UniversalIsomorphismTester uses a slow algorithm for subgraph
matching, so I tried cdk-1.3.7 and the org.openscience.cdk.smsd.Isomorphism
class, but this does not seem to be ready yet? Or, probably more likely, I'm
just doing it wrong.
With the TurboSubstructure algorithm I get an inconsistent number of results:
sometimes a NullPointerException is thrown, and sometimes it runs through
without issue but returns a different number of hits, like 1010 or 1025 (I
assume the logger doesn't get to log the exception).

Exception in thread "Thread-3" java.lang.NullPointerException
        at org.openscience.cdk.smsd.Isomorphism.makeBondMapOfAtomMap(Isomorphism.java:283)
        at org.openscience.cdk.smsd.Isomorphism.makeBondMapsOfAtomMaps(Isomorphism.java:267)
        at org.openscience.cdk.smsd.Isomorphism.mcsBuilder(Isomorphism.java:228)
        at org.openscience.cdk.smsd.Isomorphism.init(Isomorphism.java:590)
        at org.openscience.cdk.smsd.Isomorphism.init(Isomorphism.java:615)

There are similar issues with the other algorithms. Here is an example error
message from the VfLib algorithm:

Exception in thread "Thread-3" java.lang.IndexOutOfBoundsException: Index: 4, Size: 0
        at java.util.ArrayList.add(ArrayList.java:367)
        at org.openscience.cdk.smsd.algorithm.vflib.VFlibMCSHandler.setVFMappings(VFlibMCSHandler.java:341)
        at org.openscience.cdk.smsd.algorithm.vflib.VFlibMCSHandler.searchVFMCSMappings(VFlibMCSHandler.java:250)
        at org.openscience.cdk.smsd.algorithm.vflib.VFlibMCSHandler.searchMCS(VFlibMCSHandler.java:108)
        at org.openscience.cdk.smsd.Isomorphism.vfLibMCS(Isomorphism.java:403)
        at org.openscience.cdk.smsd.Isomorphism.vfLibMCSAlgorithm(Isomorphism.java:552)
        at org.openscience.cdk.smsd.Isomorphism.chooseAlgorithm(Isomorphism.java:325)
        at org.openscience.cdk.smsd.Isomorphism.mcsBuilder(Isomorphism.java:224)
        at org.openscience.cdk.smsd.Isomorphism.init(Isomorphism.java:590)
        at org.openscience.cdk.smsd.Isomorphism.init(Isomorphism.java:615)
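
Both traces come from a worker thread ("Thread-3"), where an uncaught exception
ends the thread without the calling code noticing. One way to make every run
report its failures is to submit each candidate as a task and collect the
result through a Future, so that an exception thrown by the matcher is recorded
per key instead of silently changing the hit count. This is only a sketch: the
Matcher interface is a hypothetical stand-in for the real subgraph test, and
the class and field names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class CheckedMatching {

    // Keys that matched, and keys whose test threw an exception.
    static final class Outcome {
        final List<Integer> hits = new ArrayList<Integer>();
        final List<Integer> failed = new ArrayList<Integer>();
    }

    // Hypothetical stand-in for the real subgraph test on one candidate.
    interface Matcher {
        boolean matches(int key) throws Exception;
    }

    static Outcome run(List<Integer> keys, final Matcher matcher, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Boolean>> futures = new ArrayList<Future<Boolean>>();
            for (final Integer key : keys) {
                futures.add(pool.submit(new Callable<Boolean>() {
                    public Boolean call() throws Exception {
                        return matcher.matches(key);
                    }
                }));
            }
            Outcome out = new Outcome();
            for (int i = 0; i < keys.size(); i++) {
                try {
                    if (futures.get(i).get()) {
                        out.hits.add(keys.get(i));
                    }
                } catch (ExecutionException e) {
                    // The worker's exception surfaces here; e.getCause() is the original.
                    out.failed.add(keys.get(i));
                }
            }
            return out;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for results", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

The failed list gives the keys of the molecules that trigger the exceptions, so
each one can be re-run on its own in a single thread. If such a molecule then
matches without error, state shared between threads would be a plausible
suspect, though that is only a guess.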

Maybe the structures (Molecules) are read incorrectly? The molfiles (stored as
varchar in the DB) are read like this:

MDLV2000Reader molReader = new MDLV2000Reader(stream);
Molecule mol = (Molecule) molReader.read((ChemObject) new Molecule());

I've read about aromaticity detection. Must it be run here after reading the
molecule from the molfile? And how are implicit vs. explicit hydrogens handled
when reading from a molfile?

Thanks in advance for any answers and comments.

Regards,

Thomas



                                          