On Wed, Nov 24, 2010 at 3:05 PM, Thomas Strunz <beginn...@hotmail.de> wrote:
> In general this works but memory consumption is huge in the simple sample I
> made.

Do you have any profiling results? Keeping lots of IAtomContainer objects in
memory can lead to high memory consumption; these objects are pretty
heavyweight.

> With my arbitrary query structure I get 1586 hits using Fingerprinter and
> 1582 using ExtendedFingerprinter, both with default settings. Not too bad but
> now I actually never compared if these 1582 are all contained in the 1586
> just assumed it which is naive of course. The issue is why is there a
> difference?

Well, the extended fp considers rings and the standard fp doesn't, so I
suppose the extended fingerprinter is being more specific. Actually,
discussions of benchmarking the accuracy of CDK fingerprints are ongoing.

> And how do I know the "correct" number of results?

The correct number of results is obtained by doing a subgraph isomorphism
directly, without any intervening fingerprint screen.

> As comparison
> I compared using commercial software available which found 1563 hits (much
> faster and with much less memory and also with displaying the structures, it
> really is pretty amazing. makes you wonder).

Did this do substructure directly? Or include a fingerprint screen? The CDK
UIT (UniversalIsomorphismTester) class is well known to (likely) be the
slowest subgraph isomorphism implementation around :) [in my tests it is more
than 40x slower than OpenBabel]. You should switch to the SMSD classes.

> Could be a stereochemistry
> issue. Do cdk fingerprints include stereochemistry? Does the
> UniversalIsomorphismTester consider it?

The CDK does not really support stereochem. Egon has done some initial work,
but in general substructure isomorphism does not consider stereochem.

> I read that the UniversalIsomorphismTester uses a slow Algorithm for
> subgraph matching and I then tried to use cdk-1.3.7 and the
> org.openscience.cdk.smsd.Isomorphism class but this does not seem to be
> ready yet? or I'm just doing it wrong probably is more likely.
> With the TurboSubstructure Algorithm I get inconsistent number of results
> because of NullPointerExceptions that sometimes happen and sometimes it runs
> through without issue but returning different number of hits like 1010 or
> 1025 (I assume logger doesn't get to log the exception)

Maybe Asad could comment.

> Maybe structures (Molecules) are read wrongly? Molfiles (stored as varchar
> in DB) are read:
>
> MDLV2000Reader molReader = new MDLV2000Reader(stream);
> Molecule mol = (Molecule) molReader.read((ChemObject) new Molecule());
>
> I've read about aromaticity detection. Must that be used here after reading
> the molecule from molfile?

I think you'd need to perform aromaticity perception. The Javadocs should
indicate whether it is required or not.

> What about implicit vs. explicit hydrogens and
> reading from Molfile?

If the mol file does not have explicit H's, they will be implicit. If you're
trying to match explicit H's, you'll need to convert the implicit H's to
explicit ones.

In general, it is difficult to see what's going on without some example code
and data.

--
Rajarshi Guha
NIH Chemical Genomics Center

_______________________________________________
Cdk-user mailing list
Cdk-user@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/cdk-user
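
To make the screening point above a bit more concrete, here is a minimal
sketch in plain Java (no CDK calls) of what a fingerprint screen does and how
the two hit lists could be compared. It assumes the fingerprints are available
as java.util.BitSet and that each hit is identified by a string ID. The class
and method names (ScreenSketch, passesScreen, onlyIn) and all sample data are
made up for illustration. A structure that passes the screen is only a
candidate; it still has to be confirmed by the subgraph isomorphism test.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper class, not part of the CDK.
final class ScreenSketch {

    /** Returns true if every bit set in the query is also set in the target. */
    static boolean passesScreen(BitSet query, BitSet target) {
        BitSet missing = (BitSet) query.clone(); // copy so the query is not modified
        missing.andNot(target);                  // keep only bits the target lacks
        return missing.isEmpty();
    }

    /** Returns the IDs that occur in a but not in b, in sorted order. */
    static Set<String> onlyIn(Set<String> a, Set<String> b) {
        Set<String> diff = new TreeSet<>(a);
        diff.removeAll(b);
        return diff;
    }

    public static void main(String[] args) {
        // Made-up fingerprints; the bit positions are arbitrary.
        BitSet query = new BitSet();
        query.set(3);
        query.set(17);
        BitSet target = new BitSet();
        target.set(3);
        target.set(17);
        target.set(42);
        System.out.println("candidate? " + passesScreen(query, target));

        // Made-up hit lists standing in for the results of two fingerprinters.
        Set<String> standardHits = new TreeSet<>(Arrays.asList("m1", "m2", "m3"));
        Set<String> extendedHits = new TreeSet<>(Arrays.asList("m2", "m3", "m4"));
        System.out.println("only in extended: " + onlyIn(extendedHits, standardHits));
    }
}
```

If onlyIn(extendedHits, standardHits) comes back non-empty for the real hit
lists, then the 1582 are not all contained in the 1586, and those structures
would be the first ones worth inspecting.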