Re: [Cdk-user] Substructure Searching, Fingerprints and cdk-1.3.7 Isomorphism Class

Rajarshi Guha Wed, 24 Nov 2010 13:01:53 -0800

On Wed, Nov 24, 2010 at 3:05 PM, Thomas Strunz <beginn...@hotmail.de> wrote:


> In general this works but memory consumption is huge in the simple sample I
> made.

Do you have any profiling results? Keeping lots of IAtomContainer
objects in memory can lead to high memory consumption, since these
objects are pretty heavyweight.
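
One way to keep memory down is to handle one structure at a time and retain only the IDs of the hits rather than the molecule objects themselves. The following is a minimal sketch of that pattern, not CDK code: StreamingSearch, collectHitIds and the record type R are hypothetical names, and plain strings stand in for molecules.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

class StreamingSearch {
    /**
     * Walks the records one by one and keeps only the IDs of the hits.
     * The records themselves are not stored, so each one can be garbage
     * collected once the loop moves on (provided the source iterator
     * does not hold on to them either).
     */
    static <R> List<String> collectHitIds(Iterator<R> records,
                                          Predicate<R> isHit,
                                          Function<R, String> idOf) {
        List<String> ids = new ArrayList<>();
        while (records.hasNext()) {
            R record = records.next();
            if (isHit.test(record)) {
                ids.add(idOf.apply(record));
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        // Strings stand in for molecules read from the database.
        List<String> records = Arrays.asList("a1", "b2", "a3");
        List<String> hits = collectHitIds(records.iterator(),
                                          r -> r.startsWith("a"),
                                          r -> r);
        System.out.println(hits); // [a1, a3]
    }
}
```

In a real search the predicate would wrap the fingerprint screen and the isomorphism test, and the iterator would wrap the database cursor.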

> With my arbitrary query structure I get 1586 hits using Fingerprinter and
> 1582 using ExtendedFingerprinter, both with default settings. Not too bad but
> now I actually never compared if these 1582 are all contained in the 1586
> just assumed it which is naive of course. The Issue is why is there a
> difference?

Well, the extended fp considers rings and the standard fp doesn't, so
I suppose the extended fingerprinter is being more specific.
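
Whether the 1582 are really contained in the 1586 is easy to check once both hit lists are available as sets of IDs. A minimal sketch, assuming each hit can be identified by a string ID (HitSetComparison, onlyIn and the IDs below are made up for illustration):

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

class HitSetComparison {
    /** Returns the IDs that are in a but not in b, in sorted order. */
    static Set<String> onlyIn(Set<String> a, Set<String> b) {
        Set<String> diff = new TreeSet<>(a);
        diff.removeAll(b);
        return diff;
    }

    public static void main(String[] args) {
        // Hypothetical molecule IDs standing in for the real hit lists.
        Set<String> standardHits = new TreeSet<>(Arrays.asList("m1", "m2", "m3", "m4"));
        Set<String> extendedHits = new TreeSet<>(Arrays.asList("m1", "m2", "m5"));

        // Is every extended-fingerprint hit also a standard-fingerprint hit?
        System.out.println(standardHits.containsAll(extendedHits));   // false
        System.out.println(onlyIn(extendedHits, standardHits));       // [m5]
        System.out.println(onlyIn(standardHits, extendedHits));       // [m3, m4]
    }
}
```

Looking at the structures that show up in only one of the two lists is usually the quickest way to see what the fingerprints disagree on.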

Actually, discussion of benchmarking the accuracy of CDK fingerprints is ongoing.

> And how do I know the "correct" number of results?

The correct number of results is obtained by doing a subgraph
isomorphism directly, without any intervening fingerprint screen.
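
To make the role of the screen concrete: a fingerprint is just a bit set, and a target passes the screen when every bit set in the query fingerprint is also set in the target fingerprint. Passing is only a pre-filter, so each candidate still has to be confirmed by the isomorphism test. Below is a minimal sketch using plain java.util.BitSet with toy bit positions; FingerprintScreen and passesScreen are hypothetical names (as far as I recall, the CDK's FingerprinterTool.isSubset performs the same kind of check).

```java
import java.util.BitSet;

class FingerprintScreen {
    /** True if every bit set in query is also set in target. */
    static boolean passesScreen(BitSet target, BitSet query) {
        BitSet missing = (BitSet) query.clone();
        missing.andNot(target); // bits the query has but the target lacks
        return missing.isEmpty();
    }

    public static void main(String[] args) {
        // Toy fingerprints; real ones would come from a fingerprinter.
        BitSet query = new BitSet(16);
        query.set(1);
        query.set(5);

        BitSet candidate = new BitSet(16);
        candidate.set(1);
        candidate.set(5);
        candidate.set(9);

        BitSet rejected = new BitSet(16);
        rejected.set(1);
        rejected.set(9);

        System.out.println(passesScreen(candidate, query)); // true
        System.out.println(passesScreen(rejected, query));  // false
    }
}
```

A count taken after the screen alone can only be an upper bound on the true number of hits, and only if the fingerprint never drops a real match.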

> As comparison
> I compared using commercial software available which found 1563 hits (much
> faster and with much less memory and also with displaying the structures, it
> really is pretty amazing. makes you wonder).

Did this do substructure matching directly? Or did it include a
fingerprint screen? The CDK UniversalIsomorphismTester (UIT) class is
well known to (likely) be the slowest subgraph isomorphism
implementation around :) [in my tests it is > 40x slower than
OpenBabel]. You should switch to the SMSD classes.

> Could be a stereochemistry
> issue. Do cdk fingerprints include stereochemistry? Does the
> UniversalIsomorphismTester consider it?

The CDK does not really support stereochem - Egon has done some
initial work, but in general substructure isomorphism does not
consider stereochem.

> I read that the UniversalIsomorphismTester uses a slow Algorithm for
> subgraph matching and I then tried to use cdk-1.3.7 and the
> org.openscience.cdk.smsd.Isomorphism class but this does not seem to be
> ready yet? Or, more likely, I'm just doing it wrong.
> With the TurboSubstructure Algorithm I get inconsistent number of results
> because of NullPointerExceptions that sometimes happen and sometimes it runs
> through without issue but returning different number of hits like 1010 or
> 1025 ( I assume logger doesn't get to log the exception)

Maybe Asad could comment.

>
> Maybe structures (Molecules) are read wrongly? Molfiles (stored as varchar
> in DB) are read:
>
> MDLV2000Reader molReader = new MDLV2000Reader(stream);
> Molecule mol = (Molecule) molReader.read((ChemObject) new Molecule());
>
> I've read about aromaticity detection. Must that be used here after reading
> the molecule from molfile?

I think you'd need to perform aromaticity perception - the Javadocs
should indicate whether it is required or not.

> What about implicit vs. explicit hydrogens and
> reading from Molfile?

If the mol file does not have explicit H's, they will be implicit. If
you're trying to match explicit H's you'll need to convert the
implicit H's to explicit ones.

In general, it is difficult to see what's going on without some
example code and data.

-- 
Rajarshi Guha
NIH Chemical Genomics Center
