I wonder why I haven't asked this before. How is this done on OrChem?

"These columns provide a quick way to materialize a basic CDK molecule to be passed into the VF2 algorithm. The data structures used are quite straightforward, for instance with data in column atom "C O" interpreted as: "atom 0 is Carbon, atom 1 is Oxygen" and bond column "0 1 D Y" then implying "there is a bond between C (atom 0) and O (atom 1) that is double (D) and aromatic is true (Y)". In this way, CDK molecules can be generated very fast without the need for calculating any properties during the search."
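To make the quoted column format concrete, here is a minimal sketch in plain Java of how I read it. It is not OrChem's code and uses no CDK classes. The class name CompactMolecule and its parse method are hypothetical, and I am assuming that several bonds are stored as consecutive groups of four whitespace-separated tokens; the quote only shows a single bond, and only the codes "D" and "Y" appear in it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the quoted atom/bond column format; not OrChem code.
final class CompactMolecule {

    /** One bond: two indices into the atom list, an order code and an aromatic flag. */
    static final class Bond {
        final int from;
        final int to;
        final char order;        // 'D' = double in the quoted example; other codes are not given there
        final boolean aromatic;  // 'Y' in the quoted example means aromatic

        Bond(int from, int to, char order, boolean aromatic) {
            this.from = from;
            this.to = to;
            this.order = order;
            this.aromatic = aromatic;
        }
    }

    final List<String> atoms = new ArrayList<>();
    final List<Bond> bonds = new ArrayList<>();

    /** Parses an atom column such as "C O" and a bond column such as "0 1 D Y". */
    static CompactMolecule parse(String atomColumn, String bondColumn) {
        CompactMolecule mol = new CompactMolecule();

        String atomText = atomColumn.trim();
        if (!atomText.isEmpty()) {
            for (String symbol : atomText.split("\\s+")) {
                mol.atoms.add(symbol); // the position in the list is the atom index
            }
        }

        String bondText = bondColumn.trim();
        if (!bondText.isEmpty()) {
            String[] tokens = bondText.split("\\s+");
            // Assumption: every bond takes exactly four tokens (from, to, order, aromatic).
            if (tokens.length % 4 != 0) {
                throw new IllegalArgumentException("bond column must hold groups of four tokens");
            }
            for (int i = 0; i < tokens.length; i += 4) {
                int from = Integer.parseInt(tokens[i]);
                int to = Integer.parseInt(tokens[i + 1]);
                if (from < 0 || to < 0 || from >= mol.atoms.size() || to >= mol.atoms.size()) {
                    throw new IllegalArgumentException("bond refers to an unknown atom index");
                }
                char order = tokens[i + 2].charAt(0);
                boolean aromatic = "Y".equals(tokens[i + 3]);
                mol.bonds.add(new Bond(from, to, order, aromatic));
            }
        }
        return mol;
    }
}
```

Calling CompactMolecule.parse("C O", "0 1 D Y") gives two atoms and one bond from atom 0 to atom 1 with order 'D' and the aromatic flag set. Nothing is perceived or calculated here, which is presumably the point of the quoted design: the graph is ready for matching as soon as the strings are split.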
Is this VF2 the turbo-substructure algorithm? Or a custom one? Do you create real CDK molecules, or just some kind of graph representation that you put into your custom VF2? Which properties do you need for VF2? Implicit hydrogens? Or is it enough to assign each atom its symbol ("C") and each bond an order?

I'm kind of confused about the Isomorphism/ExtAtomContainerManipulator classes. In the init method, if I choose not to remove hydrogens, the search takes a lot longer. But I have no explicit hydrogens! The ExtAtomContainerManipulator only seems to replace explicit hydrogens with an implicit hydrogen count, yet that speeds up the search even though I have no explicit hydrogens? I get the same number of hits in both cases.

I create molecules like this:

    private IMolecule createMolecule(Integer molId, MDLV2000Reader molReader) throws CDKException {
        Molecule mol = (Molecule) molReader.read((ChemObject) new Molecule(20, 20, 0, 0));
        mol.setID(molId.toString());
        AtomContainerManipulator.percieveAtomTypesAndConfigureUnsetProperties(mol);
        boolean isAromatic = CDKHueckelAromaticityDetector.detectAromaticity(mol);
        mol.setFlag(CDKConstants.ISAROMATIC, isAromatic);
        CDKHydrogenAdder hydrogenAdder = CDKHydrogenAdder.getInstance(mol.getBuilder());
        hydrogenAdder.addImplicitHydrogens(mol);
        return mol;
    }

So I should not need to do any configuration before subgraph searching, but apparently I do. It explains why UIT is faster, because the removeHydrogensExceptSingleAndPreserveAtomID method does a lot of stuff (it copies all atoms and bonds). Not adding hydrogens in the above code (commenting it out) has no effect.

    comparison.init(queryStructure, mol, true, false);

is a lot faster than

    comparison.init(queryStructure, mol, false, false);

Maybe MDLV2000Reader does something wrong while creating a molecule that is then fixed in removeHydrogensExceptSingleAndPreserveAtomID?
Regards,

Thomas

mol file example:

ZINC21972410
  CDK     0105111003

 17 17  0  0  0  0  0  0  0  0999 V2000
    2.5359    5.6483   -0.1176 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.5559    4.1229   -0.0002 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.2578    3.7205    1.2984 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.1205    3.5933    0.0107 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.1410    2.0865    0.0024 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.1994    1.4944   -0.0109 O   0  0  0  0  0  0  0  0  0  0  0  0
   -0.0169    1.3968    0.0097 N   0  0  0  0  0  0  0  0  0  0  0  0
    0.0021   -0.0041    0.0020 N   0  0  0  0  0  0  0  0  0  0  0  0
   -1.1558   -0.6938    0.0094 C   0  0  0  0  0  0  0  0  0  0  0  0
   -2.2169   -0.1003    0.0227 O   0  0  0  0  0  0  0  0  0  0  0  0
   -1.1358   -2.1698    0.0013 C   0  0  0  0  0  0  0  0  0  0  0  0
   -2.3346   -2.8981    0.0147 C   0  0  0  0  0  0  0  0  0  0  0  0
   -2.2643   -4.2759    0.0123 C   0  0  0  0  0  0  0  0  0  0  0  0
   -1.0172   -4.8934   -0.0033 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0967   -4.1779   -0.0212 N   0  0  0  0  0  0  0  0  0  0  0  0
    0.0758   -2.8614   -0.0198 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.9420   -6.2462   -0.0053 O   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0  0  0  0
  2  3  1  0  0  0  0
  2  4  1  0  0  0  0
  4  5  1  0  0  0  0
  5  6  2  0  0  0  0
  5  7  1  0  0  0  0
  7  8  1  0  0  0  0
  8  9  1  0  0  0  0
  9 10  2  0  0  0  0
  9 11  1  0  0  0  0
 11 16  2  0  0  0  0
 11 12  1  0  0  0  0
 12 13  2  0  0  0  0
 13 14  1  0  0  0  0
 14 17  2  0  0  0  0
 14 15  1  0  0  0  0
 15 16  1  0  0  0  0
M  END

> From: egon.willigha...@gmail.com
> Date: Thu, 24 Feb 2011 13:18:50 +0100
> Subject: Re: [Cdk-user] Substructure Searching, Fingerprints and cdk-1.3.7 Isomorphism Class
> To: beginn...@hotmail.de
> CC: steinb...@ebi.ac.uk; cdk-user@lists.sourceforge.net; jeliazkova.n...@gmail.com
>
> Hej Thomas,
>
> On Thu, Feb 24, 2011 at 11:30 AM, Thomas Strunz <beginn...@hotmail.de> wrote:
> > Problem:
> > Can't filter based on H Atoms because not all P and S can be "typed"
> > correctly and hence the CDKHydrogenAdder fails and H-Count for the total
> > molecule is wrong
>
> Good. That's something we can work on :)
>
> See my post of this morning:
>
> http://chem-bla-ics.blogspot.com/2011/02/adding-cdk-atom-type.html
>
> We only need to figure out the six properties for the missing atom
> type, and the unit test needs an example structure. Preferably from
> PubChem, as I can convert that programmatically into CDK code, see:
>
> http://chem-bla-ics.blogspot.com/2008/05/wicked-chemistry-and-unit-testing.html
>
> Grtz,
>
> Egon
>
> --
> Dr E.L. Willighagen
> Postdoctoral Researcher
> Institutet för miljömedicin
> Karolinska Institutet
> Homepage: http://egonw.github.com/
> LinkedIn: http://se.linkedin.com/in/egonw
> Blog: http://chem-bla-ics.blogspot.com/
> PubList: http://www.citeulike.org/user/egonw/tag/papers
_______________________________________________ Cdk-user mailing list Cdk-user@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/cdk-user