Re: [Cdk-user] Substructure Searching, Fingerprints and cdk-1.3.7 Isomorphism Class

Thomas Strunz Thu, 25 Nov 2010 23:42:02 -0800

Hi again,

Some further observations. Adjusting the reading code, as mentioned in my last
post, to


MDLV2000Reader molReader = new MDLV2000Reader(stream);
Molecule mol = (Molecule) molReader.read((ChemObject) new Molecule());
AtomContainerManipulator.percieveAtomTypesAndConfigureAtoms(mol);
CDKHueckelAromaticityDetector.detectAromaticity(mol);

makes the SMSD algorithms execute smoothly. So this step is important, although
I must admit I don't really know what exactly it changes within the Molecule.

The unexpected thing is that the code is slightly slower (in my not very
sophisticated test) with the TurboSubStructure algorithm than with the
UniversalIsomorphismTester. As far as I understand, that does not make much
sense unless the Isomorphism class takes a long time to instantiate.
I also changed the query structure to benzene, which returns over 30k hits.
Using UniversalIsomorphismTester takes about 50% of the time of
TurboSubStructure. I've read a blog post about the UniversalIsomorphismTester
being slow, but I can't confirm that the new implementation is faster; on the
contrary.
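
A rough way to time such a comparison is to run each variant a few times
unmeasured first, so that JIT compilation and class loading do not distort the
result, and only then measure. The following is an illustrative sketch, not the
measurement code behind the numbers above; `Timing` and `meanNanos` are made-up
names, and the tasks passed in would wrap the respective matcher calls.

```java
final class Timing {
    /**
     * Runs the task `warmup` times without measuring, then returns the
     * mean wall-clock time in nanoseconds over `runs` measured executions.
     */
    static long meanNanos(Runnable task, int warmup, int runs) {
        if (runs <= 0) {
            throw new IllegalArgumentException("runs must be positive");
        }
        for (int i = 0; i < warmup; i++) {
            task.run(); // warm-up: results are discarded
        }
        long start = System.nanoTime();
        for (int i = 0; i < runs; i++) {
            task.run();
        }
        return (System.nanoTime() - start) / runs;
    }
}
```

Creating the matcher inside one task and outside another would show whether
instantiation is the dominant cost.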

Code: (logging is disabled because it would cripple performance; the
Isomorphism version was copied from the JavaDoc example)

    public List<Integer> find(IMolecule queryStructure)
            throws CDKException, SQLException {

        logger.entry();
        Profiler profiler = new Profiler("Substructure Search");
        profiler.setLogger(logger);
        profiler.start("Setup");
        List<Integer> screeningHits = new ArrayList<Integer>();
        BitSet queryFP = manager.getFingerprinter().getFingerprint(queryStructure);
        profiler.start("Fingerprint Search");
        /* Filter based on fingerprints. The result should contain all matching
         * chemical structures and some false positives. The exact result
         * depends on the fingerprint creation method.
         *
         * Logically "and-ing" the target fingerprint with the query
         * fingerprint should result in the same fingerprint as the query
         * fingerprint:
         *      compared.and(queryFP);
         *      if (compared.equals(queryFP))...
         *
         * The IDs of the matching molecules are added to a list; the
         * molecules are then loaded from the database.
         */
        for (Fingerprint fp : manager.getFingerprints().values()) {

            Fingerprint compared = (Fingerprint) fp.clone();
            compared.and(queryFP);
            if (compared.matches(queryFP)) {
                screeningHits.add(fp.getMolId());
                logger.debug("Fingerprint Hit Found for molID: {}", fp.getMolId());
            }
        }
        // If the fingerprint comparison gives 0 hits, subgraph matching is
        // not needed, hence we can return from the method immediately.
        if (screeningHits.isEmpty()) {
            return screeningHits;
        }
        profiler.start("Get found hits");
        Map<Integer, IMolecule> similarMols =
                manager.getDataAccessLayer().getMolecules(screeningHits);

        profiler.start("Subgraph matching");
        IMolecule mol;
        List<Integer> hits = new ArrayList<Integer>();
        /*
         * This performs the actual substructure search. Because it is
         * compute-intensive, the initial screen is needed to limit the number
         * of iterations in this step.
         */
        //Turbo mode search
        //Bond Sensitive is set true
        Isomorphism comparison =
                new Isomorphism(Algorithm.TurboSubStructure, true);
        for (Integer molId : similarMols.keySet()) {

            mol = similarMols.get(molId);
            // set molecules, remove hydrogens, clean and configure molecule
            comparison.init(queryStructure, mol, true, true);
            // all three chemical filters are switched off
            comparison.setChemFilters(false, false, false);
            if (comparison.isSubgraph()) {
                hits.add(molId);
                logger.debug("Molecule with molID: {} contains queryStructure.", molId);
            }
//            if (UniversalIsomorphismTester.isSubgraph(mol, queryStructure)) {
//                hits.add(molId);
//                logger.debug("Molecule with molID: {} contains queryStructure.", molId);
//            }

        }
        profiler.stop().log();
        logger.exit(hits);
        return hits;
    }
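
To illustrate the screening step in isolation: a target can only contain the
query as a substructure if every bit set in the query fingerprint is also set
in the target fingerprint. Below is a minimal sketch using plain
`java.util.BitSet`; the class `FingerprintScreen` and the method `isCandidate`
are hypothetical names for this example and are neither part of the CDK nor
the `Fingerprint` class used above.

```java
import java.util.BitSet;

final class FingerprintScreen {
    /** Returns true if every bit set in the query is also set in the target. */
    static boolean isCandidate(BitSet query, BitSet target) {
        BitSet common = (BitSet) target.clone(); // leave the stored fingerprint untouched
        common.and(query);                       // keep only the bits shared with the query
        return common.equals(query);             // all query bits survived the AND
    }

    public static void main(String[] args) {
        BitSet query = new BitSet(8);
        query.set(1);
        query.set(4);

        BitSet target = new BitSet(8);
        target.set(1);
        target.set(4);
        target.set(6);

        System.out.println(isCandidate(query, target)); // true: {1,4} is a subset of {1,4,6}
        System.out.println(isCandidate(target, query)); // false: bit 6 is missing
    }
}
```

A candidate that passes this test may still be a false positive, which is why
the subgraph match has to follow.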

That is the general approach. I also have a version that runs multiple threads,
e.g. screening, reading from the database, and additional ones that do the
subgraph matching.
That way I get the first results before the search finishes, but the total
search time is only slightly shorter.
My observation is that CPU usage spikes during the screening phase, but during
the substructure search it never goes much above 50%, even with multiple
threads.
So is my code borked, or is the UniversalIsomorphismTester really faster in
this scenario? I know my test is not very sophisticated and I'm just using
random molecules I draw, but the behavior is consistent from small to big
result sets and for both ring and non-ring queries.
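
For illustration, distributing the subgraph matching over a fixed thread pool
could look roughly like the sketch below. It is not the multi-threaded version
described above: `ParallelMatch` and `match` are hypothetical names, and the
`Predicate` stands in for whichever matcher is used. Since the Isomorphism
object is configured per pair via `init(...)`, the sketch assumes the predicate
creates its own matcher per call or is otherwise safe to call concurrently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Predicate;

final class ParallelMatch {
    /** Returns the IDs of all candidates accepted by the matcher, in map iteration order. */
    static <T> List<Integer> match(Map<Integer, T> candidates,
                                   Predicate<T> containsQuery,
                                   int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Integer> ids = new ArrayList<>(candidates.keySet());
            List<Future<Boolean>> results = new ArrayList<>();
            for (Integer id : ids) {
                T mol = candidates.get(id);
                Callable<Boolean> task = () -> containsQuery.test(mol);
                results.add(pool.submit(task)); // one task per candidate
            }
            List<Integer> hits = new ArrayList<>();
            for (int i = 0; i < ids.size(); i++) {
                if (results.get(i).get()) {     // blocks until task i is done
                    hits.add(ids.get(i));
                }
            }
            return hits;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while matching", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("matcher failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

If the matcher synchronizes internally or shares state, adding threads would
not raise CPU usage, which would be one thing to check.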

Ideas?

I'm still puzzled about what the commercial product does. I looked at the
database it creates (Derby) and there is nothing special there: a blob for the
structure and some integer columns for the fingerprint.

