Hi again,
here are some further observations. Changing the reading code, as mentioned in
my last post, to
MDLV2000Reader molReader = new MDLV2000Reader(stream);
Molecule mol = (Molecule) molReader.read((ChemObject) new Molecule());
AtomContainerManipulator.percieveAtomTypesAndConfigureAtoms(mol);
CDKHueckelAromaticityDetector.detectAromaticity(mol);
lets the SMSD algorithms run smoothly. So this step is important, although I
must admit I don't really know what exactly it changes within the Molecule.
The unexpected thing is that the code is slightly slower (in my not very
sophisticated test) with the TurboSubStructure algorithm than with the
UniversalIsomorphismTester. As far as I understand, that does not make much
sense unless the Isomorphism class takes a long time to instantiate.
I also changed the query structure to benzene, which returns over 30k hits.
With UniversalIsomorphismTester the search takes about 50% of the time it
takes with TurboSubStructure. I've read a blog post about the
UniversalIsomorphismTester being slow, but I can't confirm that the new
implementation is faster; on the contrary.
Code (logging is disabled because it would cripple performance; the Isomorphism
version was copied from the JavaDoc example):
public List<Integer> find(IMolecule queryStructure)
        throws CDKException, SQLException {
    logger.entry();
    Profiler profiler = new Profiler("Substructure Search");
    profiler.setLogger(logger);
    profiler.start("Setup");
    List<Integer> screeningHits = new ArrayList<Integer>();
    BitSet queryFP =
            manager.getFingerprinter().getFingerprint(queryStructure);
    profiler.start("Fingerprint Search");
    /*
     * Filter based on fingerprints. The result should contain all matching
     * chemical structures and some false positives. The exact result
     * depends on the fingerprint creation method.
     *
     * Logically "and-ing" the target fingerprint with the query fingerprint
     * should result in the same fingerprint as the query fingerprint:
     *     compared.and(queryFP);
     *     if (compared.equals(queryFP))...
     *
     * The matching molecules are added to a list and loaded from the
     * database.
     */
    for (Fingerprint fp : manager.getFingerprints().values()) {
        Fingerprint compared = (Fingerprint) fp.clone();
        compared.and(queryFP);
        if (compared.matches(queryFP)) {
            screeningHits.add(fp.getMolId());
            logger.debug("Fingerprint Hit Found for molID: {}",
                    fp.getMolId());
        }
    }
    // If the fingerprint comparison gives 0 hits, subgraph matching is not
    // needed, hence we can return from the method immediately.
    if (screeningHits.isEmpty()) {
        return screeningHits;
    }
    profiler.start("Get found hits");
    Map<Integer, IMolecule> similarMols =
            manager.getDataAccessLayer().getMolecules(screeningHits);
    profiler.start("Subgraph matching");
    IMolecule mol;
    List<Integer> hits = new ArrayList<Integer>();
    /*
     * This performs the actual substructure search. Because it is
     * compute-intensive, the initial screen is needed to limit the number
     * of iterations in this step.
     */
    // Turbo mode search, bond sensitive is set to true
    Isomorphism comparison =
            new Isomorphism(Algorithm.TurboSubStructure, true);
    for (Integer molId : similarMols.keySet()) {
        mol = similarMols.get(molId);
        // set molecules, remove hydrogens, clean and configure molecule
        comparison.init(queryStructure, mol, true, true);
        // chemical filters are all disabled (false, false, false)
        comparison.setChemFilters(false, false, false);
        if (comparison.isSubgraph()) {
            hits.add(molId);
            logger.debug("Molecule with molID: {} contains queryStructure.",
                    molId);
        }
        // UniversalIsomorphismTester variant, used for the comparison:
        // if (UniversalIsomorphismTester.isSubgraph(mol, queryStructure)) {
        //     hits.add(molId);
        //     logger.debug("Molecule with molID: {} contains queryStructure.",
        //             molId);
        // }
    }
    profiler.stop().log();
    logger.exit(hits);
    return hits;
}
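The fingerprint screen in the method above is a subset test on bit sets. A minimal sketch with plain java.util.BitSet, assuming a fingerprint is nothing more than a set of bit positions. The class FingerprintScreen and its method names are made up for this illustration; they are not part of CDK and not the Fingerprint class used above:

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.Map;

// Hypothetical helper, for illustration only.
class FingerprintScreen {

    /** True if every bit set in the query is also set in the target. */
    static boolean isCandidate(BitSet query, BitSet target) {
        // Work on a copy so the stored target fingerprint is not modified.
        BitSet masked = (BitSet) target.clone();
        masked.and(query);
        return masked.equals(query);
    }

    /** Returns the ids of all fingerprints that pass the screen. */
    static List<Integer> screen(BitSet query, Map<Integer, BitSet> fingerprints) {
        List<Integer> candidates = new ArrayList<Integer>();
        for (Map.Entry<Integer, BitSet> entry : fingerprints.entrySet()) {
            if (isCandidate(query, entry.getValue())) {
                candidates.add(entry.getKey());
            }
        }
        return candidates;
    }

    /** Convenience factory: a BitSet with the given positions set. */
    static BitSet bits(int... positions) {
        BitSet set = new BitSet();
        for (int position : positions) {
            set.set(position);
        }
        return set;
    }
}
```

A target that passes this test is only a candidate; it can still be a false positive, which is why the subgraph matching step is needed afterwards.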
The find() method above is the general approach. I also have a version that
runs multiple threads, e.g. one for screening, one for reading from the
database, and additional ones that do the subgraph matching.
That way I get the first results before the search finishes, but the total
search time stays about the same; it is only a bit faster.
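A minimal sketch of the subgraph-matching stage spread over a fixed thread pool, using only java.util.concurrent. ParallelMatch and the containsQuery predicate are placeholders for illustration; in the real code the predicate would call the matcher. Since Isomorphism is configured per comparison through init(), each task would presumably need its own instance rather than a shared one:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Predicate;

// Hypothetical helper, for illustration only.
class ParallelMatch {

    /** Runs containsQuery on every candidate in parallel and returns the ids that match. */
    static <T> List<Integer> match(Map<Integer, T> candidates,
                                   final Predicate<T> containsQuery,
                                   int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit one task per candidate, remembering which id it belongs to.
            Map<Integer, Future<Boolean>> pending =
                    new LinkedHashMap<Integer, Future<Boolean>>();
            for (Map.Entry<Integer, T> entry : candidates.entrySet()) {
                final T mol = entry.getValue();
                Callable<Boolean> task = () -> containsQuery.test(mol);
                pending.put(entry.getKey(), pool.submit(task));
            }
            // Collect results in submission order.
            List<Integer> hits = new ArrayList<Integer>();
            for (Map.Entry<Integer, Future<Boolean>> entry : pending.entrySet()) {
                if (entry.getValue().get()) {
                    hits.add(entry.getKey());
                }
            }
            return hits;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while matching", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("Matching task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

If CPU usage stays low with a setup like this, the worker threads are probably waiting on something shared (a lock, the database, or a single matcher instance) rather than computing.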
My observation is that CPU usage spikes during the screening phase, but during
the substructure search it never goes much above 50%, even with multiple
threads. So is my code borked, or is the UniversalIsomorphismTester really
faster in this scenario? I know my test is not very sophisticated and only uses
random molecules that I drew, but the behavior is consistent from small to big
result sets and for ring and non-ring queries.
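One way to make such a comparison less sensitive to JIT warm-up and one-off costs is to run each variant a few times untimed and then take the median of several timed runs. A minimal sketch; MatcherTiming and medianNanos are hypothetical names, and the Runnable would wrap either the Isomorphism loop or the UniversalIsomorphismTester loop:

```java
import java.util.Arrays;

// Hypothetical helper, for illustration only.
class MatcherTiming {

    /**
     * Runs the task 'warmups' times without timing it, then 'runs' times
     * with timing, and returns the median duration in nanoseconds.
     */
    static long medianNanos(Runnable task, int warmups, int runs) {
        if (runs <= 0) {
            throw new IllegalArgumentException("runs must be positive");
        }
        for (int i = 0; i < warmups; i++) {
            task.run();
        }
        long[] samples = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            samples[i] = System.nanoTime() - start;
        }
        Arrays.sort(samples);
        return samples[runs / 2];
    }
}
```

Both variants should be measured on the same pre-loaded list of molecules, so that database access and fingerprint screening are not part of the timed section.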
Ideas?
I'm still puzzled about what the commercial product does. I looked at the
database it creates (Derby) and there is nothing special there: a blob for the
structure and some integer columns for the fingerprint.