On Thu, Feb 12, 2009 at 2:55 PM, Andrew Dalke <da...@dalkescientific.com> wrote:
> On Feb 12, 2009, at 8:46 AM, Greg Landrum wrote:
>> I'm either not understanding completely or I disagree. The queries
>> were constructed by fragmenting the molecules I searched through, so
>> I'd expect lots of substructure hits (and a lower screen-out rate than
>> arbitrary queries against arbitrary molecules).
>
> Ahh, of course.
>
> But I don't think fingerprint screens give, say, 0.001% false rates.
> I think they are more in line with what you found. But if the bit
> distributions were really uncorrelated for molecules where one is
> not a substructure of the other, then I would expect extremely
> low false positive rates. 2048 bits should give a lot of
> discrimination power if the bits weren't correlated.
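To put a rough number on the "uncorrelated bits" argument, here is a minimal sketch. It assumes the usual screen (a molecule passes only if every bit set in the query is also set in the molecule) and assumes molecule bits are set independently with a fixed density. The density of 0.3, the query bit counts, and all function names are made up for illustration; none of this is RDKit code or measured data.

```python
import random

FP_SIZE = 2048  # fingerprint length mentioned in the thread


def passes_screen(query_fp: int, mol_fp: int) -> bool:
    """Screen test: every bit set in the query must also be set in the molecule."""
    return (query_fp & mol_fp) == query_fp


def random_fp(density: float, rng: random.Random) -> int:
    """Fingerprint whose bits are set independently with probability `density`."""
    fp = 0
    for bit in range(FP_SIZE):
        if rng.random() < density:
            fp |= 1 << bit
    return fp


def expected_pass_rate(n_query_bits: int, density: float) -> float:
    """Chance an unrelated molecule passes the screen if bits are independent."""
    return density ** n_query_bits


def simulated_pass_rate(n_query_bits: int, density: float,
                        trials: int, seed: int = 0) -> float:
    """Monte Carlo check of expected_pass_rate using random fingerprints."""
    rng = random.Random(seed)
    query = 0
    for bit in rng.sample(range(FP_SIZE), n_query_bits):
        query |= 1 << bit
    hits = sum(passes_screen(query, random_fp(density, rng))
               for _ in range(trials))
    return hits / trials


if __name__ == "__main__":
    # Hypothetical numbers: 30% bit density, queries setting 3 or 20 bits.
    print(expected_pass_rate(3, 0.3), simulated_pass_rate(3, 0.3, 500))
    print(expected_pass_rate(20, 0.3))
```

Under these assumptions the pass rate falls off geometrically with the number of query bits, so even a modest query would almost never pass by chance. Observed false positive rates well above that would point to correlated bits, which is what the proposed experiment would check.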
Agreed, the bit correlation experiment should be done.

>> That's a good idea to add to the list of things to look into. It's
>> also relatively easy to do because it probably just involves
>> increasing the minimum path length included in fingerprints (at least
>> as a first step).
>
> Again, I don't have experience with that, but it means
> that there's less ability to handle unlikely atom types.
> Yes, the larger subgraphs will include them. Don't know.

I suspect the less common atom types aren't a big concern, since the
larger subgraphs will include them and any subgraph isomorphism
involving them will go very quickly (most things will be screened
out in the atom-atom mapping phase).

>> Looking at MACCS is a good idea. I'll also put that on the list.
>
> Is this list on a wiki? ;)

Not yet, but I just put up the page for it:
http://code.google.com/p/rdkit/wiki/SubstructureSearchOptimization
Now I just need to populate it.

-greg