FWIW: I checked in a lazy bit vector picker this weekend. It still uses the MaxMin algorithm, but requires neither a Python callback function nor pre-computation of the distance matrix.
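For anyone unfamiliar with it, the greedy MaxMin scheme itself fits in a few lines of plain Python. This is an illustration of the algorithm only, not the new RDKit code; fingerprints are modeled here as sets of on-bit indices, and distances are computed lazily as needed rather than from a precomputed matrix:

```python
def tanimoto_distance(fp1, fp2):
    """1 - Tanimoto similarity between two sets of on-bit indices."""
    union = len(fp1 | fp2)
    return 1.0 - (len(fp1 & fp2) / union if union else 0.0)

def maxmin_pick(fps, num_to_pick, first=0):
    """Greedy MaxMin: repeatedly add the candidate whose minimum
    distance to the already-picked set is largest."""
    picked = [first]
    # minimum distance from each remaining candidate to the picked set
    min_dist = {i: tanimoto_distance(fps[i], fps[first])
                for i in range(len(fps)) if i != first}
    while len(picked) < num_to_pick:
        # candidate farthest from everything picked so far
        best = max(min_dist, key=min_dist.get)
        picked.append(best)
        del min_dist[best]
        for i in min_dist:
            d = tanimoto_distance(fps[i], fps[best])
            if d < min_dist[i]:
                min_dist[i] = d
    return picked

fps = [{0, 1, 2}, {0, 1, 2, 3}, {10, 11, 12}, {20, 21}]
print(maxmin_pick(fps, 3))  # → [0, 2, 3]
```

Note that only one row of distances per picked molecule is ever computed, which is why no O(n^2) matrix is needed.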
There's some sample code for using it (along with timing data) here:
http://rdkit.blogspot.ch/2014/08/optimizing-diversity-picking-in-rdkit.html

-greg

On Thu, Jul 17, 2014 at 11:53 PM, Matthew Lardy <mla...@gmail.com> wrote:

> Hi Dave,
>
> That's interesting, and I'll look into it. As I wrote Greg, I am an
> aficionado of the Gobbi & Lee method described here (since we are sharing
> our favourite methods):
>
> http://pubs.acs.org/doi/abs/10.1021/ci025554v
>
> While the set I am looking at at the moment contains only 26K molecules, I
> have significantly larger sets behind this one (thus my interest in getting
> control of the memory consumption).
>
> I'll give the BigPicker a try, as I continue searching for a lightweight
> diversity selection algorithm for ever-increasing data sets. :)
>
> Thanks Dave!
> Matt
>
> On Thu, Jul 17, 2014 at 12:59 AM, David Cosgrove <davidacosgrov...@gmail.com> wrote:
>
>> If you don't mind writing some extra code, we've had good success with a
>> Monte Carlo implementation of a maximin diversity picker called BigPicker,
>> described in Blomberg et al., JCAMD, 23, 513-525 (2009). With this
>> implementation, you only need to keep the subset distance matrix in memory.
>> At each step, one of the two molecules involved in the shortest subset
>> distance is swapped out, and a randomly chosen molecule from the pool
>> replaces it. The relevant row/column of the subset distance matrix is
>> updated and the new minimum inter-distance found. A Monte Carlo criterion
>> is used to decide whether to accept the swap. As the name suggests,
>> it can be used on very large datasets. Indeed, in our implementation we
>> allowed for the case where the subset was too large for the subset distance
>> matrix to be held in memory, and the minimum distance was calculated from
>> fingerprints on the fly at each step. That really was slow, but if
>> it's the only way of solving the problem...
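The swap scheme Dave describes can be sketched as follows. This is my reading of the description above, not the actual BigPicker code; for brevity it recomputes the minimum pair from scratch each step instead of updating a single row/column of the subset distance matrix:

```python
import math
import random

def tanimoto_distance(fp1, fp2):
    """1 - Tanimoto similarity; fingerprints as sets of on-bit indices."""
    union = len(fp1 | fp2)
    return 1.0 - (len(fp1 & fp2) / union if union else 0.0)

def mc_maximin(fps, subset_size, n_steps=1000, temperature=0.05, seed=42):
    """Monte Carlo maximin picker in the style described above."""
    rng = random.Random(seed)
    subset = rng.sample(range(len(fps)), subset_size)
    pool = [i for i in range(len(fps)) if i not in subset]

    def min_pair(sub):
        # shortest inter-distance in the subset, with its two endpoints
        best = None
        for a in range(len(sub)):
            for b in range(a + 1, len(sub)):
                d = tanimoto_distance(fps[sub[a]], fps[sub[b]])
                if best is None or d < best[0]:
                    best = (d, a, b)
        return best

    cur_d, a, b = min_pair(subset)
    for _ in range(n_steps):
        out_pos = rng.choice((a, b))       # one end of the closest pair...
        in_idx = rng.randrange(len(pool))  # ...swapped for a random pool member
        trial = subset[:]
        trial[out_pos], swapped_out = pool[in_idx], trial[out_pos]
        new_d, na, nb = min_pair(trial)
        # Metropolis-style criterion: always accept improvements, sometimes
        # accept worse moves to escape bad local minima
        if new_d >= cur_d or rng.random() < math.exp((new_d - cur_d) / temperature):
            subset, pool[in_idx] = trial, swapped_out
            cur_d, a, b = new_d, na, nb
    return subset, cur_d
```

Memory use is dominated by the subset, not the pool, which is the point of the method; in a real implementation the incremental row/column update makes each step far cheaper than the full `min_pair` scan shown here.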
>> It's worth recognising that this sort of algorithm spends a lot of time
>> mucking about improving the inter-distance in the 4th or 5th decimal
>> place. It's not clear that a subset with a minimum inter-distance of
>> 0.41568 is definitively better than one of 0.41567, so a fairly loose
>> convergence criterion is usually OK. In our experience, a larger number of
>> shorter runs, to avoid convergence on a bad local minimum, is more
>> reliable.
>>
>> Having said all that, I'd be inclined to agree with Greg that if you're
>> only picking 200 compounds from 26000 you're probably going to do just as
>> well with a pin. You could be slightly cleverer by only accepting the next
>> random selection if it's above a threshold distance from anything you've
>> already selected, to avoid the pathological case he describes.
>>
>> Dave
>>
>> On Thu, Jul 17, 2014 at 5:19 AM, Greg Landrum <greg.land...@gmail.com> wrote:
>>
>>> One other short thing.
>>> If this is the code you are using for the distance matrix:
>>>
>>> On Thu, Jul 17, 2014 at 12:18 AM, Matthew Lardy <mla...@gmail.com> wrote:
>>>
>>>> dm = []
>>>> for i, fp in enumerate(zims_fps[:26000]):  # only 1000 in the demo (in the interest of time)
>>>>     dm.extend(DataStructs.BulkTanimotoSimilarity(fp, zims_fps[1+1:26000], returnDistance=True))
>>>> dm = array(dm)
>>>
>>> then at least part of the problem is that you are generating the full
>>> matrix. I think you intend to have:
>>>
>>> dm.extend(DataStructs.BulkTanimotoSimilarity(fp, zims_fps[i+1:26000], returnDistance=True))
>>>
>>> in there.
>>>
>>> That typo was in the original notebook that you used; I'm going to have
>>> to figure out how to fix that.
>>>
>>> -greg
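To make Greg's fix concrete, here is the corrected upper-triangle loop in a self-contained, pure-Python form; `tanimoto_distance` is a hypothetical stand-in for `DataStructs.BulkTanimotoSimilarity(..., returnDistance=True)`, and fingerprints are modeled as sets of on-bit indices:

```python
def tanimoto_distance(fp1, fp2):
    """1 - Tanimoto similarity between two sets of on-bit indices."""
    union = len(fp1 | fp2)
    return 1.0 - (len(fp1 & fp2) / union if union else 0.0)

def upper_triangle_distances(fps):
    """For row i, store only distances to fps[i+1:], i.e. the upper
    triangle of the distance matrix, roughly halving memory use."""
    dm = []
    for i, fp in enumerate(fps):
        # note i+1, not 1+1: skip self-distances and the lower triangle
        dm.extend(tanimoto_distance(fp, other) for other in fps[i + 1:])
    return dm

fps = [{0, 1}, {0, 1, 2}, {5, 6}]
dm = upper_triangle_distances(fps)
print(len(dm))  # → 3, i.e. n*(n-1)/2 entries for n = 3
```

With the `1+1` typo, every row re-slices `fps[2:]`, so nearly the full matrix is materialized; the corrected slice is what makes the memory footprint n*(n-1)/2.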
------------------------------------------------------------------------------
_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss