FWIW: I checked in a lazy bit vector picker this weekend. This still uses
the MaxMin algorithm, but does not require either a python callback
function or pre-computation of the distance matrix.
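To make the idea concrete: this is not the checked-in code, just a minimal pure-Python sketch of the MaxMin idea it uses, computing distances lazily per candidate so no NxN matrix is ever materialized. Fingerprints are represented here as sets of on-bit indices; all names are made up.

```python
def tanimoto_dist(a, b):
    """Tanimoto distance between two fingerprints given as sets of on-bits."""
    inter = len(a & b)
    union = len(a) + len(b) - inter
    return 1.0 - (inter / union if union else 0.0)

def maxmin_pick(fps, n_picks, seed_idx=0):
    """Greedily pick the point that maximizes its minimum distance
    to the already-picked set; O(n) distance evaluations per pick."""
    picks = [seed_idx]
    # min distance from each pool member to the current pick set
    min_d = [tanimoto_dist(fp, fps[seed_idx]) for fp in fps]
    while len(picks) < n_picks:
        best = max(range(len(fps)), key=lambda i: min_d[i])
        picks.append(best)
        # lazily refresh the running minimum against the new pick
        for i, fp in enumerate(fps):
            d = tanimoto_dist(fp, fps[best])
            if d < min_d[i]:
                min_d[i] = d
    return picks
```

Already-picked points keep a running minimum of 0 (distance to themselves), so they are never re-picked.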

There's some sample code for using it (along with timing data) here:
http://rdkit.blogspot.ch/2014/08/optimizing-diversity-picking-in-rdkit.html

-greg



On Thu, Jul 17, 2014 at 11:53 PM, Matthew Lardy <mla...@gmail.com> wrote:

> Hi Dave,
>
> That's interesting, and I'll look into it.  As I wrote Greg, I am an
> aficionado of the Gobbi & Lee method described here (since we are sharing
> our favourite methods):
>
> http://pubs.acs.org/doi/abs/10.1021/ci025554v
>
> While the set I am currently looking at contains only 26K molecules, I
> have significantly larger sets behind this one (thus my interest in getting
> control of the memory consumption).
>
> I'll give the BigPicker a try, as I continue searching for a lightweight
> diversity selection algorithm for ever-increasing data sets.  :)
>
> Thanks Dave!
> Matt
>
>
>
> On Thu, Jul 17, 2014 at 12:59 AM, David Cosgrove <
> davidacosgrov...@gmail.com> wrote:
>
>> If you don't mind writing some extra code, we've had good success with a
>> Monte Carlo implementation of a maximin diversity picker called BigPicker,
>> described in Blomberg et al, JCAMD, 23, 513-525 (2009).  With this
>> implementation, you only need to keep the subset distance matrix in memory.
>> At each step, one of the two molecules involved in the shortest subset
>> distance is swapped out, and a randomly chosen molecule from the pool
>> replaces it.  The relevant row/column of the subset distance matrix is
>> updated and the new minimum interdistance found.  A Monte Carlo criterion
>> is used to decide whether to accept the swap or not.  As the name suggests,
>> it can be used on very large datasets. Indeed, in our implementation we
>> allowed for the case where the subset was too large for the subset distance
>> matrix to be held in memory and the minimum distance was calculated from
>> fingerprints on the fly at each step.  That really was slow, but if
>> it's the only way of solving the problem...  It's worth recognising that
>> this sort of algorithm spends a lot of time mucking about improving the
>> interdistance in the 4th or 5th decimal place. It's not clear that a subset
>> with a minimum interdistance of 0.41568 is definitively better than one of
>> 0.41567, so a fairly loose convergence criterion is usually ok.  In our
>> experience a larger number of shorter runs, to avoid convergence on a bad
>> local minimum, is more reliable.
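The swap scheme described above can be sketched in a few lines of pure Python. This is illustrative only, not the BigPicker code: for brevity it rescans the whole subset for the shortest pair each step, whereas the real implementation updates only the affected row/column of the subset distance matrix. `dist` is any pairwise distance callable; the other names are made up.

```python
import math
import random

def mc_diversity_pick(dist, pool, subset_size, n_steps=500, temp=0.05, rng=None):
    """Monte Carlo maximin-style picker: only subset distances are ever
    computed, so the full pool-vs-pool matrix is never needed."""
    rng = rng or random.Random(0)
    subset = rng.sample(pool, subset_size)

    def min_pair(s):
        # shortest distance in the subset and one member involved in it
        best = (float("inf"), 0)
        for a in range(len(s)):
            for b in range(a + 1, len(s)):
                d = dist(s[a], s[b])
                if d < best[0]:
                    best = (d, a)
        return best

    cur_min, worst = min_pair(subset)
    for _ in range(n_steps):
        # swap out one member of the closest pair for a random pool molecule
        cand = subset[:]
        cand[worst] = rng.choice([m for m in pool if m not in subset])
        new_min, new_worst = min_pair(cand)
        delta = new_min - cur_min
        # Metropolis criterion: always accept improvements,
        # occasionally accept a worse minimum to escape local optima
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            subset, cur_min, worst = cand, new_min, new_worst
    return subset, cur_min
```

A loose convergence criterion, per the paragraph above, amounts to stopping once `cur_min` stops improving meaningfully rather than running `n_steps` to exhaustion.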
>>
>> Having said all that, I'd be inclined to agree with Greg that if you're
>> only picking 200 compounds from 26000 you're probably going to do just as
>> well with a pin. You could be slightly cleverer by only accepting the next
>> random selection if it's above a threshold distance from anything you've
>> already selected to avoid the pathological case he describes.
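The threshold-gated random selection just described might look like this; a hypothetical sketch, with `dist` again standing in for whatever distance measure is in use.

```python
import random

def random_pick_with_threshold(items, dist, n_picks, min_dist,
                               rng=None, max_tries=10000):
    """Randomly pick items, rejecting any candidate that falls within
    min_dist of something already selected."""
    rng = rng or random.Random(0)
    picks = []
    tries = 0
    while len(picks) < n_picks and tries < max_tries:
        tries += 1
        cand = rng.choice(items)
        if all(dist(cand, p) > min_dist for p in picks):
            picks.append(cand)
    return picks
```

The `max_tries` guard matters: if the threshold is set too high for the pool, an unguarded loop would never terminate.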
>>
>> Dave
>>
>>
>>
>> On Thu, Jul 17, 2014 at 5:19 AM, Greg Landrum <greg.land...@gmail.com>
>> wrote:
>>
>>> one other short thing.
>>> If this is the code you are using for the distance matrix:
>>>
>>> On Thu, Jul 17, 2014 at 12:18 AM, Matthew Lardy <mla...@gmail.com>
>>> wrote:
>>>
>>>>
>>>> dm=[]
>>>> for i,fp in enumerate(zims_fps[:26000]):  # only 1000 in the demo (in the interest of time)
>>>>     dm.extend(DataStructs.BulkTanimotoSimilarity(fp,zims_fps[1+1:26000],returnDistance=True))
>>>> dm = array(dm)
>>>>
>>>
>>> Then at least part of the problem is that you are generating the full
>>> matrix. I think you intend to have:
>>>
>>> dm.extend(DataStructs.BulkTanimotoSimilarity(fp,zims_fps[i+1:26000],returnDistance=True))
>>> in there.
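For reference, the `i+1` slice is what makes the loop build the condensed upper triangle — N*(N-1)/2 entries — rather than nearly the full N*N matrix. A toy sketch of the pattern, with a dummy bulk function standing in for `DataStructs.BulkTanimotoSimilarity`:

```python
def bulk_dist(fp, others):
    """Stand-in for DataStructs.BulkTanimotoSimilarity(..., returnDistance=True);
    uses a toy distance on plain numbers."""
    return [abs(fp - o) for o in others]

fps = [0, 1, 2, 3]  # pretend fingerprints
n = len(fps)
dm = []
for i, fp in enumerate(fps):
    dm.extend(bulk_dist(fp, fps[i + 1:n]))  # i+1, not 1+1

# condensed upper triangle: n*(n-1)/2 entries
assert len(dm) == n * (n - 1) // 2
```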
>>>
>>> That typo was in the original notebook that you used; I'm going to have
>>> to figure out how to fix that.
>>>
>>> -greg
>>>
>>>
>>>
>>> _______________________________________________
>>> Rdkit-discuss mailing list
>>> Rdkit-discuss@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>>
>>>
>>
>
