Hi Dave,
That's interesting, and I'll look into it. As I wrote Greg, I am an
aficionado of the Gobbi & Lee method described here (since we are sharing
our favourite methods):
http://pubs.acs.org/doi/abs/10.1021/ci025554v
While the set I am looking at the moment contains only 26K molecules, I
have significantly larger sets behind this one (thus my interest in getting
control of the memory consumption).
I'll give the BigPicker a try, as I continue searching for a light
diversity selection algorithm for ever increasing data sets. :)
Thanks Dave!
Matt
On Thu, Jul 17, 2014 at 12:59 AM, David Cosgrove <davidacosgrov...@gmail.com
> wrote:
> If you don't mind writing some extra code, we've had good success with a
> Monte Carlo implementation of a maximin diversity picker called BigPicker,
> described in Blomberg et al, JCAMD, 23, 513-525 (2009). With this
> implementation, you only need to keep the subset distance matrix in memory.
> At each step, one of the 2 molecules involved in the shortest subset
> distance is swapped out, and a randomly chosen molecule from the pool
> replaces it. The relevant row/column of the subset distance matrix is
> updated and the new minimum interdistance found. A Monte Carlo criterion
> is used to decide whether to accept the swap or not. As the name suggests,
> it can be used on very large datasets. Indeed, in our implementation we
> allowed for the case where the subset was too large for the subset distance
> matrix to be held in memory and the minimum distance was calculated from
> fingerprints on the the fly at each step. That really was slow, but if
> it's the only way of solving the problem... It's worth recognising that
> this sort of algorithm spends a lot of time mucking about improving the
> interdistance in the 4th or 5th decimal place. It's not clear that a subset
> with a minimum interdistance of 0.41567 is definitively better than one of
> 0.41568, so a fairly loose convergence criterion is usually ok. In our
> experience a larger number of shorter runs, to avoid convergence on a bad
> local minimum, is more reliable.
>
> Having said all that, I'd be inclined to agree with Greg that if you're
> only picking 200 compounds from 26000 you're probably going to do just as
> well with a pin. You could be slightly cleverer by only accepting the next
> random selection if it's above a threshold distance from anything you've
> already selected to avoid the pathological case he describes.
>
> Dave
>
>
>
> On Thu, Jul 17, 2014 at 5:19 AM, Greg Landrum <greg.land...@gmail.com>
> wrote:
>
>> one other short thing.
>> If this is the code you are using for the distance matrix:
>>
>> On Thu, Jul 17, 2014 at 12:18 AM, Matthew Lardy <mla...@gmail.com> wrote:
>>
>>>
>>> dm=[]
>>> for i,fp in enumerate(zims_fps[:26000]): # only 1000 in the demo (in
>>> the interest of time)
>>>
>>> dm.extend(DataStructs.BulkTanimotoSimilarity(fp,zims_fps[1+1:26000],returnDistance=True))
>>> dm = array(dm)
>>>
>>
>> Then at least part of the problem is that you are generating the full
>> matrix. I think you intend to have:
>>
>> dm.extend(DataStructs.BulkTanimotoSimilarity(fp,zims_fps[i+1:26000],returnDistance=True))
>> in there.
>>
>> That typo was in the original notebook that you used; I'm going to have
>> to figure out how to fix that.
>>
>> -greg
>>
>>
>>
>> ------------------------------------------------------------------------------
>> Want fast and easy access to all the code in your enterprise? Index and
>> search up to 200,000 lines of code with a free copy of Black Duck
>> Code Sight - the same software that powers the world's largest code
>> search on Ohloh, the Black Duck Open Hub! Try it now.
>> http://p.sf.net/sfu/bds
>> _______________________________________________
>> Rdkit-discuss mailing list
>> Rdkit-discuss@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>
>>
>
------------------------------------------------------------------------------
Want fast and easy access to all the code in your enterprise? Index and
search up to 200,000 lines of code with a free copy of Black Duck
Code Sight - the same software that powers the world's largest code
search on Ohloh, the Black Duck Open Hub! Try it now.
http://p.sf.net/sfu/bds
_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss