Re: [Rdkit-discuss] MaxMin Picker and Python

Greg Landrum Wed, 16 Jul 2014 21:13:29 -0700

On Thu, Jul 17, 2014 at 1:58 AM, Matthew Lardy <mla...@gmail.com> wrote:


>
> It looks like the memory consumption (initially) drops.  Still it gets out
> of control, likely after the file is read.
>

That is, most likely, due to the fact that the distance matrix itself is huge.
Still, 26K molecules should be manageable. I did a post on clustering
(which also uses a distance matrix) a while ago that includes code for
generating the distance matrix as well as memory stats:
https://www.mail-archive.com/rdkit-discuss@lists.sourceforge.net/msg02927.html
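
To put a rough number on "huge", here is a back-of-the-envelope sketch. It
assumes the distances are held as 8-byte floats and that only the lower
triangle, N*(N-1)/2 values, is stored; the function name is made up for
illustration and is not part of the RDKit.

```python
def lower_triangle_bytes(n_mols, bytes_per_value=8):
    """Bytes needed to hold the lower triangle of an N x N distance matrix.

    Assumes one value per unordered pair of molecules and 8-byte floats
    by default; both are assumptions, not measured RDKit behaviour.
    """
    n_pairs = n_mols * (n_mols - 1) // 2
    return n_pairs * bytes_per_value


if __name__ == "__main__":
    for n in (1000, 26000, 100000):
        gib = lower_triangle_bytes(n) / 2**30
        print(f"{n:>7d} molecules: {gib:8.2f} GiB")
```

Under those assumptions the matrix alone for 26K molecules comes to about
2.5 GiB, before counting the fingerprints or any temporary copies.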

Given that the size of the distance matrix scales as N^2, there's really no
way to avoid running into problems without switching to an approach that
does not require pre-computation of the distance matrix. The
MaxMinPicker.LazyPick() method (
http://www.rdkit.org/docs/api/rdkit.SimDivFilters.rdSimDivPickers.MaxMinPicker-class.html#LazyPick)
is supposed to help somewhat with this problem but, due to the internal
caching that it does, will not completely remove it. (This is probably
something I should look into.)
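
To illustrate what "no pre-computed distance matrix" can look like, here is
a minimal pure-Python sketch of a MaxMin-style pick that calls a distance
function on demand and keeps only one value per pool member. This is not
the RDKit implementation and it does no caching of pairwise distances; the
function name, its arguments and the seed handling are assumptions made for
the example.

```python
import random


def lazy_maxmin_pick(dist_func, pool_size, pick_size, seed=42):
    """Pick `pick_size` indices from range(pool_size), MaxMin style.

    dist_func(i, j) must return a non-negative distance. Memory use is
    O(pool_size): for each item we keep only its distance to the nearest
    item picked so far, never the full matrix.
    """
    if not 0 < pick_size <= pool_size:
        raise ValueError("pick_size must be between 1 and pool_size")
    rng = random.Random(seed)
    first = rng.randrange(pool_size)
    picks = [first]
    # Distance from every item to its nearest pick (so far just `first`).
    min_dist = [dist_func(first, j) for j in range(pool_size)]
    min_dist[first] = -1.0  # negative marks "already picked"
    while len(picks) < pick_size:
        # Next pick: the unpicked item farthest from all current picks.
        nxt = max(range(pool_size), key=min_dist.__getitem__)
        picks.append(nxt)
        for j in range(pool_size):
            if min_dist[j] >= 0:
                d = dist_func(nxt, j)
                if d < min_dist[j]:
                    min_dist[j] = d
        min_dist[nxt] = -1.0
    return picks


if __name__ == "__main__":
    # Toy data: points on a line stand in for molecules.
    points = [0.0, 1.0, 2.0, 10.0, 11.0, 100.0]
    print(lazy_maxmin_pick(lambda i, j: abs(points[i] - points[j]),
                           len(points), 3))
```

The price is time instead of memory: roughly pick_size * pool_size distance
evaluations, which is fine for small picks from a large pool.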

Note: for small subsets of large sets you can normally get a quite diverse
subset by just randomly picking. This breaks down if the large set is
pathological and includes one or more clumps that contain a large fraction
of the molecules.
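
The random alternative needs nothing beyond the standard library. A small
sketch, assuming the molecules can be addressed by index (for example,
positions in the input file); the helper name is hypothetical.

```python
import random


def random_subset(pool_size, pick_size, seed=None):
    """Return `pick_size` distinct indices drawn uniformly from the pool."""
    rng = random.Random(seed)
    return rng.sample(range(pool_size), pick_size)


if __name__ == "__main__":
    # e.g. 100 picks out of 26000 molecules, with a fixed seed so the
    # selection can be reproduced
    print(random_subset(26000, 100, seed=2014)[:10])
```

No distances are computed at all here, so memory and run time stay
negligible; the caveat about clumpy data sets still applies.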

-greg


