On Thu, Jul 17, 2014 at 1:58 AM, Matthew Lardy <mla...@gmail.com> wrote:
>
> It looks like the memory consumption (initially) drops. Still it gets out
> of control, likely after the file is read.
>
That is most likely because the distance matrix itself is huge.
Still, 26K molecules should be manageable. I did a post on clustering
(which also uses a distance matrix) a while ago that includes code for
generating the distance matrix as well as memory stats:
https://www.mail-archive.com/rdkit-discuss@lists.sourceforge.net/msg02927.html
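To put a rough number on "huge": the sketch below estimates the size of a condensed (lower-triangle) distance matrix. The function name and the 8 bytes per value (one double per distance) are assumptions for illustration only; the real footprint depends on how the matrix is stored, and a plain Python list of floats would take considerably more.

```python
def condensed_matrix_bytes(n_items, bytes_per_value=8):
    """Estimate the memory for the lower triangle of an N x N distance matrix.

    bytes_per_value=8 assumes one double per distance (an assumption, not a
    statement about how any particular library stores its matrix).
    """
    n_pairs = n_items * (n_items - 1) // 2  # number of unique i<j pairs
    return n_pairs * bytes_per_value


if __name__ == "__main__":
    n = 26000
    size = condensed_matrix_bytes(n)
    print(f"{n} molecules -> {n * (n - 1) // 2} distances, about {size / 1e9:.1f} GB")
```

Under those assumptions 26K molecules give about 338 million distances, or roughly 2.7 GB, and doubling the number of molecules quadruples that.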
Given that the size of the distance matrix scales as N^2, there's really no
way to avoid running into problems without switching to an approach that
does not require pre-computation of the distance matrix. The
MaxMinPicker.LazyPick() method (
http://www.rdkit.org/docs/api/rdkit.SimDivFilters.rdSimDivPickers.MaxMinPicker-class.html#LazyPick)
is supposed to help somewhat with this problem but, due to the internal
caching that it does, will not completely remove it. (This is probably
something I should look into.)
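To illustrate what "not pre-computing the distance matrix" can look like, here is a minimal pure-Python sketch of greedy MaxMin selection that calls a distance function on demand. This is not the RDKit implementation and says nothing about how LazyPick caches internally; the function name `maxmin_pick` and its arguments are hypothetical. It stores only one float per item, at the cost of roughly n_items * n_picks distance evaluations.

```python
import random


def maxmin_pick(n_items, n_picks, dist, seed=0):
    """Greedy MaxMin selection with distances computed on demand.

    dist(i, j) must return the distance between items i and j.
    Memory use is O(n_items); no N x N matrix is ever built.
    """
    if not 0 < n_picks <= n_items:
        raise ValueError("n_picks must be between 1 and n_items")
    rng = random.Random(seed)
    first = rng.randrange(n_items)  # random starting point
    picks = [first]
    # min_dist[i] = distance from item i to its nearest already-picked item
    min_dist = [dist(i, first) for i in range(n_items)]
    min_dist[first] = -1.0  # negative marks an item as already picked
    while len(picks) < n_picks:
        # next pick is the item farthest from everything picked so far
        nxt = max(range(n_items), key=min_dist.__getitem__)
        picks.append(nxt)
        min_dist[nxt] = -1.0
        for i in range(n_items):
            if min_dist[i] >= 0.0:
                d = dist(i, nxt)
                if d < min_dist[i]:
                    min_dist[i] = d
    return picks


if __name__ == "__main__":
    # toy data: points on a line, with a clump near zero
    values = [0, 1, 2, 50, 100]
    print(maxmin_pick(len(values), 3, lambda i, j: abs(values[i] - values[j])))
```

With fingerprints, `dist` would be something like one minus the Tanimoto similarity of fingerprints i and j, computed each time it is asked for.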
Note: for small subsets of large sets you can normally get quite a diverse
subset just by picking randomly. This breaks down if the large set is
pathological and includes one or more clumps that contain a large fraction
of the molecules.
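The random alternative needs no distances at all. A minimal sketch using only the standard library (the function name `random_subset` is hypothetical; the returned indices would be used to look up the molecules):

```python
import random


def random_subset(n_items, n_picks, seed=None):
    """Return n_picks distinct indices drawn uniformly from range(n_items)."""
    rng = random.Random(seed)
    return rng.sample(range(n_items), n_picks)


if __name__ == "__main__":
    # e.g. pick 100 of 26000 molecules by index
    print(random_subset(26000, 100, seed=42)[:5])
```

Because the draw is uniform over molecules, a clump holding most of the set will also supply most of the picks, which is the failure case described above.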
-greg
_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss