Hi all,

I have been experimenting with diversity selection in RDKit: I want to
pick 200 diverse molecules from a set of ~26,000.  I found some examples
of how to do this in Python (my variant of one such script is below), but
the memory consumption is massive.  The process used ~15 GB of memory
before I killed it.  Is this roughly what others have seen, or should I
move to C++ or Java (assuming others have seen significantly lower memory
consumption there)?
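
For scale, here is a back-of-the-envelope estimate of what the full
pairwise distance matrix costs for 26,000 molecules.  This is only a
sketch: the helper name condensed_size is made up for illustration, and
the per-element sizes assume 64-bit CPython.

```python
import sys

def condensed_size(n):
    """Number of pairwise distances among n items (one triangle, no diagonal)."""
    return n * (n - 1) // 2

n = 26000
pairs = condensed_size(n)

# Assumption: 64-bit CPython, where each list slot is an 8-byte pointer
# and every float is a separate object of sys.getsizeof(1.0) bytes.
bytes_per_list_float = 8 + sys.getsizeof(1.0)
bytes_per_float64 = 8  # one packed double, e.g. in a NumPy float64 array

print(f"pairs: {pairs:,}")
print(f"as a Python list of floats: ~{pairs * bytes_per_list_float / 1e9:.1f} GB")
print(f"as a packed float64 array:  ~{pairs * bytes_per_float64 / 1e9:.1f} GB")
```

That is about 338 million distances, or roughly 2.7 GB even as packed
8-byte doubles, and several times that while the values sit in a Python
list.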

Here is the script:

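from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit import DataStructs
import gzip
from numpy import array
from rdkit.Chem import Draw
from rdkit.SimDivFilters import rdSimDivPickers

# Read the molecules, skipping any records RDKit could not parse.
zims = [x for x in Chem.ForwardSDMolSupplier(gzip.open('a.sdf.gz')) if x is not None]

# Morgan fingerprints (radius 2) as bit vectors.
zims_fps = [AllChem.GetMorganFingerprintAsBitVect(x, 2) for x in zims]

# Flattened lower triangle of the distance matrix: row i holds the
# distances from fingerprint i to fingerprints 0..i-1, which is the
# layout Pick() expects.  (The demo this is based on used only 1000
# molecules, in the interest of time.)
nfps = min(len(zims_fps), 26000)
dm = []
for i in range(1, nfps):
    dm.extend(DataStructs.BulkTanimotoSimilarity(zims_fps[i], zims_fps[:i], returnDistance=True))
dm = array(dm)

picker = rdSimDivPickers.MaxMinPicker()
ids = picker.Pick(dm, nfps, 200)
list(ids[:200])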
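
A possible way around building the matrix at all: MaxMin only needs, for
each candidate, the distance to its nearest already-picked molecule, so
distances can be computed on demand.  The pure-Python sketch below
illustrates that idea on toy fingerprints stored as sets of on-bit
indices.  The names tanimoto_distance and lazy_maxmin_pick and the fixed
first pick are assumptions for illustration, not RDKit API.  As far as I
can tell, RDKit's MaxMinPicker exposes the same idea through LazyPick and
LazyBitVectorPick, which would be the thing to try on the real data.

```python
def tanimoto_distance(a, b):
    """1 - Tanimoto similarity for fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return 1.0 - (len(a & b) / union if union else 1.0)

def lazy_maxmin_pick(fps, n_pick, first=0):
    """Greedy MaxMin selection without a precomputed pairwise matrix.

    Keeps one running value per item (distance to its nearest picked
    item), so memory is O(n) and time is O(n * n_pick).
    """
    n = len(fps)
    if n == 0:
        return []
    picked = [first]
    chosen = {first}
    min_dist = [tanimoto_distance(fp, fps[first]) for fp in fps]
    while len(picked) < min(n_pick, n):
        # Next pick: the unpicked item farthest from everything picked so far.
        nxt = max((i for i in range(n) if i not in chosen),
                  key=min_dist.__getitem__)
        picked.append(nxt)
        chosen.add(nxt)
        # Refresh each item's distance to its nearest picked item.
        for i in range(n):
            d = tanimoto_distance(fps[i], fps[nxt])
            if d < min_dist[i]:
                min_dist[i] = d
    return picked

# Toy data standing in for real fingerprints.
toy_fps = [{1, 2, 3}, {1, 2, 4}, {7, 8, 9}, {1, 2, 3, 4}, {10, 11}]
print(lazy_maxmin_pick(toy_fps, 3))
```

The trade-off is recomputing distances against each new pick instead of
looking them up, which for 200 picks out of 26,000 is about 5 million
distance evaluations rather than ~338 million stored values.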


Thanks in advance!
Matt
_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
