Try using parentheses instead of square brackets. This turns the list
comprehensions into generator expressions
<https://wiki.python.org/moin/Generators>, which yield one item at a time
and so take up almost no memory themselves.
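To make the difference concrete, here is a small standard-library sketch (no RDKit involved; the variable names are made up for illustration). It also shows the main catch: a generator can only be read once, front to back, and cannot be sliced or indexed the way a list can.

```python
import sys
from itertools import islice

n = 100_000

squares_list = [i * i for i in range(n)]   # builds all n items up front
squares_gen = (i * i for i in range(n))    # builds nothing until iterated

list_bytes = sys.getsizeof(squares_list)   # size of the list's pointer array only
gen_bytes = sys.getsizeof(squares_gen)     # small, fixed-size generator object

# A generator has no length and does not support gen[:5];
# itertools.islice takes a prefix instead.
first_five = list(islice(squares_gen, 5))

# Reading consumes the generator: the next item continues where islice stopped.
next_item = next(squares_gen)

print(list_bytes, gen_bytes, first_five, next_item)
```

So anything that needs slicing or a second pass still has to be stored in a list (or similar) at that point.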
Haven’t tested it, but here’s how it would impact your code:
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit import DataStructs
import gzip
from rdkit.Chem import Draw
from rdkit.SimDivFilters import rdSimDivPickers
zims = (x for x in Chem.ForwardSDMolSupplier(gzip.open('a.sdf.gz')) if
x is not None)
zims_fps = (AllChem.GetMorganFingerprintAsBitVect(x, 2) for x in zims)
# Side note: appending to a list in a loop is really, really slow in Python
# dm = []
# for i, fp in enumerate(zims_fps[:26000]):  # only 1000 in the demo (in the interest of time)
#     dm.extend(DataStructs.BulkTanimotoSimilarity(fp, zims_fps[1+1:26000], returnDistance=True))
# dm = array(dm)
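# Try this (untested)
import numpy

# A generator cannot be sliced or indexed, so store the fingerprints once;
# the bit vectors are small compared to the molecules.
fps = list(zims_fps)
nfps = len(fps)

# Stream the distances straight into a numpy array instead of growing a
# Python list. Each fingerprint is compared against the ones before it,
# which (as far as I know) is the lower-triangle order Pick() expects.
dm = numpy.fromiter(
    (d for i in range(1, nfps)
       for d in DataStructs.BulkTanimotoSimilarity(fps[i], fps[:i], returnDistance=True)),
    dtype=float, count=nfps * (nfps - 1) // 2)

picker = rdSimDivPickers.MaxMinPicker()
ids = picker.Pick(dm, nfps, 200)
list(ids[:200])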
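For a sense of scale, here is a rough back-of-the-envelope sketch, assuming exactly 26,000 fingerprints and 64-bit CPython (both assumptions; the helper names are made up). With the 1+1 slice from the commented-out loop, every row contributes n - 2 distances, and each one costs a list pointer plus a float object. One triangle of the matrix stored as packed doubles is far smaller, roughly 2.7 GB.

```python
import sys

def condensed_size(n):
    """Number of distinct pairs among n items (one triangle, no diagonal)."""
    return n * (n - 1) // 2

def lower_triangle_index(i, j):
    """Offset of pair (i, j), i != j, when the triangle is stored row by row
    as (1,0), (2,0), (2,1), (3,0), ... (assumed layout)."""
    if i < j:
        i, j = j, i
    return i * (i - 1) // 2 + j

n = 26000                          # assumed pool size ("~26,000" in the thread)
pairs = condensed_size(n)
entries_with_typo = n * (n - 2)    # each row sliced as [1+1:26000] has n - 2 items

# Assumption: 8-byte list pointer plus one float object per entry.
bytes_per_list_entry = 8 + sys.getsizeof(1.0)
list_gb = entries_with_typo * bytes_per_list_entry / 1e9
packed_gb = pairs * 8 / 1e9        # same kind of data as packed 64-bit floats

print(f"{entries_with_typo:,} list entries, about {list_gb:.1f} GB")
print(f"{pairs:,} packed doubles, about {packed_gb:.1f} GB")
```

If those assumptions hold, the Python list alone would explain burning through 15 GB before the run was killed.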
On Wed, Jul 16, 2014 at 5:48 PM, Markus Sitzmann <markus.sitzm...@gmail.com>
wrote:
> Hi Matt,
>
> maybe squeeze these two lines
>
> zims = [x for x in Chem.ForwardSDMolSupplier(gzip.open('a.sdf.gz')) if
> x is not None]
>
> zims_fps=[AllChem.GetMorganFingerprintAsBitVect(x,2) for x in zims]
>
> into one:
>
> zims_fps=[AllChem.GetMorganFingerprintAsBitVect(x,2) for x in
> Chem.ForwardSDMolSupplier(gzip.open('a.sdf.gz')) if x is not None]
>
> because zims keeps the whole file in memory for no good reason :-)
> (is that sdf.gz big?)
>
> Markus
>
> On Thu, Jul 17, 2014 at 12:43 AM, Matthew Lardy <mla...@gmail.com> wrote:
> > Hi Igor,
> >
> > Thanks! Maybe I am a throwback, but I prefer the command line to a GUI.
> > Still I'll give it a whirl! :)
> >
> > If you are handling millions of molecules without issue; then my Python
> > skills are really, really, rusty. Or, I shouldn't be using Python to handle
> > this much data. :)
> >
> > Thanks for the info!
> > Matt
> >
> >
> > On Wed, Jul 16, 2014 at 3:31 PM, Igor Filippov <igor.v.filip...@gmail.com> wrote:
> >>
> >> Matthew,
> >>
> >> Two lines of shameless self-promotion:
> >> This is exactly the kind of problem for Diversity Genie -
> >> http://www.diversitygenie.com/
> >> It is using RDKit library underneath, but wraps it in a simple, easy to
> >> use GUI front-end.
> >>
> >> Best regards,
> >> Igor
> >>
> >>
> >> On Wed, Jul 16, 2014 at 6:18 PM, Matthew Lardy <mla...@gmail.com> wrote:
> >>>
> >>> Hi all,
> >>>
> >>> I have been playing with the diversity selection in RDKit. I am running
> >>> through a set of ~26,000 molecules to pick a set of 200 diverse molecules.
> >>> I saw some examples of how to do this in Python (my variant of their script
> >>> below), but the memory consumption is massive. I burned through ~15GB of
> >>> memory before I killed it off. Is this about what others have seen, or
> >>> should I move to doing this in C++ or Java (assuming that others have seen a
> >>> significantly lower level of memory consumption)?
> >>>
> >>> Here is the script:
> >>>
> >>> from rdkit import Chem
> >>> from rdkit.Chem import AllChem
> >>> from rdkit import DataStructs
> >>> import gzip
> >>> from rdkit.Chem import Draw
> >>> from rdkit.SimDivFilters import rdSimDivPickers
> >>>
> >>> zims = [x for x in Chem.ForwardSDMolSupplier(gzip.open('a.sdf.gz')) if x is not None]
> >>>
> >>> zims_fps=[AllChem.GetMorganFingerprintAsBitVect(x,2) for x in zims]
> >>>
> >>> dm=[]
> >>> for i,fp in enumerate(zims_fps[:26000]): # only 1000 in the demo (in the interest of time)
> >>>     dm.extend(DataStructs.BulkTanimotoSimilarity(fp,zims_fps[1+1:26000],returnDistance=True))
> >>> dm = array(dm)
> >>> picker = rdSimDivPickers.MaxMinPicker()
> >>> ids = picker.Pick(dm,26000,200)
> >>> list(ids[:200])
> >>>
> >>>
> >>> Thanks in advance!
> >>> Matt
> >>>
> >>>
> >>>
> >>> ------------------------------------------------------------------------------
> >>> Want fast and easy access to all the code in your enterprise? Index and
> >>> search up to 200,000 lines of code with a free copy of Black Duck
> >>> Code Sight - the same software that powers the world's largest code
> >>> search on Ohloh, the Black Duck Open Hub! Try it now.
> >>> http://p.sf.net/sfu/bds
> >>> _______________________________________________
> >>> Rdkit-discuss mailing list
> >>> Rdkit-discuss@lists.sourceforge.net
> >>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
> >>>