Hi Markus,

It looks like the memory consumption drops initially.  Still, it gets out
of control, most likely after the file has been read.
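
A rough estimate suggests why memory would climb after the file is read rather than during it. The sketch below assumes the distances are accumulated in a plain Python list, as in the script quoted further down, and assumes typical 64-bit CPython sizes (about 24 bytes per float object plus an 8-byte list slot). The helper names are made up for illustration.

```python
def n_pairs(n):
    """Number of unique unordered pairs among n items: n*(n-1)/2."""
    return n * (n - 1) // 2


def estimate_gib(n, bytes_per_entry):
    """Approximate size in GiB of a pairwise distance list for n items."""
    return n_pairs(n) * bytes_per_entry / 2**30


n = 26000
pairs = n_pairs(n)                  # 337,987,000 pairwise distances
as_list = estimate_gib(n, 24 + 8)   # float objects held in a list (assumed sizes)
as_doubles = estimate_gib(n, 8)     # the same values as packed 8-byte doubles

print(pairs)
print(round(as_list, 1), "GiB as a list of Python floats")
print(round(as_doubles, 1), "GiB as packed doubles")
```

Under those assumptions the list alone is on the order of 10 GiB before it is converted to anything else, which is in the same range as the ~15GB reported in the original message.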

Here is the file info:
-rw-rw-r--. 1 mlardy mlardy 1.6M Jul 16 16:40 a.sdf.gz

Looking into Patrick's suggestion, I got the first error:
NameError: name 'array' is not defined

Which I fixed by adding another import statement:
from array import array
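
One caution about that import: the standard-library array module is a different thing from NumPy's array function, which is what scripts of this style usually rely on (an assumption here, since the original import is not shown in the thread). A minimal sketch of the difference, using only the standard library and made-up values:

```python
from array import array  # standard-library typed array, not numpy.array

dm = [0.25, 0.5, 0.75]  # stand-in distances, made up for illustration

# The stdlib constructor expects a typecode first, so array(dm) is rejected.
try:
    array(dm)
    accepted = True
except TypeError:
    accepted = False

# With the 'd' typecode the values are stored as packed C doubles.
packed = array('d', dm)
print(accepted, packed.itemsize, list(packed))
```

If NumPy is what the script intended, `from numpy import array` would be the matching import.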

This generated another error:
TypeError: 'generator' object is unsubscriptable
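
That message typically means something is indexing or slicing a generator, for example a generator expression followed by a [:26000] slice like the one in the original script (a guess, since the code that raised it is not shown here). A small sketch with made-up data:

```python
from itertools import islice

values = (x * x for x in range(10))  # generator expression: lazy, no indexing

try:
    values[:3]                       # slicing a generator raises TypeError
    sliceable = True
except TypeError:
    sliceable = False

first_three = list(islice(values, 3))  # take the first items without indexing
rest = list(values)                    # or materialize what is left into a list
print(sliceable, first_three, rest[:2])
```

Either approach works: islice keeps things lazy, while list() trades memory for the ability to slice and index.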

Sorry that I am all thumbs with Python, but thank you for the help so far!
Matt
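
Since the growth seems to happen after the file is read, one way around it is to avoid materializing all pairwise distances and to compute them on demand while picking. The sketch below is a plain-Python MaxMin picker written for illustration only. It is not RDKit's implementation, and maxmin_pick, tanimoto_distance, and the toy fingerprints are hypothetical names and data. RDKit's MaxMinPicker also offers lazy variants (LazyPick and, in later releases, LazyBitVectorPick) built on the same idea, if the installed version provides them.

```python
def tanimoto_distance(a, b):
    """1 - Tanimoto similarity for fingerprints stored as sets of on-bits."""
    union = len(a | b)
    if union == 0:
        return 0.0
    return 1.0 - len(a & b) / union


def maxmin_pick(fps, n_pick, dist=tanimoto_distance, first=0):
    """Greedy MaxMin selection that keeps only O(n) distances in memory."""
    n_pick = min(n_pick, len(fps))
    if n_pick == 0:
        return []
    picked = [first]
    remaining = set(range(len(fps))) - {first}
    # min_dist[i] = distance from item i to its nearest already-picked item
    min_dist = [dist(fps[first], fp) for fp in fps]
    while len(picked) < n_pick:
        # Next pick: the unpicked item farthest from everything picked so far
        # (ties go to the lowest index so the result is deterministic).
        nxt = max(remaining, key=lambda i: (min_dist[i], -i))
        picked.append(nxt)
        remaining.discard(nxt)
        for i in remaining:
            d = dist(fps[nxt], fps[i])
            if d < min_dist[i]:
                min_dist[i] = d
    return picked


# Toy fingerprints as sets of on-bit indices (made up for illustration).
fps = [{1, 2, 3}, {1, 2, 4}, {7, 8, 9}, {1, 2, 3, 4}, {7, 8}]
print(maxmin_pick(fps, 3))
```

The trade-off is more distance evaluations at pick time in exchange for memory that grows with the number of molecules instead of the number of pairs.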


On Wed, Jul 16, 2014 at 3:48 PM, Markus Sitzmann <markus.sitzm...@gmail.com>
wrote:

> Hi Matt,
>
> maybe squeeze these two lines
>
> zims = [x for x in Chem.ForwardSDMolSupplier(gzip.open('a.sdf.gz')) if
> x is not None]
>
> zims_fps=[AllChem.GetMorganFingerprintAsBitVect(x,2) for x in zims]
>
> into one:
>
> zims_fps=[AllChem.GetMorganFingerprintAsBitVect(x,2) for x in
> Chem.ForwardSDMolSupplier(gzip.open('a.sdf.gz')) if x is not None]
>
> because zims keeps the whole file in memory for no good reason  :-)
> (is that sdf.gz big?)
>
> Markus
>
> On Thu, Jul 17, 2014 at 12:43 AM, Matthew Lardy <mla...@gmail.com> wrote:
> > Hi Igor,
> >
> > Thanks!  Maybe I am a throwback, but I prefer the command line to a GUI.
> > Still I'll give it a whirl!  :)
> >
> > If you are handling millions of molecules without issue, then my Python
> > skills are really, really rusty.  Or I shouldn't be using Python to
> > handle this much data.  :)
> >
> > Thanks for the info!
> > Matt
> >
> >
> > On Wed, Jul 16, 2014 at 3:31 PM, Igor Filippov <igor.v.filip...@gmail.com> wrote:
> >>
> >> Matthew,
> >>
> >> Two lines of shameless self-promotion:
> >> This is exactly the kind of problem for Diversity Genie -
> >> http://www.diversitygenie.com/
> >> It uses the RDKit library underneath, but wraps it in a simple,
> >> easy-to-use GUI front end.
> >>
> >> Best regards,
> >> Igor
> >>
> >>
> >> On Wed, Jul 16, 2014 at 6:18 PM, Matthew Lardy <mla...@gmail.com> wrote:
> >>>
> >>> Hi all,
> >>>
> >>> I have been playing with the diversity selection in RDKit.  I am running
> >>> through a set of ~26,000 molecules to pick a set of 200 diverse molecules.
> >>> I saw some examples of how to do this in Python (my variant of their script
> >>> below), but the memory consumption is massive.  I burned through ~15GB of
> >>> memory before I killed it off.  Is this about what others have seen, or
> >>> should I move to doing this in C++ or Java (assuming that others have seen a
> >>> significantly lower level of memory consumption)?
> >>>
> >>> Here is the script:
> >>>
> >>> from rdkit import Chem
> >>> from rdkit.Chem import AllChem
> >>> from rdkit import DataStructs
> >>> import gzip
> >>> from rdkit.Chem import Draw
> >>> from rdkit.SimDivFilters import rdSimDivPickers
> >>>
> >>> zims = [x for x in Chem.ForwardSDMolSupplier(gzip.open('a.sdf.gz')) if x is not None]
> >>>
> >>> zims_fps=[AllChem.GetMorganFingerprintAsBitVect(x,2) for x in zims]
> >>>
> >>> dm=[]
> >>> for i,fp in enumerate(zims_fps[:26000]):     # only 1000 in the demo (in the interest of time)
> >>>     # distances from fp to each later fingerprint (upper triangle only)
> >>>     dm.extend(DataStructs.BulkTanimotoSimilarity(fp,zims_fps[i+1:26000],returnDistance=True))
> >>> dm = array(dm)
> >>> picker = rdSimDivPickers.MaxMinPicker()
> >>> ids = picker.Pick(dm,26000,200)
> >>> list(ids[:200])
> >>>
> >>>
> >>> Thanks in advance!
> >>> Matt
> >>>
> >>>
> >>>
> ------------------------------------------------------------------------------
> >>> Want fast and easy access to all the code in your enterprise? Index and
> >>> search up to 200,000 lines of code with a free copy of Black Duck
> >>> Code Sight - the same software that powers the world's largest code
> >>> search on Ohloh, the Black Duck Open Hub! Try it now.
> >>> http://p.sf.net/sfu/bds
> >>> _______________________________________________
> >>> Rdkit-discuss mailing list
> >>> Rdkit-discuss@lists.sourceforge.net
> >>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
> >>>
> >>
> >
> >
> >
> >
>