Hi Matt,

maybe squeeze these two lines

zims = [x for x in Chem.ForwardSDMolSupplier(gzip.open('a.sdf.gz')) if x is not None]
zims_fps=[AllChem.GetMorganFingerprintAsBitVect(x,2) for x in zims]

into one:

zims_fps=[AllChem.GetMorganFingerprintAsBitVect(x,2) for x in Chem.ForwardSDMolSupplier(gzip.open('a.sdf.gz')) if x is not None]

because zims keeps the whole file in memory for no good reason :-)
(is that sdf.gz big?)

Markus

On Thu, Jul 17, 2014 at 12:43 AM, Matthew Lardy <mla...@gmail.com> wrote:
> Hi Igor,
>
> Thanks! Maybe I am a throwback, but I prefer the command line to a GUI.
> Still I'll give it a whirl! :)
>
> If you are handling millions of molecules without issue; then my Python
> skills are really, really, rusty. Or, I shouldn't be using Python to handle
> this much data. :)
>
> Thanks for the info!
> Matt
>
>
> On Wed, Jul 16, 2014 at 3:31 PM, Igor Filippov <igor.v.filip...@gmail.com>
> wrote:
>>
>> Matthew,
>>
>> Two lines of shameless self-promotion:
>> This is exactly the kind of problem for Diversity Genie -
>> http://www.diversitygenie.com/
>> It is using RDKit library underneath, but wraps it in a simple, easy to
>> use GUI front-end.
>>
>> Best regards,
>> Igor
>>
>>
>> On Wed, Jul 16, 2014 at 6:18 PM, Matthew Lardy <mla...@gmail.com> wrote:
>>>
>>> Hi all,
>>>
>>> I have been playing with the diversity selection in RDKit. I am running
>>> through a set of ~26,000 molecules to pick a set of 200 diverse molecules.
>>> I saw some examples of how to do this in Python (my variant of their script
>>> below), but the memory consumption is massive. I burned through ~15GB of
>>> memory before I killed it off. Is this about what others have seen, or
>>> should I move to doing this in C++ or Java (assuming that others have seen a
>>> significantly lower level of memory consumption)?
>>>
>>> Here is the script:
>>>
>>> from rdkit import Chem
>>> from rdkit.Chem import AllChem
>>> from rdkit import DataStructs
>>> import gzip
>>> from rdkit.Chem import Draw
>>> from rdkit.SimDivFilters import rdSimDivPickers
>>>
>>> zims = [x for x in Chem.ForwardSDMolSupplier(gzip.open('a.sdf.gz')) if x is not None]
>>>
>>> zims_fps=[AllChem.GetMorganFingerprintAsBitVect(x,2) for x in zims]
>>>
>>> dm=[]
>>> for i,fp in enumerate(zims_fps[:26000]): # only 1000 in the demo (in the interest of time)
>>>     dm.extend(DataStructs.BulkTanimotoSimilarity(fp,zims_fps[1+1:26000],returnDistance=True))
>>> dm = array(dm)
>>> picker = rdSimDivPickers.MaxMinPicker()
>>> ids = picker.Pick(dm,26000,200)
>>> list(ids[:200])
>>>
>>>
>>> Thanks in advance!
>>> Matt
>>>
>>>
>>> ------------------------------------------------------------------------------
>>> Want fast and easy access to all the code in your enterprise? Index and
>>> search up to 200,000 lines of code with a free copy of Black Duck
>>> Code Sight - the same software that powers the world's largest code
>>> search on Ohloh, the Black Duck Open Hub! Try it now.
>>> http://p.sf.net/sfu/bds
>>> _______________________________________________
>>> Rdkit-discuss mailing list
>>> Rdkit-discuss@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
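
For a sense of scale, here is a rough back-of-the-envelope estimate of how large the `dm` list in the quoted script can get. It is only a sketch under stated assumptions: the byte sizes are typical values for 64-bit CPython rather than measurements, the pool is taken to be exactly 26,000 valid molecules, and the helper names (`condensed_entries`, `constant_slice_entries`, `gib`) are made up for this illustration. Note that the slice `zims_fps[1+1:26000]` in the quoted loop always starts at index 2, so every row is compared against nearly the whole pool; whether `i+1` was intended is a guess on my part.

```python
# Rough memory estimate for the pairwise-distance list in the quoted script.
# All byte sizes are assumptions for 64-bit CPython, not measurements.

N_MOLS = 26_000          # pool size mentioned in the thread (assumed exact)
FLOAT_OBJ_BYTES = 24     # assumed size of one Python float object
LIST_SLOT_BYTES = 8      # assumed size of one list slot (a pointer)
FLOAT64_BYTES = 8        # one packed double, as in a NumPy float64 array


def condensed_entries(n):
    """Number of unique pairs among n items (one triangle of the matrix)."""
    return n * (n - 1) // 2


def constant_slice_entries(n, start=2):
    """Entries appended when each of n rows is compared against fps[start:n]."""
    return n * (n - start)


def gib(n_bytes):
    """Convert a byte count to GiB."""
    return n_bytes / 2**30


triangle = condensed_entries(N_MOLS)
as_written = constant_slice_entries(N_MOLS)
per_entry = FLOAT_OBJ_BYTES + LIST_SLOT_BYTES

print(f"triangle entries:         {triangle:,}")
print(f"entries with fixed slice: {as_written:,}")
print(f"triangle as Python list:  {gib(triangle * per_entry):.1f} GiB")
print(f"fixed slice as list:      {gib(as_written * per_entry):.1f} GiB")
print(f"triangle as float64:      {gib(triangle * FLOAT64_BYTES):.1f} GiB")
```

Under these assumptions even the condensed triangle costs on the order of 10 GiB as a plain Python list of floats, and the fixed slice roughly doubles that, which is in the same range as the ~15GB reported above. The same triangle packed as 8-byte doubles would be about 2.5 GiB, so most of the cost comes from holding the distances as individual Python objects before the conversion to an array.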