Re: [Rdkit-discuss] Memory management during conformer generation
Many thanks for your replies.

Triggering gc doesn't seem to help (though the object count goes down), but reducing how much is processed at a time does. I didn't actually have to go down to the molecule-at-a-time level; I made do with inputs half the previous size. The RAM still fills up after a few files, but then it stays that way without overflowing the swap and crashing things.

Cheers,
Adam

On 27-Jun-15 16:54, Dmitri Maziuk wrote:
> On 6/27/2015 5:45 AM, Greg Landrum wrote:
>> ... That's assuming that the futures code (which I don't know) isn't doing anything odd behind the scenes to hold onto references.
>
> Or every mol in supplier holds a pointer to a C++ DLL that the Python VM doesn't quite know how to garbage-collect, which keeps a still-referenced object inside the job, which means jobs=[] creates a new array without deleting the old jobs. Who knows.
>
> The first answer was the right one: process one molecule at a time. Even better, split the input file into one-per-molecule, then use ec2 or condor or osg to run your one-at-a-time script on all of them at once.
>
> Dima

___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
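
The fix described above, handing the pool less work at a time, could look roughly like the sketch below. It is only an illustration under assumptions: `chunks`, `process_in_chunks`, `worker`, `handle` and `chunk_size` are hypothetical names that do not appear in the original script, and `square` merely stands in for the per-molecule conformer job.

```python
from concurrent import futures
from itertools import islice


def chunks(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    it = iter(iterable)
    while True:
        block = list(islice(it, size))
        if not block:
            return
        yield block


def process_in_chunks(executor, worker, items, handle, chunk_size=100):
    """Submit `items` to `executor` one block of `chunk_size` at a time.

    Every result of a block is passed to `handle` (for example, written to
    disk) and the block's futures are dropped before the next block is
    submitted, so the parent holds at most `chunk_size` results at once.
    """
    handled = 0
    for block in chunks(items, chunk_size):
        jobs = [executor.submit(worker, item) for item in block]
        for job in jobs:
            handle(job.result())
            handled += 1
        del jobs  # release this block's futures and their results
    return handled


def square(x):
    # Hypothetical stand-in for an expensive per-molecule job.
    return x * x


if __name__ == "__main__":
    results = []
    # A thread pool keeps the demo simple; ProcessPoolExecutor offers the
    # same submit()/result() interface.
    with futures.ThreadPoolExecutor(max_workers=4) as pool:
        process_in_chunks(pool, square, range(10), results.append, chunk_size=3)
    print(results)
```

The chunk size trades throughput against peak memory: a larger block keeps the workers busier, a smaller one bounds how many finished molecules wait in the parent process.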
Re: [Rdkit-discuss] Memory management during conformer generation
On 6/26/2015 9:48 AM, az wrote:
> Thanks Jean-Paul
>
> You're right that I eat up a lot of memory with large files, but I think it's not the whole story. If it were, my memory should come back each time a new file is being read (jobs=[]), no?

No. It's a feature of garbage collection: your memory may come back anytime between then and program exit.

Dima
Re: [Rdkit-discuss] Memory management during conformer generation
On Jun 27, 2015, at 6:05 AM, Dmitri Maziuk dmaz...@bmrb.wisc.edu wrote:
> On 6/26/2015 9:48 AM, az wrote:
>> Thanks Jean-Paul
>>
>> You're right that I eat up a lot of memory with large files, but I think it's not the whole story. If it were, my memory should come back each time a new file is being read (jobs=[]), no?
>
> No. It's a feature of garbage collection: your memory may come back anytime between then and program exit.

One could always trigger garbage collection manually.

https://docs.python.org/2/library/gc.html#gc.collect

And use gc.get_count() to make sure the count of objects went down with the collection.

-David
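
A minimal sketch of the manual collection suggested above, using only the standard `gc` module. `Node`, `make_cycles` and `collect_and_report` are hypothetical helpers written for this illustration. One caveat worth stating: `gc.get_count()` returns a tuple of three per-generation collection counters rather than a total number of objects, so this sketch uses `len(gc.get_objects())` as an approximate count of tracked objects instead.

```python
import gc


class Node(object):
    """Tiny object that can take part in a reference cycle."""

    def __init__(self):
        self.other = None


def make_cycles(n):
    """Create n two-object cycles and drop every outside reference to them."""
    for _ in range(n):
        a, b = Node(), Node()
        a.other, b.other = b, a  # a and b keep each other alive


def collect_and_report():
    """Run a full collection.

    Returns (unreachable objects found, tracked before, tracked after).
    """
    before = len(gc.get_objects())
    found = gc.collect()
    after = len(gc.get_objects())
    return found, before, after


if __name__ == "__main__":
    gc.collect()   # start from a clean state
    gc.disable()   # keep the automatic collector out of the demo
    make_cycles(1000)
    found, before, after = collect_and_report()
    gc.enable()
    print(found, before, after)
```

Note that the cycle collector only matters for objects caught in reference cycles; anything still referenced from live code (for instance by a pending job) is not reclaimed by `gc.collect()`.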
Re: [Rdkit-discuss] Memory management during conformer generation
I apologize that I haven't had a chance to look at this in detail yet, but I can at least give a quick answer to the below:

Python uses a deterministic scheme for doing garbage collection based on reference counting, so memory should be freed as soon as you do jobs=[]. That's assuming that the futures code (which I don't know) isn't doing anything odd behind the scenes to hold onto references.

-greg

On Friday, June 26, 2015, az adam.zalew...@mail.com wrote:
> Thanks Jean-Paul
>
> You're right that I eat up a lot of memory with large files, but I think it's not the whole story. If it were, my memory should come back each time a new file is being read (jobs=[]), no?
>
> Instead I hit my limit after 8-10 very similar input files, even though the usage after 2-3 is around 1/3 of my RAM.
>
> Cheers,
> Adam
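
The reference-counting behaviour described above can be checked with a weak reference, which observes an object without keeping it alive. This is a small sketch that assumes CPython (other interpreters may free objects later); `Payload` and both function names are hypothetical and stand in for the job results held in `jobs`.

```python
import weakref


class Payload(object):
    """Stand-in for a large result object (hypothetical)."""


def is_freed_after_rebinding():
    """Return (alive before rebinding, freed after rebinding)."""
    jobs = [Payload()]
    probe = weakref.ref(jobs[0])  # does not keep the object alive
    alive_before = probe() is not None
    jobs = []                     # the last strong reference goes away here
    return alive_before, probe() is None


def is_freed_with_extra_reference():
    """Same experiment, but something else still holds the object."""
    jobs = [Payload()]
    keeper = jobs[0]              # e.g. a cache or a pending future
    probe = weakref.ref(keeper)
    jobs = []
    return probe() is None


if __name__ == "__main__":
    print(is_freed_after_rebinding())
    print(is_freed_with_extra_reference())
```

The second function illustrates the caveat in the message: `jobs=[]` frees nothing if any other reference to the objects survives. Separately, memory released by Python objects is not necessarily handed back to the operating system right away, so process-level RAM figures can stay high even after objects are freed.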
Re: [Rdkit-discuss] Memory management during conformer generation
On 6/27/2015 5:45 AM, Greg Landrum wrote:
> ... That's assuming that the futures code (which I don't know) isn't doing anything odd behind the scenes to hold onto references.

Or every mol in supplier holds a pointer to a C++ DLL that the Python VM doesn't quite know how to garbage-collect, which keeps a still-referenced object inside the job, which means jobs=[] creates a new array without deleting the old jobs. Who knows.

The first answer was the right one: process one molecule at a time. Even better, split the input file into one-per-molecule, then use ec2 or condor or osg to run your one-at-a-time script on all of them at once.

Dima
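
Splitting the input into one file per molecule, as suggested above, might be sketched like this. The function name, the `prefix` argument and the output naming scheme are assumptions made for the illustration; it also assumes one molecule per line with no title line, which matches the `titleLine=False` setting in the original script.

```python
import os


def split_smiles_file(path, out_dir, prefix="mol"):
    """Write each non-empty line of `path` to its own file in `out_dir`.

    Returns the list of files written, in input order.
    """
    if not os.path.isdir(out_dir):
        os.makedirs(out_dir)
    written = []
    with open(path) as src:
        for line in src:
            if not line.strip():
                continue  # skip blank lines
            name = "%s_%06d.smiles" % (prefix, len(written))
            target = os.path.join(out_dir, name)
            with open(target, "w") as dst:
                # Make sure every output file ends with a newline.
                dst.write(line if line.endswith("\n") else line + "\n")
            written.append(target)
    return written
```

Each resulting file can then be handed to a separate run of the one-at-a-time script, so every process starts with a fresh memory footprint and exits when its molecule is done.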
Re: [Rdkit-discuss] Memory management during conformer generation
Thanks Jean-Paul

You're right that I eat up a lot of memory with large files, but I think it's not the whole story. If it were, my memory should come back each time a new file is being read (jobs=[]), no?

Instead I hit my limit after 8-10 very similar input files, even though the usage after 2-3 is around 1/3 of my RAM.

Cheers,
Adam

On 24-Jun-15 17:38, JP wrote:
> Isn't the problem here that you are keeping an array (jobs) and you keep adding molecules to it, never letting the garbage collector collect/clear any memory? If your file has a million molecules, you will have an array of a million molecules in memory...
>
> Why don't you process each single molecule (set name / remove similar confs etc. / remove high energy stuff), write it to file and release it, in the "if mol:" clause?
>
> Cheers
> JP
> -
> Jean-Paul Ebejer
> Early Stage Researcher
[Rdkit-discuss] Memory management during conformer generation
Hi

Using the cookbook code as a basis (apologies if I should have posted in the corresponding topic), I've put together a script to generate conformers for my SMILES library. It works like a charm too, aside from the fact that after 10-20 hours I'm out of RAM and swap (the memory consumption seems to accumulate with each iteration). I'd appreciate any hints for getting this resolved (any other ones as well).

Thanks a lot,
Adam

The code:

import datetime
import glob
import os
import sys
from concurrent import futures

from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem.PropertyMol import PropertyMol

max_workers = 16

def generateconformations(m, n, name=''):
    m = Chem.AddHs(m)
    ids = AllChem.EmbedMultipleConfs(m, numConfs=n, pruneRmsThresh=0.5, randomSeed=1)
    etable = []  ## Gathers conformer energies
    for id in ids:
        ff = AllChem.UFFGetMoleculeForceField(m, confId=id)
        ff.Minimize()
        etable.append(ff.CalcEnergy())
    return PropertyMol(m), list(ids), etable, name

input_dir, output_dir = sys.argv[1:3]
n = 75  ## Conformer number
os.chdir(input_dir)
for ifile in glob.glob('*.s*'):
    raw_file = open(ifile, 'r')  ## To get back molecule name later on
    ofile = os.path.join(output_dir, 'conf_' + ifile)
    if 'smiles' in ifile:
        suppl = Chem.SmilesMolSupplier(ifile, titleLine=False, delimiter='\t')
        ofile = ofile.replace('.smiles', '.sdf')
        sdfinput = False
    if not os.path.isfile(ofile):
        writer = Chem.SDWriter(ofile)
        print 'Processing %s' % os.path.abspath(ifile), datetime.datetime.now()
        if sdfinput == False:
            with futures.ProcessPoolExecutor(max_workers=max_workers) as executor:
                # Submit a set of asynchronous jobs
                jobs = []
                for mol in suppl:
                    if mol:
                        raw_line = raw_file.readline().split()[1]  ## extracting molecule name from the original ifile
                        job = executor.submit(generateconformations, mol, n, raw_line)  ## returns molecules and associated ids / until here the conformers cannot be pickled
                        jobs.append(job)
                for job in jobs:
                    mol, ids, etable, name = job.result()
                    mol.SetProp('_Name', name)  ## Restoring lost property
                    mine = min(etable)  ## Lowest conformer energy
                    for i in ids:
                        if etable[i] > mine + 20:  ## Conformers with energies greater than min+20 will not be written
                            ids.remove(i)
                    for i in ids:
                        for j in ids:
                            if i != j:
                                if AllChem.GetConformerRMS(mol, i, j) < 0.5:  ## 0.5 A threshold for keeping conformers
                                    ids.remove(j)
                    for id in ids:
                        writer.write(mol, confId=id)
        writer.close()
    else:
        print '%s exists, skipping' % ofile
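
A side note on the pruning loops in the script: they call `ids.remove(...)` while iterating over `ids`, and in Python that can silently skip elements. One possible alternative, sketched below, builds a new list instead. `select_conformers` and its parameters are hypothetical names, and the greedy keep-the-first-seen rule is a swapped-in simplification, not necessarily the exact behaviour intended in the script.

```python
def select_conformers(ids, energies, rms, energy_window=20.0, rms_threshold=0.5):
    """Return the conformer ids to keep, without mutating `ids`.

    energies: sequence or mapping indexable by conformer id.
    rms: callable (i, j) -> RMS distance between conformers i and j.
    """
    ids = list(ids)
    if not ids:
        return []
    lowest = min(energies[i] for i in ids)
    kept = []
    for i in ids:
        if energies[i] > lowest + energy_window:
            continue  # too far above the lowest-energy conformer
        if any(rms(i, j) < rms_threshold for j in kept):
            continue  # too similar to a conformer that is already kept
        kept.append(i)
    return kept
```

With RDKit, `rms` could be a small wrapper such as `lambda i, j: AllChem.GetConformerRMS(mol, i, j)`, using the same call as the script, and `energies` would be the `etable` list.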
Re: [Rdkit-discuss] Memory management during conformer generation
Isn't the problem here that you are keeping an array (jobs) and you keep adding molecules to it, never letting the garbage collector collect/clear any memory? If your file has a million molecules, you will have an array of a million molecules in memory...

Why don't you process each single molecule (set name / remove similar confs etc. / remove high energy stuff), write it to file and release it, in the "if mol:" clause?

Cheers
JP
-
Jean-Paul Ebejer
Early Stage Researcher
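
The idea of handling each molecule as soon as it is ready, rather than collecting everything in `jobs`, could be sketched as follows while still using a worker pool. This is an illustration under assumptions: `stream_results`, `worker`, `handle` and `max_pending` are hypothetical names, and `double` stands in for the conformer job.

```python
from concurrent import futures


def stream_results(executor, worker, items, handle, max_pending=32):
    """Run worker(item) for every item with at most `max_pending` jobs in flight.

    Each result is passed to `handle` as soon as it is ready and its future
    is discarded right away. The order of results is not guaranteed.
    """
    pending = set()
    handled = 0
    for item in items:
        pending.add(executor.submit(worker, item))
        if len(pending) >= max_pending:
            # Block until at least one job finishes, then drain the finished ones.
            done, pending = futures.wait(
                pending, return_when=futures.FIRST_COMPLETED)
            for job in done:
                handle(job.result())
                handled += 1
    for job in futures.as_completed(pending):
        handle(job.result())
        handled += 1
    return handled


def double(x):
    # Hypothetical stand-in for an expensive per-molecule job.
    return 2 * x


if __name__ == "__main__":
    out = []
    with futures.ThreadPoolExecutor(max_workers=4) as pool:
        stream_results(pool, double, range(20), out.append, max_pending=5)
    print(sorted(out))
```

Here `handle` would be the place to set the name, prune conformers and write the molecule to the SD file, so that nothing accumulates between molecules. Because results arrive out of order, any per-molecule name needs to travel with the job, as the script already does by returning `name` from the worker.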