Re: [Rdkit-discuss] Memory management during conformer generation

2015-06-29 Thread az
Many thanks for your replies

Triggering gc doesn't seem to help (though the object count goes down), but 
reducing how much is processed at a time does. I didn't actually have to go 
down to a molecule-at-a-time level, but made do with input files half the 
previous size. The RAM still fills up after a few files, but then stays that 
way without overflowing the swap and crashing things.
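
For anyone hitting the same thing, an equivalent in-code approach would be to 
submit and drain the jobs in fixed-size batches instead of queueing a whole 
file at once. A rough sketch only (the batch size and the drain() helper are 
just illustrative; generateconformations, max_workers, n, suppl, raw_file and 
writer are meant to be the ones from my script quoted later in this thread):

from concurrent import futures

batch_size = 500   # illustrative; pick something the RAM can comfortably hold

def drain(jobs, writer):
    # collect the results of one batch and write them out before submitting more
    for job in jobs:
        mol, ids, etable, name = job.result()
        mol.SetProp("_Name", name)
        # ... energy / RMS pruning as in the original script ...
        for cid in ids:
            writer.write(mol, confId=cid)

with futures.ProcessPoolExecutor(max_workers=max_workers) as executor:
    jobs = []
    for mol in suppl:
        if mol:
            name = raw_file.readline().split()[1]
            jobs.append(executor.submit(generateconformations, mol, n, name))
            if len(jobs) == batch_size:
                drain(jobs, writer)
                jobs = []
    if jobs:
        drain(jobs, writer)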

Cheers,
Adam

On 27-Jun-15 16:54, Dmitri Maziuk wrote:
 On 6/27/2015 5:45 AM, Greg Landrum wrote:
 ...
 That's assuming that the futures code (which I don't know) isn't doing
 anything odd behind the scenes to hold onto references.
 Or every mol in the supplier holds a pointer into a C++ DLL that the Python
 VM doesn't quite know how to garbage-collect, which keeps a still-referenced
 object inside the job, which means jobs=[] creates a new list without
 deleting the old jobs. Who knows.

 The first answer was the right one: process one molecule at a time. Even
 better, split the input file into one file per molecule, then use EC2,
 Condor, or OSG to run your one-at-a-time script on all of them at once.

 Dima




Re: [Rdkit-discuss] Memory management during conformer generation

2015-06-27 Thread Dmitri Maziuk
On 6/26/2015 9:48 AM, az wrote:
 Thanks Jean-Paul

 You're right that I eat up a lot of memory with large files, but I think
 it's not the whole story. If it were, my memory should come back each
 time a new file is being read (jobs=[]), no?

No. It's a feature of garbage collection: your memory may come back 
anytime between then and program exit.

Dima




Re: [Rdkit-discuss] Memory management during conformer generation

2015-06-27 Thread David Hall
 On Jun 27, 2015, at 6:05 AM, Dmitri Maziuk dmaz...@bmrb.wisc.edu wrote:
 
 On 6/26/2015 9:48 AM, az wrote:
 Thanks Jean-Paul
 
 You're right that I eat up a lot of memory with large files, but I think
 it's not the whole story. If it were, my memory should come back each
 time a new file is being read (jobs=[]), no?
 
 No. It's a feature of garbage collection: your memory may come back 
 anytime between then and program exit.

One could always trigger garbage collection manually.

https://docs.python.org/2/library/gc.html#gc.collect

And use gc.get_count() to make sure the count of objects went down with the 
collection.
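
For example, something along these lines right after dropping the jobs list 
(just a sketch; where exactly it goes in the script is an assumption on my part):

import gc

jobs = []                      # drop the references first; gc cannot free live objects
print 'before collect:', gc.get_count()
n_unreachable = gc.collect()   # returns the number of unreachable objects found
print 'collected %d, counts now: %s' % (n_unreachable, gc.get_count())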

-David


Re: [Rdkit-discuss] Memory management during conformer generation

2015-06-27 Thread Greg Landrum
I apologize that I haven't had a chance to look at this in detail yet, but
I can at least give a quick answer to the below:
Python uses a deterministic scheme for doing garbage collection based on
reference counting, so memory should be freed as soon as you do jobs=[].
That's assuming that the futures code (which I don't know) isn't doing
anything odd behind the scenes to hold onto references.

-greg

On Friday, June 26, 2015, az adam.zalew...@mail.com wrote:

  Thanks Jean-Paul

 You're right that I eat up a lot of memory with large files, but I think
 it's not the whole story. If it were, my memory should come back each time a
 new file is being read (jobs=[]), no? Instead I hit my limit after 8-10
 very similar input files, even though the usage after 2-3 is around 1/3 of
 my RAM.

 Cheers,
 Adam

 On 24-Jun-15 17:38, JP wrote:

 Isn't the problem here that you are keeping an array (jobs) and keep adding
 molecules to it, never letting the garbage collector collect/clear any
 memory? If your file has a million molecules, you will have an array of a
 million molecules in memory...

 Why don't you process each single molecule (set name / remove similar
 confs / remove high-energy stuff), write it to file, and release it, in
 the if mol: clause...

  Cheers
 JP

  -
 Jean-Paul Ebejer
 Early Stage Researcher


Re: [Rdkit-discuss] Memory management during conformer generation

2015-06-27 Thread Dmitri Maziuk
On 6/27/2015 5:45 AM, Greg Landrum wrote:
...
 That's assuming that the futures code (which I don't know) isn't doing
 anything odd behind the scenes to hold onto references.

Or every mol in the supplier holds a pointer into a C++ DLL that the Python 
VM doesn't quite know how to garbage-collect, which keeps a still-referenced 
object inside the job, which means jobs=[] creates a new list without 
deleting the old jobs. Who knows.

The first answer was the right one: process one molecule at a time. Even 
better, split the input file into one file per molecule, then use EC2, 
Condor, or OSG to run your one-at-a-time script on all of them at once; 
a rough splitting sketch is below.
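
Something like this for a tab-delimited SMILES file (purely a sketch; the 
input and output names are just examples):

import os

def split_smiles(infile, outdir):
    # write each "SMILES<TAB>name" line to its own one-molecule file
    with open(infile) as fh:
        for i, line in enumerate(fh):
            if not line.strip():
                continue
            with open(os.path.join(outdir, 'mol_%06d.smiles' % i), 'w') as out:
                out.write(line)

split_smiles('library.smiles', 'split_input')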

Dima




Re: [Rdkit-discuss] Memory management during conformer generation

2015-06-26 Thread az

Thanks Jean-Paul

You're right that I eat up a lot of memory with large files, but I think 
it's not the whole story. If it were, my memory should come back each 
time a new file is being read (jobs=[]), no? Instead I hit my limit 
after 8-10 very similar input files, even though the usage after 2-3 is 
around 1/3 of my RAM.


Cheers,
Adam

On 24-Jun-15 17:38, JP wrote:
Isn't the problem here that you are keeping an array (jobs) and keep 
adding molecules to it, never letting the garbage collector collect/clear 
any memory? If your file has a million molecules, you will have an array 
of a million molecules in memory...

Why don't you process each single molecule (set name / remove similar 
confs / remove high-energy stuff), write it to file, and release it, in 
the if mol: clause...


Cheers
JP

-
Jean-Paul Ebejer
Early Stage Researcher


[Rdkit-discuss] Memory management during conformer generation

2015-06-24 Thread az

Hi

Using the cookbook code as a basis (apologies if I should have posted in 
the corresponding topic), I've put together a script to generate conformers 
for my SMILES library. It works like a charm too, aside from the fact that 
after 10-20 hours I'm out of RAM and swap (the memory consumption seems to 
accumulate with each iteration). I'd appreciate any hints for getting this 
resolved (and any other hints as well).


Thanks a lot,
Adam

the code

## Imports (not shown in the original post) needed to run this script:
import sys, os, glob, datetime
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem.PropertyMol import PropertyMol
from concurrent import futures   ## 'futures' backport package on Python 2

max_workers = 16

def generateconformations(m, n, name=''):
    m = Chem.AddHs(m)
    ids = AllChem.EmbedMultipleConfs(m, numConfs=n, pruneRmsThresh=0.5,
                                     randomSeed=1)
    etable = []  ## Gathers conformer energies

    for id in ids:
        ff = AllChem.UFFGetMoleculeForceField(m, confId=id)
        ff.Minimize()
        etable.append(ff.CalcEnergy())

    return PropertyMol(m), list(ids), etable, name

input_dir, output_dir = sys.argv[1:3]
n = 75  ## Number of conformers to request per molecule

os.chdir(input_dir)
for ifile in glob.glob('*.s*'):

    raw_file = open(ifile, 'r')  ## To get back the molecule name later on
    ofile = os.path.join(output_dir, 'conf_' + ifile)

    if 'smiles' in ifile:
        suppl = Chem.SmilesMolSupplier(ifile, titleLine=False, delimiter='\t')
        ofile = ofile.replace('.smiles', '.sdf')
        sdfinput = False

    if not os.path.isfile(ofile):

        writer = Chem.SDWriter(ofile)

        print 'Processing %s' % os.path.abspath(ifile), datetime.datetime.now()

        if sdfinput == False:
            with futures.ProcessPoolExecutor(max_workers=max_workers) as executor:
                # Submit a set of asynchronous jobs
                jobs = []

                for mol in suppl:
                    if mol:
                        raw_line = raw_file.readline().split()[1]  ## extracting the molecule name from the original ifile
                        job = executor.submit(generateconformations, mol, n, raw_line)  ## returns the molecule and associated ids; until here the conformers cannot be pickled
                        jobs.append(job)

                for job in jobs:
                    mol, ids, etable, name = job.result()
                    mol.SetProp("_Name", name)  ## Restoring the lost property
                    mine = min(etable)  ## Lowest conformer energy

                    for i in list(ids):  ## iterate over a copy, since ids is modified below
                        if etable[i] > mine + 20:  ## Conformers with energies greater than min+20 will not be written
                            ids.remove(i)
                    for i in list(ids):
                        for j in list(ids):
                            if i != j:
                                if AllChem.GetConformerRMS(mol, i, j) < 0.5:  ## 0.5 A threshold for keeping conformers
                                    ids.remove(j)
                    for id in ids:
                        writer.write(mol, confId=id)

        writer.close()

    else:
        print '%s exists, skipping' % ofile

===





Re: [Rdkit-discuss] Memory management during conformer generation

2015-06-24 Thread JP
Isn't the problem here that you are keeping an array (jobs) and keep adding
molecules to it, never letting the garbage collector collect/clear any
memory? If your file has a million molecules, you will have an array of a
million molecules in memory...

Why don't you process each single molecule (set name / remove similar confs /
remove high-energy stuff), write it to file, and release it, in the
if mol: clause? Roughly along the lines of the sketch below.
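
Something like this, as a rough sketch only (generateconformations, n, suppl, 
raw_file and writer are meant to be the ones from your script; the pruning 
step is elided):

for mol in suppl:
    if mol:
        name = raw_file.readline().split()[1]
        mol, ids, etable, name = generateconformations(mol, n, name)
        mol.SetProp("_Name", name)
        # ... prune by energy and conformer RMS here, as in your script ...
        for cid in ids:
            writer.write(mol, confId=cid)
        del mol, ids, etable   # nothing from this molecule survives the iteration

You give up the ProcessPoolExecutor parallelism this way; a middle ground is 
to keep the pool but consume results as they finish (futures.as_completed) 
and write each one out immediately instead of holding the whole jobs list.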

Cheers
JP

-
Jean-Paul Ebejer
Early Stage Researcher
