Re: [Rdkit-discuss] Code efficiency improvement
On 12/19/19 7:27 PM, Francois Berenger wrote:
> You should parallelize the processing of molecules, since each can be
> worked at independently.

Well, for "a lot" of conformers on "a lot" of molecules that'll work if you have access to a compute cluster and/or are willing to pay for spinning up a bunch of VMs on Amazon etc. Otherwise the best you can hope for is to run maybe two per CPU core.

--
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu

___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
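A rough way to see the limit described in this message is to estimate wall-clock time from a per-molecule timing and the number of workers. This is only a back-of-the-envelope sketch: `estimate_wall_time` is a hypothetical helper, the molecule and worker counts are made-up inputs, and it assumes every molecule costs about the same and that workers do not compete for cores. The 66 s figure is the timing reported in the original post for numConfs=5000; it was measured with numThreads=0, so it would have to be re-measured per worker before the estimate can be trusted.

```python
import math


def estimate_wall_time(n_molecules: int, seconds_per_molecule: float, n_workers: int) -> float:
    """Estimate wall-clock seconds when molecules are processed in parallel."""
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    # Each worker handles one molecule at a time, so the job proceeds in
    # ceil(n_molecules / n_workers) rounds of roughly equal length.
    rounds = math.ceil(n_molecules / n_workers)
    return rounds * seconds_per_molecule


# Hypothetical workload: 1000 molecules at about 66 s each on 8 workers.
hours = estimate_wall_time(1000, 66.0, 8) / 3600
print(f"{hours:.1f} h")
```

Doubling the number of workers only halves the estimate, which is why very large conformer counts on many molecules quickly call for a cluster rather than a single machine.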
Re: [Rdkit-discuss] Code efficiency improvement
Hi Rafal,

Thank you for this suggestion. I will try these to see the changes.

Best,
Leon

On Thu, Dec 19, 2019 at 4:37 AM Rafal Roszak wrote:
> On Wed, 18 Dec 2019 22:54:04 -0500
> "topgunhaides ." wrote:
>
> > For large and flexible molecules, will need a lot more than 10K (like
> > 100K) to try to cover the entire conformational space.
>
> In such a case
>
> useExpTorsionAnglePrefs=True,
> useBasicKnowledge=True
>
> can make your conformational set less diverse.
> I suggest you check your space with and without these options.
>
> Best,
>
> Rafał
Re: [Rdkit-discuss] Code efficiency improvement
Hi Michal,

Many thanks for the help! MMFF will mainly be used to remove only the (very) high-energy conformers, which is good news here.

There is one dilemma: without optimization, some potentially "good" conformers could be filtered out, because small unreasonable atomic displacements can increase the energy by a lot. With optimization, however, many different conformers will just converge to the same structure.

I will probably focus on just energies, without optimization. I even tried UFF, which I found much faster than MMFF94s.

Best,
Leon

On Thu, Dec 19, 2019 at 7:49 AM Michal Krompiec wrote:
> For mid-to-lower-energy conformers, MMFF relative energies are
> essentially a fancy random-number generator. Still, all depends on
> what you need this for. If you just want to filter out (very) high
> energy conformers, your approach might work. But if you also want to
> perform Boltzmann averaging over conformational ensemble (of lower
> energy conformers), you will be disappointed.
> BTW conformational analysis of your molecule with CREST (20 OMP
> threads, -quick -norotmd) took 219 seconds and yielded 28 conformers
> with energy up to 3.5 kcal/mol higher than lowest energy structure. So
> it is ~2 orders of magnitude slower than MMFF.
> Best,
> Michal
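The energy-only filtering discussed in this message could be sketched roughly as follows. The helper names `single_point_energies` and `filter_by_energy` are hypothetical, as is the 5 kcal/mol window in the usage lines; only the RDKit force-field calls inside the first function are real API. The first function computes single-point energies without any optimization, and the second keeps the conformers that fall within a chosen window above the lowest energy found.

```python
from typing import Dict, List


def single_point_energies(mol, use_uff: bool = False) -> Dict[int, float]:
    """Force-field energy (kcal/mol) of every conformer, without optimization."""
    # Imported lazily so filter_by_energy below can be used without RDKit.
    from rdkit.Chem import AllChem

    props = None
    if not use_uff:
        props = AllChem.MMFFGetMoleculeProperties(mol, mmffVariant="MMFF94s")
        if props is None:
            raise ValueError("MMFF parameters are not available for this molecule")

    energies: Dict[int, float] = {}
    for conf in mol.GetConformers():
        cid = conf.GetId()
        if use_uff:
            ff = AllChem.UFFGetMoleculeForceField(mol, confId=cid)
        else:
            ff = AllChem.MMFFGetMoleculeForceField(mol, props, confId=cid)
        energies[cid] = ff.CalcEnergy()
    return energies


def filter_by_energy(energies: Dict[int, float], window: float) -> List[int]:
    """Return the IDs of conformers within `window` kcal/mol of the minimum."""
    if not energies:
        return []
    e_min = min(energies.values())
    return sorted(cid for cid, e in energies.items() if e - e_min <= window)


# Made-up energies keyed by conformer ID, standing in for real output.
example = {0: 10.0, 1: 12.5, 2: 80.0}
print(filter_by_energy(example, window=5.0))
```

Because unoptimized geometries can carry large strain from small distortions, a generous window is probably the safer choice when the energies are used only to discard clearly unreasonable conformers.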
Re: [Rdkit-discuss] Code efficiency improvement
Hi Greg,

Many thanks for the help! The main reason I am "trying to generate a huge number of conformers for a bunch of molecules" is to reproduce experimentally determined structures. To increase accuracy, I want to try the following at least:

- cover the entire conformational space (not quite possible for large molecules, but I want to try my best), so I need a huge number of conformers
- cover a broad range of molecules (a bunch of molecules)
- energy filtering (drop high-energy conformers, so I need to compute at least the energies of all conformers, without optimization)
- RMS filtering (call my own pruning code, but use GetBestRMS to get more reliable RMSD values)

etc.

The steps above tend to be time-consuming when handling a huge number of conformers (like 10k - 100k). But first I need to make the conformer embedding step (and maybe also the energy step) as efficient as possible.

About the RMSD pruning during embedding: I did notice the limitation with a small RMSD threshold that you mentioned. At the same time, I also found that a smaller RMSD pruning threshold tends to give me better results (in terms of RMSD, shape similarity, etc.) in the final comparison with experimental structures. The RMSD threshold is going to be optimized based on the benchmark I will perform later.

Improving efficiency is the priority at this moment. I am hoping that my code can take less than maybe 10 minutes on average to process a molecule like CID 831548. Note that my current embedding code is already multi-threaded.

Any further comments and suggestions are very welcome!

Best,
Leon

On Thu, Dec 19, 2019 at 3:23 AM Greg Landrum wrote:
> Hi Leon,
>
> If you want to be able to work efficiently on a problem like this, it's
> important to first take a step back and think about what you're doing.
>
> In this particular case you are asking the RDKit to generate 10000
> conformers for a molecule and requiring that the RMSD between each of those
> conformers is at least 0.5A. For small molecules this is very likely to be
> impossible: it's impossible to find 10K physically reasonable conformers
> that are 0.5A RMSD apart.
>
> I pulled down a copy of the SDF for CID 831548 from PubChem and tried
> generating 500 conformers for it using the standard ETKDGv2 parameters
> (this runs single-threaded, which is why this is comparatively slow), which
> does not do RMS pruning
>
> In [3]: m = Chem.AddHs(Chem.MolFromMolFile('./Structure2D_CID_831548.sdf'))
>
> In [19]: ps = rdDistGeom.ETKDGv2()
>
> In [20]: t1=time.time();rdDistGeom.EmbedMultipleConfs(m,500,ps);print(f'{time.time()-t1 : .2f}')
>  31.70
>
> In [21]: m.GetNumConformers()
> Out[21]: 500
>
> You can see I get 500 conformers here, but if I turn on RMS pruning it
> takes a bit longer (the RMSD calculation is not free) and only generates 66
> conformers:
>
> In [22]: ps.pruneRmsThresh = 0.5
>
> In [23]: t1=time.time();rdDistGeom.EmbedMultipleConfs(m,500,ps);print(f'{time.time()-t1 : .2f}')
>  33.32
>
> In [24]: m.GetNumConformers()
> Out[24]: 66
>
> If I try for 1000 conformers it takes twice as long and I still get <100
> results. It's just not possible to find a huge number of physically
> reasonable conformers that satisfy the RMSD requirements.
>
> I am a bit surprised by the scaling of the times that you are seeing:
>
>> numConfs=1000, time elapsed: 10 seconds
>> numConfs=5000, time elapsed: 66 seconds
>> numConfs=10000, time elapsed: 176 seconds
>
> I would expect the conformer generation to scale more or less linearly
> with the number of conformers being requested, but that's a minor concern
> compared to the larger problems here.
>
> In order to be able to make actually useful suggestions about speeding
> things up, it would help if you described why you are trying to generate a
> huge number of conformers for a bunch of molecules.
Re: [Rdkit-discuss] Code efficiency improvement
For mid-to-lower-energy conformers, MMFF relative energies are essentially a fancy random-number generator. Still, it all depends on what you need this for. If you just want to filter out (very) high-energy conformers, your approach might work. But if you also want to perform Boltzmann averaging over the conformational ensemble (of lower-energy conformers), you will be disappointed.

BTW, conformational analysis of your molecule with CREST (20 OMP threads, -quick -norotmd) took 219 seconds and yielded 28 conformers with energies up to 3.5 kcal/mol above the lowest-energy structure. So it is ~2 orders of magnitude slower than MMFF.

Best,
Michal

On Thu, 19 Dec 2019 at 03:53, topgunhaides . wrote:
>
> Hi Michal,
>
> Many thanks for the help! I am looking for an ensemble of conformers.
> My priority is to use RDKit to generate a large ensemble of conformers for
> each molecule.
> For large and flexible molecules, will need a lot more than 10K (like 100K)
> to try to cover the entire conformational space.
>
> I do not have to use MMFF to optimize all conformers, but I do want to use
> MMFF or UFF to get at least the energies of all conformers (which is also
> quite time-consuming, even without optimization).
> With the conformer energies, I can call some energy_filtering function to
> filter out conformers with high energies, etc.
> I am thinking that storing and processing a huge number of conformers could
> be the reason to slow things down, but not quite sure.
> Any suggestions are very welcome!
>
> Best,
> Leon
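To make the Boltzmann-averaging point concrete, here is a minimal sketch of how conformer weights follow from relative energies. The function name `boltzmann_weights` and the default temperature of 298.15 K are assumptions for illustration; the example energies are made up, apart from the 3.5 kcal/mol span mentioned in the message.

```python
import math
from typing import List

R_KCAL = 0.0019872  # gas constant in kcal/(mol*K)


def boltzmann_weights(energies: List[float], temperature: float = 298.15) -> List[float]:
    """Normalized Boltzmann weights for conformer energies given in kcal/mol."""
    if not energies:
        return []
    # Shift by the minimum so the largest exponent is zero (numerically safe).
    e_min = min(energies)
    factors = [math.exp(-(e - e_min) / (R_KCAL * temperature)) for e in energies]
    total = sum(factors)
    return [f / total for f in factors]


# Made-up relative energies in kcal/mol.
for energy, weight in zip([0.0, 1.0, 3.5], boltzmann_weights([0.0, 1.0, 3.5])):
    print(f"{energy:4.1f} kcal/mol -> weight {weight:.3f}")
```

At room temperature a difference of 1 kcal/mol changes the relative population by a factor of about 5, so errors of that size in force-field energies are enough to reorder the ensemble.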
Re: [Rdkit-discuss] Code efficiency improvement
On Wed, 18 Dec 2019 22:54:04 -0500 "topgunhaides ." wrote:
> For large and flexible molecules, will need a lot more than 10K (like
> 100K) to try to cover the entire conformational space.

In such a case

useExpTorsionAnglePrefs=True,
useBasicKnowledge=True

can make your conformational set less diverse.
I suggest you check your space with and without these options.

Best,

Rafał
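One way to act on this suggestion is to embed the same molecule twice, once with the two options and once without, and compare how spread out the resulting sets are. The sketch below covers only the comparison step. `diversity_summary` is a hypothetical helper that takes a flat list of pairwise RMSD values (for instance the kind of list `AllChem.GetConformerRMSMatrix` returns, assuming that function is available in your RDKit version); the numbers in the usage lines are made up.

```python
from statistics import mean
from typing import Dict, Sequence


def diversity_summary(pairwise_rmsd: Sequence[float]) -> Dict[str, float]:
    """Summarize the spread of a conformer set from its pairwise RMSD values."""
    if not pairwise_rmsd:
        return {"pairs": 0, "mean": 0.0, "max": 0.0}
    return {
        "pairs": len(pairwise_rmsd),
        "mean": mean(pairwise_rmsd),
        "max": max(pairwise_rmsd),
    }


# Made-up RMSD values (in Angstrom) standing in for two embedding runs.
with_options = [0.6, 0.8, 1.0]
without_options = [0.7, 1.4, 2.1]
print("with options:   ", diversity_summary(with_options))
print("without options:", diversity_summary(without_options))
```

A clearly larger mean or maximum without the options would suggest that the knowledge terms are narrowing the sampled space for that molecule.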
Re: [Rdkit-discuss] Code efficiency improvement
Hi Leon,

If you want to be able to work efficiently on a problem like this, it's important to first take a step back and think about what you're doing.

In this particular case you are asking the RDKit to generate 10000 conformers for a molecule and requiring that the RMSD between each of those conformers is at least 0.5A. For small molecules this is very likely to be impossible: it's impossible to find 10K physically reasonable conformers that are 0.5A RMSD apart.

I pulled down a copy of the SDF for CID 831548 from PubChem and tried generating 500 conformers for it using the standard ETKDGv2 parameters, which do not do RMS pruning (this runs single-threaded, which is why it is comparatively slow):

In [3]: m = Chem.AddHs(Chem.MolFromMolFile('./Structure2D_CID_831548.sdf'))

In [19]: ps = rdDistGeom.ETKDGv2()

In [20]: t1=time.time();rdDistGeom.EmbedMultipleConfs(m,500,ps);print(f'{time.time()-t1 : .2f}')
 31.70

In [21]: m.GetNumConformers()
Out[21]: 500

You can see I get 500 conformers here, but if I turn on RMS pruning it takes a bit longer (the RMSD calculation is not free) and only generates 66 conformers:

In [22]: ps.pruneRmsThresh = 0.5

In [23]: t1=time.time();rdDistGeom.EmbedMultipleConfs(m,500,ps);print(f'{time.time()-t1 : .2f}')
 33.32

In [24]: m.GetNumConformers()
Out[24]: 66

If I try for 1000 conformers it takes twice as long and I still get <100 results. It's just not possible to find a huge number of physically reasonable conformers that satisfy the RMSD requirements.

I am a bit surprised by the scaling of the times that you are seeing:

> numConfs=1000, time elapsed: 10 seconds
> numConfs=5000, time elapsed: 66 seconds
> numConfs=10000, time elapsed: 176 seconds

I would expect the conformer generation to scale more or less linearly with the number of conformers being requested, but that's a minor concern compared to the larger problems here.

In order to be able to make actually useful suggestions about speeding things up, it would help if you described why you are trying to generate a huge number of conformers for a bunch of molecules.

On Wed, Dec 18, 2019 at 11:40 PM topgunhaides . wrote:
> Hi guys,
>
> Can anyone give me some advice to improve the efficiency of the embedding
> code? See example below:
>
>
> import time
> from rdkit import Chem
> from rdkit.Chem import AllChem
>
> suppl = Chem.SDMolSupplier('cid831548.sdf')  # medium size molecule (10 heavy atoms)
>
> for mol in suppl:
>     mh = Chem.AddHs(mol, addCoords=True)
>
>     # embedding
>     start = time.time()
>     AllChem.EmbedMultipleConfs(mh, numConfs=5000, maxAttempts=100,
>                                pruneRmsThresh=0.5,
>                                randomSeed=1, numThreads=0,
>                                enforceChirality=True,
>                                useExpTorsionAnglePrefs=True,
>                                useBasicKnowledge=True)
>     cids = [conf.GetId() for conf in mh.GetConformers()]
>     end = time.time()
>     print("time elapsed: ", end - start)
>
>
> The results:
> numConfs=1000, time elapsed: 10 seconds
> numConfs=5000, time elapsed: 66 seconds
> numConfs=10000, time elapsed: 176 seconds
>
> I need to request a lot more than 10000 conformers per molecule and have a
> lot of molecules to process.
> I also wish to compute conformer energies and hopefully can do
> optimization (both are time consuming). So need to make my code as
> efficient as possible. Thank you!
>
> Best,
> Leon
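To illustrate why RMS pruning caps the number of conformers no matter how many are requested, here is a toy greedy pruner. It is not RDKit's implementation, only an illustration under the assumption that a candidate is kept when it is at least `threshold` away from every conformer already kept. `greedy_prune` and the one-dimensional "conformers" are hypothetical; in practice the distance would be an RMSD between two conformers.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def greedy_prune(
    candidates: Sequence[T],
    distance: Callable[[T, T], float],
    threshold: float,
) -> List[T]:
    """Keep each candidate that is at least `threshold` from all kept ones."""
    kept: List[T] = []
    for cand in candidates:
        # Every candidate is compared with the whole kept set, so the cost per
        # candidate grows as more conformers are accepted.
        if all(distance(cand, k) >= threshold for k in kept):
            kept.append(cand)
    return kept


# Toy example: points on a line, with the absolute difference as "RMSD".
points = [0.0, 0.1, 0.6, 0.7, 1.3, 1.2, 0.05]
print(greedy_prune(points, lambda a, b: abs(a - b), 0.5))
```

Once the kept set covers the accessible space at the chosen threshold, further candidates are almost always rejected, which matches the observation that asking for more conformers costs more time without yielding many more results.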
Re: [Rdkit-discuss] Code efficiency improvement
Hi Michal,

Many thanks for the help! I am looking for an ensemble of conformers. My priority is to use RDKit to generate a large ensemble of conformers for each molecule. For large and flexible molecules, I will need a lot more than 10K (like 100K) to try to cover the entire conformational space.

I do not have to use MMFF to optimize all conformers, but I do want to use MMFF or UFF to get at least the energies of all conformers (which is also quite time-consuming, even without optimization). With the conformer energies, I can call some energy_filtering function to filter out conformers with high energies, etc.

I suspect that storing and processing a huge number of conformers could be what slows things down, but I am not quite sure. Any suggestions are very welcome!

Best,
Leon

On Wed, Dec 18, 2019 at 7:08 PM Michal Krompiec wrote:
> Are you looking for the global minimum or an ensemble of conformers?
> Either way, this is already very fast. Bear in mind, however, that MMFF’s
> accuracy isn’t great for this type of tasks (see for example
> https://arxiv.org/pdf/1705.04308.pdf ). In other words, I don’t see a use
> case for generation of 10k or more conformers with MMFF. And super-fast
> generation of large conformational ensembles for arbitrary molecules just
> isn’t realistic.
> Best,
> Michal
Re: [Rdkit-discuss] Code efficiency improvement
Are you looking for the global minimum or an ensemble of conformers? Either way, this is already very fast. Bear in mind, however, that MMFF’s accuracy isn’t great for this type of task (see for example https://arxiv.org/pdf/1705.04308.pdf ). In other words, I don’t see a use case for generation of 10k or more conformers with MMFF. And super-fast generation of large conformational ensembles for arbitrary molecules just isn’t realistic.

Best,
Michal

On Wed, 18 Dec 2019 at 22:40, topgunhaides . wrote:
> Hi guys,
>
> Can anyone give me some advice to improve the efficiency of the embedding
> code? See example below:
>
>
> import time
> from rdkit import Chem
> from rdkit.Chem import AllChem
>
> suppl = Chem.SDMolSupplier('cid831548.sdf')  # medium size molecule (10 heavy atoms)
>
> for mol in suppl:
>     mh = Chem.AddHs(mol, addCoords=True)
>
>     # embedding
>     start = time.time()
>     AllChem.EmbedMultipleConfs(mh, numConfs=5000, maxAttempts=100,
>                                pruneRmsThresh=0.5,
>                                randomSeed=1, numThreads=0,
>                                enforceChirality=True,
>                                useExpTorsionAnglePrefs=True,
>                                useBasicKnowledge=True)
>     cids = [conf.GetId() for conf in mh.GetConformers()]
>     end = time.time()
>     print("time elapsed: ", end - start)
>
>
> The results:
> numConfs=1000, time elapsed: 10 seconds
> numConfs=5000, time elapsed: 66 seconds
> numConfs=10000, time elapsed: 176 seconds
>
> I need to request a lot more than 10000 conformers per molecule and have a
> lot of molecules to process.
> I also wish to compute conformer energies and hopefully can do
> optimization (both are time consuming). So need to make my code as
> efficient as possible. Thank you!
>
> Best,
> Leon