Re: [Rdkit-discuss] Code efficiency improvement

topgunhaides . Thu, 19 Dec 2019 09:34:28 -0800

Hi Greg,

Many thanks for the help!


The main reason I am "trying to generate a huge number of conformers
for a bunch of molecules" is to reproduce experimentally determined
structures.

To increase accuracy, I want to try at least the following:
- cover the entire conformational space (not quite possible for large
molecules, but I want to try my best), so I need a huge number of conformers
- cover a broad range of molecules (a bunch of molecules)
- energy filtering (drop high-energy conformers, so I need to compute at
least the energies of all conformers, without optimization)
- RMS filtering (call my own pruning code, but use GetBestRMS to get more
reliable RMSD values)
etc...
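
For the energy filtering item, a minimal sketch of what I have in mind is
below. It is not tested on my real data. The helper names (mmff_energies,
filter_by_energy) and the 10 kcal/mol window in the usage comment are
placeholders I made up for illustration, and MMFF94 is just one possible
choice of force field.

```python
def mmff_energies(mol):
    """Single-point MMFF94 energy per conformer id, without optimization."""
    from rdkit.Chem import AllChem  # lazy import; this helper needs RDKit

    props = AllChem.MMFFGetMoleculeProperties(mol)
    if props is None:
        raise ValueError("no MMFF parameters available for this molecule")
    energies = {}
    for conf in mol.GetConformers():
        ff = AllChem.MMFFGetMoleculeForceField(mol, props, confId=conf.GetId())
        energies[conf.GetId()] = ff.CalcEnergy()
    return energies


def filter_by_energy(energies, window):
    """Return the conformer ids whose energy is within `window` of the minimum."""
    if not energies:
        return []
    e_min = min(energies.values())
    return [cid for cid, e in energies.items() if e - e_min <= window]


# Hypothetical usage on an embedded molecule `mh`; the window is an assumption:
#   keep = filter_by_energy(mmff_energies(mh), window=10.0)
```

Keeping the filter separate from the energy calculation means the window can
be changed later without recomputing any energies.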

The steps above tend to be time-consuming when handling a huge number of
conformers (around 10k to 100k). First, though, I need to make the
conformer embedding step (and maybe also the energy step) as efficient as
possible.

About the RMSD pruning during embedding: I did notice the limitation with a
small RMSD threshold that you mentioned. At the same time, I also found that
a smaller RMSD pruning threshold tends to give me better results (in terms of
RMSD, shape similarity, etc.) in the final comparison with the experimental
structures. I will optimize the RMSD threshold based on a benchmark that I
will perform later.
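
To make the RMS filtering concrete, here is a rough sketch of a greedy
pruning pass that takes the RMSD function as an argument, so the threshold
can be varied in the benchmark. This is not my actual pruning code: the names
greedy_prune and best_rms_fn are invented for this example, and using a
heavy-atom copy of the molecule is an assumption.

```python
def greedy_prune(cids, rmsd, threshold):
    """Keep a conformer only if it is at least `threshold` from every kept one.

    `cids` should already be sorted by preference (e.g. lowest energy first).
    `rmsd(i, j)` is any callable returning the RMSD between conformers i and j.
    """
    kept = []
    for cid in cids:
        if all(rmsd(cid, k) >= threshold for k in kept):
            kept.append(cid)
    return kept


def best_rms_fn(mol):
    """Build an rmsd(i, j) callable backed by RDKit's GetBestRMS."""
    from rdkit.Chem import rdMolAlign  # lazy import; this helper needs RDKit

    def rmsd(i, j):
        # GetBestRMS takes symmetry into account and aligns conformer i
        # onto conformer j in place before returning the RMSD.
        return rdMolAlign.GetBestRMS(mol, mol, i, j)

    return rmsd


# Hypothetical usage; the heavy-atom copy is an assumption of this sketch:
#   heavy = Chem.RemoveHs(mh)
#   keep = greedy_prune(sorted_cids, best_rms_fn(heavy), threshold=0.5)
```

The result depends on the order of the input ids, which is why I would sort
by energy first.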

Improving efficiency is the priority at this moment. I am hoping that my
code can take less than maybe 10 minutes on average to process a
molecule like CID 831548. Note that my current embedding code is already
multi-threaded.
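
As a rough check on that budget, the timings from my first message can be
converted to a cost per requested conformer. This is only arithmetic on the
three numbers I reported; the variable names are arbitrary.

```python
# Wall-clock timings from my first message: numConfs -> seconds.
timings = {1000: 10.0, 5000: 66.0, 10000: 176.0}

# Cost per requested conformer, in milliseconds.
per_conf_ms = {n: 1000.0 * t / n for n, t in timings.items()}

for n, ms in per_conf_ms.items():
    print(f"numConfs={n:>6}: {ms:.1f} ms per requested conformer")
```

The cost per requested conformer rises from 10.0 ms to 13.2 ms to 17.6 ms, so
the run time grows faster than linearly with numConfs. I have not profiled
where the extra time goes.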

Any further comments and suggestions are very welcome!

Best,
Leon



On Thu, Dec 19, 2019 at 3:23 AM Greg Landrum <greg.land...@gmail.com> wrote:

> Hi Leon,
>
> If you want to be able to work efficiently on a problem like this, it's
> important to first take a step back and think about what you're doing.
>
> In this particular case you are asking the RDKit to generate 10000
> conformers for a molecule and requiring that the RMSD between every pair of
> those conformers is at least 0.5 Å. For small molecules this is very likely
> to be impossible: you cannot find 10K physically reasonable conformers
> that are 0.5 Å RMSD apart.
>
> I pulled down a copy of the SDF for CID 831548 from PubChem and tried
> generating 500 conformers for it using the standard ETKDGv2 parameters,
> which do not do RMS pruning (this runs single-threaded, which is why it is
> comparatively slow):
>
> In [3]: m = Chem.AddHs(Chem.MolFromMolFile('./Structure2D_CID_831548.sdf'))
>
> In [19]: ps = rdDistGeom.ETKDGv2()
>
> In [20]: t1=time.time();rdDistGeom.EmbedMultipleConfs(m,500,ps);print(f'{time.time()-t1: .2f}')
>  31.70
>
> In [21]: m.GetNumConformers()
> Out[21]: 500
>
> You can see I get 500 conformers here, but if I turn on RMS pruning it
> takes a bit longer (the RMSD calculation is not free) and only generates 66
> conformers:
>
> In [22]: ps.pruneRmsThresh = 0.5
>
> In [23]: t1=time.time();rdDistGeom.EmbedMultipleConfs(m,500,ps);print(f'{time.time()-t1: .2f}')
>  33.32
>
> In [24]: m.GetNumConformers()
> Out[24]: 66
>
>
> If I try for 1000 conformers it takes twice as long and I still get <100
> results. It's just not possible to find a huge number of physically
> reasonable conformers that satisfy the RMSD requirements.
>
> I am a bit surprised by the scaling of the times that you are seeing:
>
>> numConfs=1000,   time elapsed: 10 seconds
>> numConfs=5000,   time elapsed: 66 seconds
>> numConfs=10000, time elapsed: 176 seconds
>
>
> I would expect the conformer generation to scale more or less linearly
> with the number of conformers being requested, but that's a minor concern
> compared to the larger problems here.
>
> In order to be able to make actually useful suggestions about speeding
> things up, it would help if you described why you are trying to generate a
> huge number of conformers for a bunch of molecules.
>
>
> On Wed, Dec 18, 2019 at 11:40 PM topgunhaides . <sunzhi....@gmail.com>
> wrote:
>
>> Hi guys,
>>
>> Can anyone give me some advice on improving the efficiency of the
>> embedding code? See the example below:
>>
>>
>> import time
>> from rdkit import Chem
>> from rdkit.Chem import AllChem
>>
>> suppl = Chem.SDMolSupplier('cid831548.sdf')   # medium-sized molecule (10 heavy atoms)
>>
>> for mol in suppl:
>>     mh = Chem.AddHs(mol, addCoords=True)
>>
>>     # embedding
>>     start = time.time()
>>     AllChem.EmbedMultipleConfs(mh, numConfs=5000, maxAttempts=100,
>>                                pruneRmsThresh=0.5, randomSeed=1, numThreads=0,
>>                                enforceChirality=True,
>>                                useExpTorsionAnglePrefs=True,
>>                                useBasicKnowledge=True)
>>     cids = [conf.GetId() for conf in mh.GetConformers()]
>>     end = time.time()
>>     print("time elapsed: ", end - start)
>>
>>
>> The results:
>> numConfs=1000,   time elapsed: 10 seconds
>> numConfs=5000,   time elapsed: 66 seconds
>> numConfs=10000, time elapsed: 176 seconds
>>
>> I need to request a lot more than 10000 conformers per molecule and have
>> a lot of molecules to process.
>> I also want to compute conformer energies and, hopefully, run
>> optimization (both are time-consuming), so I need to make my code as
>> efficient as possible. Thank you!
>>
>> Best,
>> Leon
>>
>>
>> _______________________________________________
>> Rdkit-discuss mailing list
>> Rdkit-discuss@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>
>

