Re: [Rdkit-discuss] RDKit molecule standardization/normalization protocol

Paolo Tosco Thu, 24 Jun 2021 09:59:53 -0700

Hi JP,

The problem is that the reaction SMARTS that standardizes pyridine *N*-oxides
is not specific enough, so it also matches your molecule, which is not
actually an *N*-oxide but rather an *N*-hydroxypyridinium ion.
I will submit a PR to fix the reaction pattern; in the meantime you can work
around the problem by loading a custom list of normalization reaction SMARTS,
as shown in this gist:


https://gist.github.com/ptosco/2b19142ff8fd6afdfee12836cec73d4f
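
For readers who only want the general mechanism, here is a minimal sketch of
loading a custom transform list. It is not the content of the gist: it assumes
rdMolStandardize.NormalizerFromData is available in your RDKit version, and the
one-entry list below is purely illustrative (a real replacement list would
presumably carry all the default transforms, with the pyridine N-oxide entry
made more specific).

```python
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize

# Illustrative transform list (an assumption, NOT the list from the gist).
# Format: one "Name<TAB>reaction SMARTS" entry per line; "//" starts a comment.
CUSTOM_TRANSFORMS = (
    "// Name\tSMARTS\n"
    "Sulfone to S(=O)(=O)\t[S+2:1]([O-:2])([O-:3])>>[S+0:1](=[O-0:2])(=[O-0:3])\n"
)

params = rdMolStandardize.CleanupParameters()
custom_normalizer = rdMolStandardize.NormalizerFromData(CUSTOM_TRANSFORMS, params)
default_normalizer = rdMolStandardize.Normalizer()

# A charge-separated molecule that the default transform list recombines,
# but that the one-entry custom list has no rule for.
mol = Chem.MolFromSmiles("C[N+](C)=CC=C[O-]")
input_smiles = Chem.MolToSmiles(mol)
default_smiles = Chem.MolToSmiles(default_normalizer.normalize(mol))
custom_smiles = Chem.MolToSmiles(custom_normalizer.normalize(mol))

print("input:  ", input_smiles)
print("default:", default_smiles)
print("custom: ", custom_smiles)
```

Only the transforms present in the custom list are applied, which is why
dropping or tightening a single entry is enough to stop an unwanted match.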

HTH, cheers
p.

On Thu, Jun 24, 2021 at 11:40 AM JP Ebejer <[email protected]> wrote:

> Apologies for taking my sweet time to reply; I went down the standardization
> rabbit-hole and went through most of the material (thanks Matthew and
> Francois, and also the links from other notebooks).  The recording of the
> OpenScience session is excellent and crystal clear as usual, Greg.  I
> enjoyed that.
>
> I have collated code to do the standardization as follows (I am putting
> this here for when my future self searches this list for the same thing in
> 6 years' time*):
>
> 0. Cleanup
> 1. FragmentParent
> 2. Uncharge
> 3. Canonicalize Tautomer
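
The four steps listed above could be chained roughly as follows. This is a
minimal sketch assembled from the step names in the list, not the exact code
from the blog post; the function name standardize is hypothetical.

```python
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize


def standardize(mol):
    """Hypothetical helper chaining steps 0-3 from the list above."""
    # 0. Cleanup: remove Hs, disconnect metals, normalize, reionize.
    clean = rdMolStandardize.Cleanup(mol)
    # 1. FragmentParent: keep the largest fragment (it also runs its own
    #    cleanup by default, so step 0 is partly redundant here).
    parent = rdMolStandardize.FragmentParent(clean)
    # 2. Uncharge: neutralize charges where possible.
    uncharged = rdMolStandardize.Uncharger().uncharge(parent)
    # 3. Canonicalize the tautomer.
    return rdMolStandardize.TautomerEnumerator().Canonicalize(uncharged)


# Sodium acetate should come out as the neutral acid.
result_smiles = Chem.MolToSmiles(standardize(Chem.MolFromSmiles("CC(=O)[O-].[Na+]")))
print(result_smiles)
```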
>
> My only remaining question is whether I should reionize between steps 2 and
> 3.  What do you think?  My opinion is that there is probably no harm in
> doing so (so I should do it).  Earlier, Greg said that cleanup does
> reionization, but perhaps it is worth redoing after the uncharge step?  Or
> is this just a waste of CPU cycles?  Any thoughts?
>
> Also, there is something slightly weird going on.  A (successfully)
> sanitized mol from the SMILES "Cn1c(=O)c2nc[nH][n+](=O)c2n(C)c1=O" starts
> spitting out "can't kekulize" errors when passed to Cleanup(...).  I have
> created a jupyter notebook to highlight this:
> https://nbviewer.jupyter.org/gist/jp-um/7cd80faa794b3545e8aedf838a1e7f6b.
> Any ideas what is going on?  IMHO cleanup should not choke on sanitized
> (correct) molecules.  Is there a way to catch when these errors happen?  As
> a bonus, FragmentParent(...) on the original sanitized molecule also
> exhibits this unexpected behaviour (not shown in the notebook). Could this
> be because it's doing an internal cleanup?
>
> * The exact code is here:
> https://bitsilla.com/blog/2021/06/standardizing-a-molecule-using-rdkit/
>
>
>
>
> On Fri, 18 Jun 2021 at 15:08, Greg Landrum <[email protected]> wrote:
>
>> Hi JP,
>>
>> On Thu, Jun 17, 2021 at 8:37 PM JP Ebejer <[email protected]>
>> wrote:
>>
>>>
>>> I am trying to standardize(/normalize?) some molecules from different
>>> sources, to generate a set of descriptors for them.  I have done this a
>>> number of times, and each time I find the process slightly confusing.  I
>>> have the following questions please, if you don't mind:
>>>
>>>
>> As a starting point, in case you want more information about this topic:
>> I did a webinar/presentation on it earlier this year as part of
>> the RSC Open Science series.
>>
>> My materials for that are on GitHub:
>> https://github.com/greglandrum/RSC_OpenScience_Standardization_202104
>> and there's a YouTube recording:
>> https://www.youtube.com/watch?v=eWTApNX8dJQ
>>
>>
>>
>>> 1.  What is the relation between molvs and rdkit (I remember there was
>>> an integration project between the two a while back).  When I call
>>> rdMolStandardize does rdkit code or molvs code get called?  The github repo
>>> for molvs hasn't been updated in a while (2 yrs), but rdMolStandardize has.
>>>
>>
>> When you call operations from rdMolStandardize it invokes RDKit code.
>> That code was started by Susan Leung as a Google Summer of Code project and
>> we have continued to improve and expand that code since then.
>>
>>
>>> 2.  What is the difference between standardization and normalization of
>>> a molecule?  Does one automatically imply the other, or should both of
>>> these processes be run on a molecule?
>>>
>>
>> I would be surprised if there were universal agreement about this, but
>> when I use the terms, normalization typically refers to making changes to
>> molecules to get "functional groups" (loosely defined) into a normal form,
>> while standardization is getting the molecules into a standard form in
>> preparation for doing something with them. Normalization is often part of
>> standardization; standardization can also include things like stripping
>> salts, neutralizing molecules, etc.
>> Normalization involves applying transformations like converting -N(=O)=O
>> to -[N+](=O)[O-] and converting -[S+2]([O-])[O-] to -S(=O)=O.
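
As a small illustration of the second transformation mentioned above, the
following sketch runs rdMolStandardize.Normalize on a sulfonic acid written in
the charge-separated form. The example molecule is my own choice, and the
sketch assumes the default transform list of your RDKit version still contains
the sulfone rule.

```python
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize

# Methanesulfonic acid drawn with a doubly charged sulfur and two O- atoms.
mol = Chem.MolFromSmiles("C[S+2]([O-])([O-])O")
normalized = rdMolStandardize.Normalize(mol)

normalized_smiles = Chem.MolToSmiles(normalized)
# Reference: the same acid written directly in the S(=O)(=O) form.
reference_smiles = Chem.MolToSmiles(Chem.MolFromSmiles("CS(=O)(=O)O"))

print(Chem.MolToSmiles(mol), "->", normalized_smiles)
```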
>>
>>
>>> 3.  Specifically, what is the difference between
>>> rdMolStandardize.Cleanup(mol), Chem.SanitizeMol(mol), and
>>> rdMolStandardize.Normalize(mol)?  Should I call any of these three manually
>>> after I run "standardization/cleaning operations" such as uncharging,
>>> reionizing, etc?
>>>
>>
>> SanitizeMol() is different from the others: it does a small amount of
>> normalization - fixing groups like nitro which are commonly drawn in a
>> hypervalent state but which can be represented in a charge-separated form
>> without needing weird valences - and some validation - rejecting molecules
>> with atoms that have non-physical valences, rejecting molecules that cannot
>> be kekulized - and a bunch of chemistry perception - ring finding,
>> calculating valences, finding aromatic systems, etc.
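
The normalization and validation roles of SanitizeMol() described above can be
seen in a short sketch. The three example molecules are my own, and the sketch
uses the catchErrors option so that a failure is reported as a flag instead of
an exception.

```python
from rdkit import Chem

# Normalization: a hypervalent nitro group is rewritten in charge-separated form.
nitro = Chem.MolFromSmiles("CN(=O)=O", sanitize=False)
nitro_status = Chem.SanitizeMol(nitro, catchErrors=True)
nitro_smiles = Chem.MolToSmiles(nitro)

# Validation: a five-membered all-carbon "aromatic" ring cannot be kekulized.
bad_ring = Chem.MolFromSmiles("c1cccc1", sanitize=False)
ring_status = Chem.SanitizeMol(bad_ring, catchErrors=True)

# Validation: a carbon atom with five bonds has a non-physical valence.
bad_valence = Chem.MolFromSmiles("CC(C)(C)(C)C", sanitize=False)
valence_status = Chem.SanitizeMol(bad_valence, catchErrors=True)

print(nitro_smiles, nitro_status)
print("ring:", ring_status)
print("valence:", valence_status)
```

With catchErrors=True the return value names the first sanitization step that
failed, which is one way to detect problem molecules without wrapping every
call in a try/except.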
>>
>> rdMolStandardize.Normalize() applies a bunch of standard transformations
>> to a molecule.
>>
>> rdMolStandardize.Cleanup() does a number of standardization operations:
>> - removeHs
>> - disconnect metal atoms
>> - normalize the molecule
>> - reionize the molecule
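
A minimal sketch of what Cleanup() does to a salt drawn with a covalent bond to
the metal (the example molecule is an assumption chosen for illustration):

```python
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize

# Sodium propanoate drawn with a covalent O-Na bond.
mol = Chem.MolFromSmiles("CCC(=O)O[Na]")
cleaned = rdMolStandardize.Cleanup(mol)

cleaned_smiles = Chem.MolToSmiles(cleaned)
# Reference: the ionic form, with the metal disconnected and charges assigned.
expected_smiles = Chem.MolToSmiles(Chem.MolFromSmiles("CCC(=O)[O-].[Na+]"))
fragment_count = len(Chem.GetMolFrags(cleaned))

print(cleaned_smiles, fragment_count)
```

Note that Cleanup() keeps the counterion; removing it is the job of the
fragment-handling steps.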
>>
>>> 4.  I understand what uncharge does, but what does reionizer do?
>>>
>>
>> Reionizing does two things:
>> 1. adds a charge to a small set of free atoms which are likely
>> counterions. These include Na, Mg, Cl, etc.
>> 1a. if the above added a positive charge: remove an H from an acidic
>> group to neutralize the positive charge that was added.
>> 2. Moves negative charges from less acidic groups to more acidic groups.
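
Item 2 above can be illustrated with a molecule that carries the negative
charge on the weaker of two acid groups. This is a sketch with an example
molecule of my own choosing; it does not exercise item 1.

```python
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize

# The charge starts on the sulfinate; the sulfonic acid is the stronger acid.
mol = Chem.MolFromSmiles("C1=C(C=CC(=C1)[S]([O-])=O)[S](O)(=O)=O")
reionized = rdMolStandardize.Reionizer().reionize(mol)

reionized_smiles = Chem.MolToSmiles(reionized)
# Reference: protonated sulfinic acid, deprotonated sulfonate.
expected_smiles = Chem.MolToSmiles(
    Chem.MolFromSmiles("O=S(O)c1ccc(S(=O)(=O)[O-])cc1")
)

print(Chem.MolToSmiles(mol), "->", reionized_smiles)
```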
>>
>>> 5.  Is there a way to chain operations together
>>> standardize+ChooseLargestFragment+uncharge+normalize (am not sure the order
>>> makes sense here), other than creating a class instance for each, calling
>>> the method, returning a new mol and using this mol in the next operation?
>>>
>>
>> The easy "pipeline" type functions in rdMolStandardize are the xxxParent
>> functions.
>> - fragmentParent: cleanup(), pick largest fragment
>> - chargeParent: fragmentParent(); uncharge()
>>
>> Note that this list will be more complete in the 2021.09 release.
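
A minimal sketch of the two parent functions listed above, using the Python
names FragmentParent and ChargeParent from rdMolStandardize (the example salt
is my own choice):

```python
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize

# Sodium benzoate drawn with a covalent O-Na bond.
salt = Chem.MolFromSmiles("[Na]OC(=O)c1ccccc1")

# Cleanup, then keep the largest fragment (still charged).
fragment_smiles = Chem.MolToSmiles(rdMolStandardize.FragmentParent(salt))
# Same as above, followed by neutralization.
charge_smiles = Chem.MolToSmiles(rdMolStandardize.ChargeParent(salt))

benzoate = Chem.MolToSmiles(Chem.MolFromSmiles("O=C([O-])c1ccccc1"))
benzoic_acid = Chem.MolToSmiles(Chem.MolFromSmiles("O=C(O)c1ccccc1"))

print("fragment parent:", fragment_smiles)
print("charge parent:  ", charge_smiles)
```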
>>
>>
>>>
>>> Apologies for the many questions.  Have I missed the documentation about
>>> this?  I have found some excellent examples here:
>>> https://github.com/susanhleung/rdkit/blob/dev/GSOC2018_MolVS_Integration/rdkit/Chem/MolStandardize/tutorial/MolStandardize.ipynb
>>> (thanks!).  This is not exactly a cleaning pipeline, but still quite
>>> helpful to understand these methods.
>>>
>>>
>> The github link I provide above has some more up-to-date information
>> about what the code currently does.
>> This all needs to land in the RDKit documentation.
>>
>> -greg
>>
>>
>
> --
>
> <https://www.um.edu.mt/>
>
> Dr Jean-Paul Ebejer | Senior Lecturer
>
> BSc (Hons) (Melita), MSc (Imperial), DPhil (Oxon.)
>
> Centre for Molecular Medicine and Biobanking
>
> Office 320, Biomedical Sciences Building,
>
> University of Malta, Msida, MSD 2080.  MALTA.
>
> T: (00356) 2340 3263
>
>
> *Associate Member*
>
> Department of Artificial Intelligence
>
>
> Where am I? <https://bitsilla.com/blog/where-to-find-me/>
>
>
> _______________________________________________
> Rdkit-discuss mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>
