Hi JP,

the problem is caused by the reaction SMARTS that standardizes pyridine *N*-oxides: it is not specific enough and also matches your molecule, which is not actually an *N*-oxide but rather an *N*-hydroxypyridinium ion. I will submit a PR to fix the reaction pattern; in the meantime you can work around the problem by loading a custom list of normalization reaction SMARTS, as shown in this gist:
https://gist.github.com/ptosco/2b19142ff8fd6afdfee12836cec73d4f

HTH, cheers
p.

On Thu, Jun 24, 2021 at 11:40 AM JP Ebejer <jean.p.ebe...@um.edu.mt> wrote:

> Apologies I took my sweet time to reply, I went down the standardization
> rabbit-hole and went through most of the material (thanks Matthew and
> Francois, but also links from other notebooks).  The recording of the
> OpenScience session is excellent and crystal clear as usual Greg.  I
> enjoyed that.
>
> I have collated code to do the standardization as follows (I am putting
> this here, for when my future self searches this list for the same thing in
> 6 years time*):
>
> 0. Cleanup
> 1. FragmentParent
> 2. Uncharge
> 3. Canonicalize Tautomer
>
> My only question left, is whether I should reionize between steps 2 and
> 3.  What do you think?  My opinion is, probably, that there is no harm in
> doing so (so I should do it).  Earlier, Greg said that cleanup does
> reionization, but perhaps it is worth redoing after the uncharge step?  Or
> is this just a waste of CPU cycles?  Any thoughts?
>
> Also, there is something slightly weird going on.  A (successfully)
> sanitized mol from SMILES "Cn1c(=O)c2nc[nH][n+](=O)c2n(C)c1=O", which when
> passed to Cleanup(...) starts spitting out can't kekulize errors.  I have
> created a jupyter notebook to highlight this;
> https://nbviewer.jupyter.org/gist/jp-um/7cd80faa794b3545e8aedf838a1e7f6b.
> Any ideas what is going on?  IMHO cleanup should not choke on sanitized
> (correct) molecules.  Is there a way to catch when these errors happen?  As
> a bonus, FragmentParent(...) on the original sanitized molecule also
> exhibits this unexpected behaviour (not shown in the notebook).  Could this
> be because it's doing an internal cleanup?
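
For the archive, here is a minimal sketch of the four steps listed above, wrapped so that a failure in any step is caught instead of propagating. The function name `standardize_smiles` is made up for illustration. Catching a generic `Exception` is an assumption about how kekulization failures surface; depending on the RDKit version, some problems may only be written to the log. This is not the code from the blog post, just one possible arrangement of the same calls.

```python
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize


def standardize_smiles(smiles):
    """Return a standardized canonical SMILES, or None if any step fails.

    Hypothetical helper: the name and the None-on-failure convention are
    illustrative choices, not part of RDKit.
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        # RDKit could not parse or sanitize the input.
        return None
    try:
        # 0. Cleanup: remove Hs, disconnect metals, normalize, reionize.
        mol = rdMolStandardize.Cleanup(mol)
        # 1. FragmentParent: keep the largest fragment.
        mol = rdMolStandardize.FragmentParent(mol)
        # 2. Uncharge: neutralize the molecule where possible.
        mol = rdMolStandardize.Uncharger().uncharge(mol)
        # 3. Canonicalize the tautomer.
        mol = rdMolStandardize.TautomerEnumerator().Canonicalize(mol)
    except Exception:
        # Assumption: errors such as "can't kekulize" are raised as
        # Python exceptions by the step that hits them.
        return None
    return Chem.MolToSmiles(mol)


if __name__ == "__main__":
    # Sodium acetate is used here purely as an illustrative input.
    print(standardize_smiles("[Na+].CC(=O)[O-]"))
```

Returning `None` keeps a batch run going when a single molecule fails; logging the offending SMILES inside the `except` branch would make the failures easier to review afterwards.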
>
> * The exact code is here:
> https://bitsilla.com/blog/2021/06/standardizing-a-molecule-using-rdkit/
>
> On Fri, 18 Jun 2021 at 15:08, Greg Landrum <greg.land...@gmail.com> wrote:
>
>> Hi JP,
>>
>> On Thu, Jun 17, 2021 at 8:37 PM JP Ebejer <jean.p.ebe...@um.edu.mt>
>> wrote:
>>
>>> I am trying to standardize(/normalize?) some molecules from different
>>> sources, to generate a set of descriptors for them.  I have done this a
>>> number of times, and each time I find the process slightly confusing.  I
>>> have the following questions please, if you don't mind:
>>>
>> As a starting point, in case you want more information about this topic:
>> I did a webinar/presentation on it earlier this year as part of
>> the RSC Open Science series.
>>
>> My materials for that are in github:
>> https://github.com/greglandrum/RSC_OpenScience_Standardization_202104
>> and there's a youtube recording:
>> https://www.youtube.com/watch?v=eWTApNX8dJQ
>>
>>> 1. What is the relation between molvs and rdkit (I remember there was
>>> an integration project between the two a while back)?  When I call
>>> rdMolStandardize does rdkit code or molvs code get called?  The github repo
>>> for molvs hasn't been updated in a while (2 yrs), but rdMolStandardize has.
>>>
>> When you call operations from rdMolStandardize it invokes RDKit code.
>> That code was started by Susan Leung as a Google Summer of Code project and
>> we have continued to improve and expand that code since then.
>>
>>> 2. What is the difference between standardization and normalization of
>>> a molecule?  Does one automatically imply the other or should these two
>>> processes be both run on a molecule?
>>>
>> I would be surprised if there were universal agreement about this, but
>> when I use the terms, normalization typically refers to making changes to
>> molecules to get "functional groups" (loosely defined) into a normal form,
>> while standardization is getting the molecules into a standard form in
>> preparation for doing something with them.  Normalization is often part of
>> standardization; standardization can also include things like stripping
>> salts, neutralizing molecules, etc.
>> Normalization involves applying transformations like converting -N(=O)=O
>> to -[N+](=O)[O-] and converting -[S+2]([O-])[O-] to -S(=O)=O.
>>
>>> 3. Specifically, what is the difference between
>>> rdMolStandardize.Cleanup(mol), Chem.SanitizeMol(mol),
>>> rdMolStandardize.Normalize(mol)?  Should I call any of these three manually
>>> after I run "standardization/cleaning operations" such as uncharging,
>>> reionizing, etc?
>>>
>> SanitizeMol() is different from the others: it does a small amount of
>> normalization - fixing groups like nitro which are commonly drawn in a
>> hypervalent state but which can be represented in a charge-separated form
>> without needing weird valences - and some validation - rejecting molecules
>> with atoms that have non-physical valences, rejecting molecules that cannot
>> be kekulized - and a bunch of chemistry perception - ring finding,
>> calculating valences, finding aromatic systems, etc.
>>
>> rdMolStandardize.Normalize() applies a bunch of standard transformations
>> to a molecule.
>>
>> rdMolStandardize.Cleanup() does a number of standardization operations:
>>   - removeHs
>>   - disconnect metal atoms
>>   - normalize the molecule
>>   - reionize the molecule
>>
>>> 4. I understand what uncharge does, but what does reionizer do?
>>>
>> Reionizing does two things:
>>   1. Adds a charge to a small set of free atoms which are likely
>>      counterions.  These include Na, Mg, Cl, etc.
>>      1a. If the above added a positive charge: removes an H from an acidic
>>          group to neutralize the positive charge that was added.
>>   2. Moves negative charges from less acidic groups to more acidic groups.
>>
>>> 5. Is there a way to chain operations together
>>> standardize+ChooseLargestFragment+uncharge+normalize (am not sure the order
>>> makes sense here), other than creating a class instance for each calling
>>> the method, returning a new mol and using this mol in the next operation?
>>>
>> The easy "pipeline" type functions in rdMolStandardize are the xxxParent
>> functions.
>>   - fragmentParent: cleanup(), pick largest fragment
>>   - chargeParent: fragmentParent(); uncharge()
>>
>> Note that this list will be more complete in the 2021.09 release.
>>
>>> Apologies for the many questions.  Have I missed the documentation about
>>> this?  I have found some excellent examples here:
>>> https://github.com/susanhleung/rdkit/blob/dev/GSOC2018_MolVS_Integration/rdkit/Chem/MolStandardize/tutorial/MolStandardize.ipynb
>>> (thanks!).  This is not exactly a cleaning pipeline, but still quite
>>> helpful to understand these methods.
>>>
>> The github link I provide above has some more up-to-date information
>> about what the code currently does.
>> This all needs to land in the RDKit documentation.
>>
>> -greg
>>
>
> --
> <https://www.um.edu.mt/>
> Dr Jean-Paul Ebejer | Senior Lecturer
> BSc (Hons) (Melita), MSc (Imperial), DPhil (Oxon.)
> Centre for Molecular Medicine and Biobanking
> Office 320, Biomedical Sciences Building,
> University of Malta, Msida, MSD 2080. MALTA.
> T: (00356) 2340 3263
>
> *Associate Member*
> Department of Artificial Intelligence
>
> Where am I? <https://bitsilla.com/blog/where-to-find-me/>
> <https://twitter.com/dr_jpe> <https://bitsilla.com/blog/> <https://github.com/jp-um>
>
> _______________________________________________
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
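
P.S. A short sketch of the xxxParent "pipeline" functions described in the quoted reply above, in case it helps someone searching the archive. Sodium acetate is just an illustrative input, and the variable names are arbitrary; the only RDKit calls used are `FragmentParent` and `ChargeParent` from `rdMolStandardize`.

```python
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize

# Illustrative input: sodium acetate written as two charged fragments.
mol = Chem.MolFromSmiles("[Na+].CC(=O)[O-]")

# fragmentParent: cleanup, then pick the largest fragment.
fragment_parent = rdMolStandardize.FragmentParent(mol)

# chargeParent: fragment parent, then uncharge.
charge_parent = rdMolStandardize.ChargeParent(mol)

fragment_smiles = Chem.MolToSmiles(fragment_parent)
charge_smiles = Chem.MolToSmiles(charge_parent)
print(fragment_smiles, charge_smiles)
```

Both functions return a new molecule and leave the input untouched, so the two results can be compared side by side.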