Apologies for taking my sweet time to reply; I went down the standardization rabbit hole and went through most of the material (thanks Matthew and Francois, and also for the links from other notebooks). The recording of the OpenScience session is excellent and crystal clear, as usual, Greg. I enjoyed that.

I have collated code to do the standardization as follows (I am putting this here for when my future self searches this list for the same thing in 6 years' time*):

0. Cleanup
1. FragmentParent
2. Uncharge
3. Canonicalize Tautomer

My only remaining question is whether I should reionize between steps 2 and 3. What do you think? My opinion is that there is probably no harm in doing so (so I should do it). Earlier, Greg said that Cleanup does reionization, but perhaps it is worth redoing after the uncharge step? Or is this just a waste of CPU cycles? Any thoughts?

Also, there is something slightly weird going on. A (successfully) sanitized mol from the SMILES "Cn1c(=O)c2nc[nH][n+](=O)c2n(C)c1=O" starts spitting out "can't kekulize" errors when passed to Cleanup(...). I have created a Jupyter notebook to highlight this: https://nbviewer.jupyter.org/gist/jp-um/7cd80faa794b3545e8aedf838a1e7f6b. Any ideas what is going on? IMHO Cleanup should not choke on sanitized (correct) molecules. Is there a way to catch when these errors happen?

As a bonus, FragmentParent(...) on the original sanitized molecule also exhibits this unexpected behaviour (not shown in the notebook). Could this be because it's doing an internal cleanup?

* The exact code is here: https://bitsilla.com/blog/2021/06/standardizing-a-molecule-using-rdkit/

On Fri, 18 Jun 2021 at 15:08, Greg Landrum <greg.land...@gmail.com> wrote:

> Hi JP,
>
> On Thu, Jun 17, 2021 at 8:37 PM JP Ebejer <jean.p.ebe...@um.edu.mt> wrote:
>
>>
>> I am trying to standardize(/normalize?) some molecules from different
>> sources, to generate a set of descriptors for them. I have done this a
>> number of times, and each time I find the process slightly confusing. I
>> have the following questions please, if you don't mind:
>>
>
> As a starting point, in case you want more information about this topic:
> I did a webinar/presentation on this topic earlier this year as part of
> the RSC Open Science series.
>
> My materials for that are in github:
> https://github.com/greglandrum/RSC_OpenScience_Standardization_202104
> and there's a youtube recording:
> https://www.youtube.com/watch?v=eWTApNX8dJQ
>
>> 1. What is the relation between molvs and rdkit (I remember there was an
>> integration project between the two a while back). When I call
>> rdMolStandardize does rdkit code or molvs code get called? The github repo
>> for molvs hasn't been updated in a while (2 yrs), but rdMolStandardize has.
>>
>
> When you call operations from rdMolStandardize it invokes RDKit code. That
> code was started by Susan Leung as a Google Summer of Code project and we
> have continued to improve and expand that code since then.
>
>> 2. What is the difference between standardization and normalization of a
>> molecule? Does one automatically imply the other or should these two
>> processes be both run on a molecule?
>>
>
> I would be surprised if there were universal agreement about this, but
> when I use the terms, normalization typically refers to making changes to
> molecules to get "functional groups" (loosely defined) into a normal form,
> while standardization is getting the molecules into a standard form in
> preparation for doing something with them. Normalization is often part of
> standardization; standardization can also include things like stripping
> salts, neutralizing molecules, etc.
> Normalization involves applying transformations like converting -N(=O)=O
> to -[N+](=O)[O-] and converting -[S+2]([O-])[O-] to -S(=O)=O.
>
>> 3. Specifically, what is the difference between
>> rdMolStandardize.Cleanup(mol), Chem.SanitizeMol(mol),
>> rdMolStandardize.Normalize(mol). Should I call any of these three manually
>> after I run "standardization/cleaning operations" such as uncharging,
>> reionizing, etc?
>>
>
> SanitizeMol() is different from the others: it does a small amount of
> normalization - fixing groups like nitro which are commonly drawn in a
> hypervalent state but which can be represented in a charge-separated form
> without needing weird valences - and some validation - rejecting molecules
> with atoms that have non-physical valences, rejecting molecules that cannot
> be kekulized - and a bunch of chemistry perception - ring finding,
> calculating valences, finding aromatic systems, etc.
>
> rdMolStandardize.Normalize() applies a bunch of standard transformations
> to a molecule.
>
> rdMolStandardize.Cleanup() does a number of standardization operations:
> - removeHs
> - disconnect metal atoms
> - normalize the molecule
> - reionize the molecule
>
>> 4. I understand what uncharge does, but what does reionizer do?
>>
>
> Reionizing does two things:
> 1. adds a charge to a small set of free atoms which are likely
> counterions. These include Na, Mg, Cl, etc.
> 1a. if the above added a positive charge: remove an H from an acidic group
> to neutralize the positive charge that was added.
> 2. Moves negative charges from less acidic groups to more acidic groups.
>
>> 5. Is there a way to chain operations together
>> standardize+ChooseLargestFragment+uncharge+normalize (am not sure the order
>> makes sense here), other than creating a class instance for each, calling
>> the method, returning a new mol and using this mol in the next operation?
>>
>
> The easy "pipeline" type functions in rdMolStandardize are the xxxParent
> functions.
> - fragmentParent: cleanup(), pick largest fragment
> - chargeParent: fragmentParent(); uncharge()
>
> Note that this list will be more complete in the 2021.09 release.
>
>>
>> Apologies for the many questions. Have I missed the documentation about
>> this?
>> I have found some excellent examples here:
>> https://github.com/susanhleung/rdkit/blob/dev/GSOC2018_MolVS_Integration/rdkit/Chem/MolStandardize/tutorial/MolStandardize.ipynb
>> (thanks!). This is not exactly a cleaning pipeline, but still quite
>> helpful to understand these methods.
>>
>
> The github link I provide above has some more up-to-date information about
> what the code currently does.
> This all needs to land in the RDKit documentation
>
> -greg
>

--
<https://www.um.edu.mt/>
Dr Jean-Paul Ebejer | Senior Lecturer
BSc (Hons) (Melita), MSc (Imperial), DPhil (Oxon.)
Centre for Molecular Medicine and Biobanking
Office 320, Biomedical Sciences Building,
University of Malta, Msida, MSD 2080. MALTA.
T: (00356) 2340 3263

*Associate Member*
Department of Artificial Intelligence

Where am I? <https://bitsilla.com/blog/where-to-find-me/>
<https://twitter.com/dr_jpe> <https://bitsilla.com/blog/> <https://github.com/jp-um>
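
For the archive, here is a minimal sketch of the four-step sequence listed at the top of this message, with the reionize step I am asking about left as an optional switch. It is not the exact code from the blog post linked above. The helper names run_steps and rdkit_steps are hypothetical; the RDKit calls are the ones discussed in this thread (Cleanup, FragmentParent, Uncharger, Reionize, TautomerEnumerator). Note the try/except only reports failures that are raised as Python exceptions; messages that RDKit merely writes to its log are not caught this way.

```python
from typing import Any, Callable, List, Optional, Tuple

# A step is a (label, function) pair; the function takes a mol and returns a mol.
Step = Tuple[str, Callable[[Any], Any]]


def run_steps(mol: Any, steps: List[Step]) -> Tuple[Optional[Any], Optional[str]]:
    """Apply the steps in order.

    Returns (result, None) on success, or (None, "<label>: <message>") for the
    first step that raises an exception.
    """
    for label, func in steps:
        try:
            mol = func(mol)
        except Exception as exc:  # only failures raised as exceptions end up here
            return None, f"{label}: {exc}"
    return mol, None


def rdkit_steps(reionize: bool = False) -> List[Step]:
    """Build Cleanup -> FragmentParent -> Uncharge [-> Reionize] -> canonical tautomer."""
    # Imported lazily so that run_steps itself does not depend on RDKit.
    from rdkit.Chem.MolStandardize import rdMolStandardize

    uncharger = rdMolStandardize.Uncharger()
    tautomers = rdMolStandardize.TautomerEnumerator()
    steps: List[Step] = [
        ("cleanup", rdMolStandardize.Cleanup),
        # Per the quoted reply, FragmentParent runs cleanup itself as well.
        ("fragment_parent", rdMolStandardize.FragmentParent),
        ("uncharge", uncharger.uncharge),
    ]
    if reionize:
        # Optional extra step between uncharging and tautomer canonicalization.
        steps.append(("reionize", rdMolStandardize.Reionize))
    steps.append(("canonical_tautomer", tautomers.Canonicalize))
    return steps
```

With RDKit installed, calling run_steps(Chem.MolFromSmiles(smiles), rdkit_steps(reionize=True)) returns the standardized mol, or None plus the label of the step that raised. Running it once with reionize=False and once with reionize=True and comparing the output SMILES is one way to see whether the extra step changes anything for a given dataset.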
_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss