Hi JP,

On Thu, Jun 17, 2021 at 8:37 PM JP Ebejer <jean.p.ebe...@um.edu.mt> wrote:
> I am trying to standardize(/normalize?) some molecules from different
> sources, to generate a set of descriptors for them. I have done this a
> number of times, and each time I find the process slightly confusing. I
> have the following questions please, if you don't mind:

As a starting point, in case you want more information about this topic:
I did a webinar/presentation on it earlier this year as part of the RSC
Open Science series. My materials for that are on GitHub:
https://github.com/greglandrum/RSC_OpenScience_Standardization_202104
and there's a YouTube recording:
https://www.youtube.com/watch?v=eWTApNX8dJQ

> 1. What is the relation between molvs and rdkit (I remember there was an
> integration project between the two a while back). When I call
> rdMolStandardize does rdkit code or molvs code get called? The github repo
> for molvs hasn't been updated in a while (2 yrs), but rdMolStandardize has.

When you call operations from rdMolStandardize it invokes RDKit code. That
code was started by Susan Leung as a Google Summer of Code project and we
have continued to improve and expand it since then.

> 2. What is the difference between standardization and normalization of a
> molecule? Does one automatically imply the other or should these two
> processes be both run on a molecule?

I would be surprised if there were universal agreement about this, but
when I use the terms, normalization typically refers to making changes to
molecules to get "functional groups" (loosely defined) into a normal form,
while standardization is getting the molecules into a standard form in
preparation for doing something with them. Normalization is often part of
standardization; standardization can also include things like stripping
salts, neutralizing molecules, etc.
Normalization involves applying transformations like converting -N(=O)=O
to -[N+](=O)[O-] and converting -[S+2]([O-])[O-] to -S(=O)=O.

> 3. Specifically, what is the difference between
> rdMolStandardize.Cleanup(mol), Chem.SanitizeMol(mol),
> rdMolStandardize.Normalize(mol). Should I call any of these three manually
> after I run "standardization/cleaning operations" such as uncharging,
> reionizing, etc?

SanitizeMol() is different from the others. It does:
- a small amount of normalization: fixing groups like nitro, which are
  commonly drawn in a hypervalent state but which can be represented in a
  charge-separated form without needing weird valences
- some validation: rejecting molecules with atoms that have non-physical
  valences, rejecting molecules that cannot be kekulized
- a bunch of chemistry perception: ring finding, calculating valences,
  finding aromatic systems, etc.

rdMolStandardize.Normalize() applies a bunch of standard transformations
to a molecule.

rdMolStandardize.Cleanup() does a number of standardization operations:
- removeHs
- disconnect metal atoms
- normalize the molecule
- reionize the molecule

> 4. I understand what uncharge does, but what does reionizer do?

Reionizing does two things:
1. Adds a charge to a small set of free atoms which are likely
   counterions. These include Na, Mg, Cl, etc.
   1a. If the above added a positive charge: removes an H from an acidic
       group to neutralize the positive charge that was added.
2. Moves negative charges from less acidic groups to more acidic groups.

> 5. Is there a way to chain operations together
> standardize+ChooseLargestFragment+uncharge+normalize (am not sure the order
> makes sense here), other than creating a class instance for each calling
> the method, returning a new mol and using this mol in the next operation?

The easy "pipeline" type functions in rdMolStandardize are the xxxParent
functions:
- fragmentParent: cleanup(), pick largest fragment
- chargeParent: fragmentParent(); uncharge()

Note that this list will be more complete in the 2021.09 release.

> Apologies for the many questions. Have I missed the documentation about
> this? I have found some excellent examples here:
> https://github.com/susanhleung/rdkit/blob/dev/GSOC2018_MolVS_Integration/rdkit/Chem/MolStandardize/tutorial/MolStandardize.ipynb
> (thanks!). This is not exactly a cleaning pipeline, but still quite
> helpful to understand these methods.

The GitHub link I provided above has some more up-to-date information
about what the code currently does. This all needs to land in the RDKit
documentation.

-greg
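
For question 5, here is a minimal sketch of how the steps discussed above
could be chained by hand and compared against the ChargeParent convenience
function. The helper names chain(), build_pipeline() and
standardize_smiles() are hypothetical and not part of RDKit; only the
rdMolStandardize and Chem calls are RDKit API. The step order follows the
description of chargeParent given above (cleanup, largest fragment,
uncharge); treat it as an illustration, not as a statement that the two
routes always give identical output.

```python
from functools import reduce


def chain(*steps):
    """Compose single-argument steps (mol -> mol) into one callable.

    Steps are applied left to right. Hypothetical helper, not RDKit API.
    """
    def run(mol):
        return reduce(lambda current, step: step(current), steps, mol)
    return run


def build_pipeline():
    """Cleanup -> largest fragment -> uncharge, built from rdMolStandardize.

    RDKit is imported inside the function so chain() itself carries no
    RDKit dependency.
    """
    from rdkit.Chem.MolStandardize import rdMolStandardize

    chooser = rdMolStandardize.LargestFragmentChooser()
    uncharger = rdMolStandardize.Uncharger()
    # Each step takes a mol and returns a new mol.
    return chain(rdMolStandardize.Cleanup, chooser.choose, uncharger.uncharge)


def standardize_smiles(smiles):
    """Return (hand-chained SMILES, ChargeParent SMILES) for comparison."""
    from rdkit import Chem
    from rdkit.Chem.MolStandardize import rdMolStandardize

    mol = Chem.MolFromSmiles(smiles)  # MolFromSmiles sanitizes by default
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")

    by_hand = build_pipeline()(mol)
    parent = rdMolStandardize.ChargeParent(mol)
    return Chem.MolToSmiles(by_hand), Chem.MolToSmiles(parent)
```

Calling standardize_smiles() on a salt written as two fragments (for
example a carboxylate with a sodium counterion) returns two SMILES strings
that can be compared directly. No expected output is listed here because it
may depend on the RDKit version in use.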
_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss