Re: [Rdkit-discuss] Get full matrix from GetEuclideanDistMat
Hi Lorenzo,

As you've discovered, GetEuclideanDistMat() returns just one triangle of the matrix (the lower one) as a 1D vector. I haven't tried to convert this back into an actual symmetric matrix (at least I don't think I have), but it does look like using np.tri works. That only sets the lower triangle, so you also need to add on the transpose.

Maybe try something like this (I've also simplified the calculation of n):

    import numpy as np

    lower = GetEuclideanDistMat(descriptors.values)
    n = len(descriptors.values)                # number of compounds (rows)
    mask = np.tri(n, dtype=bool, k=-1)         # True strictly below the diagonal
    distances = np.zeros((n, n), dtype=float)
    distances[mask] = lower                    # fill the lower triangle
    distances += distances.transpose()         # mirror it into the upper triangle

Note that if you have scikit-learn installed, it's *much* easier to use:
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.euclidean_distances.html

-greg

On Fri, Oct 4, 2019 at 5:14 PM Lorenzo Fabbri via Rdkit-discuss <rdkit-discuss@lists.sourceforge.net> wrote:

> I have a matrix of descriptors and I want to use GetEuclideanDistMat to
> get the pairwise Euclidean distances. Once I compute it, I need to create a
> full matrix (number of compounds x number of compounds) from the 1D vector.
> I'm currently using
>
>     lower = GetEuclideanDistMat(descriptors.values)
>     n = int(np.sqrt(len(lower)*2)) + 1
>     mask = np.tri(n, dtype=bool, k=-1)
>     distances = np.zeros((n, n), dtype=float)
>     distances[mask] = lower
>
> It seems to be working but I'm getting some weird results (very small
> distances for very different compounds), so I'm guessing I'm doing
> something wrong with the code above. Any suggestion?

___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
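The np.tri approach described above can be checked on a small case without RDKit. The sketch below is only an illustration: condensed_lower() is a hypothetical stand-in for the 1D vector, and it assumes the vector lists the strict lower triangle row by row, i.e. (1,0), (2,0), (2,1), ..., which is the order that boolean-mask assignment fills.

```python
import numpy as np

def condensed_lower(points):
    """Strict lower triangle of the Euclidean distance matrix, row by row.

    Hypothetical stand-in for the 1D vector discussed above; the
    row-by-row ordering (1,0), (2,0), (2,1), ... is an assumption.
    """
    n = len(points)
    return np.array([np.linalg.norm(points[i] - points[j])
                     for i in range(1, n) for j in range(i)])

def to_square(lower, n):
    """Expand a condensed strict-lower-triangle vector into a full matrix."""
    mask = np.tri(n, k=-1, dtype=bool)   # True strictly below the diagonal
    full = np.zeros((n, n), dtype=float)
    full[mask] = lower                   # boolean assignment fills row by row
    return full + full.T                 # mirror into the upper triangle

points = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0], [0.0, 1.0]])
square = to_square(condensed_lower(points), len(points))
```

Without the final transpose step the upper triangle stays at zero, which would look like "very small distances" between unrelated compounds whenever an entry above the diagonal is read.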
Re: [Rdkit-discuss] how to handle metallocenes?
Hi Michal/Greg,

Many thanks for your thoughts. The compounds are from PubChem's Substances. I'm inclined to filter out these types of molecules, but this may be hard to do with billions of compounds...? What would be an efficient way to parse drug-like compounds and reject organometallics? Clearly checking every atom/bond is too expensive.

Best,
Mike
Re: [Rdkit-discuss] how to handle metallocenes?
Dear Mike,

Try changing all metal-ligand bonds to "dative" or "ionic", and standardize afterwards (but disable adjusting of implicit Hs). This way, I was able to process (in KNIME) >99% of organometallics (incl. metallocenes) downloaded from Reaxys.

Example snippet (which doesn't check the "directionality" of the bond, though):

    from rdkit import Chem
    import pandas as pd

    metals = ['Ti', 'Al', 'Mo', 'Ru', 'Co', 'Rh', 'Ir', 'Ni', 'Zr', 'Hf', 'W']
    outmols = []
    mols = input_table['Molecule']   # input_table is provided by the KNIME node
    for mol in mols:
        for bond in mol.GetBonds():
            # a bond counts as metal-ligand if either end is a listed metal
            if bond.GetEndAtom().GetSymbol() in metals or bond.GetBeginAtom().GetSymbol() in metals:
                print("found metal-ligand bond")
                print("original type: " + str(bond.GetBondType()))
                btype = Chem.rdchem.BondType.DATIVE
                bond.SetBondType(btype)
                print("changed to: " + str(mol.GetBonds()[bond.GetIdx()].GetBondType()))
        try:
            # run every sanitization step except the adjustment of Hs
            Chem.SanitizeMol(mol, sanitizeOps=Chem.SanitizeFlags.SANITIZE_ALL ^ Chem.SanitizeFlags.SANITIZE_ADJUSTHS)
        except ValueError as ve:
            print("Sanitization failed")
            print(ve)
    output_table = input_table.copy()

Best,
Michal
Re: [Rdkit-discuss] how to handle metallocenes?
Hi Mike,

I think you mean "organometallics", not "metallocenes" (the two molecules in that SDF are coordination complexes, but neither is a metallocene; I stopped looking after that). The compounds are also drawn in such a way that they are chemically unreasonable. This is pretty typical for organometallics in V2000 mol files.

Unless you have a reliable source of input molecules and/or are willing to look at every one, I would just filter anything that has a metal-nonmetal bond out of the dataset.

If you really want to do something with the molecules:
The rdMolStandardize code, which is derived from MolVS, currently has one approach for dealing with this type of complex: breaking all the covalent bonds to the metal (this is also what InChI does). Given what a mess these compounds are when they show up in most standard file formats, this seems like a reasonable thing to do:

    In [4]: from rdkit import Chem

    In [5]: from rdkit.Chem.MolStandardize import rdMolStandardize

    In [6]: dcon = rdMolStandardize.MetalDisconnector()
    [14:34:03] Initializing MetalDisconnector

    In [8]: suppl = Chem.SDMolSupplier('/home/glandrum/Downloads/RDKit_input.sdf',sanitize=False,removeHs=False)

    In [9]: m = suppl[0]

    In [10]: om = dcon.Disconnect(m)
    [14:34:29] Running MetalDisconnector
    [14:34:29] Removed covalent bond between Tc and O
    [14:34:29] Removed covalent bond between Tc and O
    [14:34:29] Removed covalent bond between Tc and S
    [14:34:29] Removed covalent bond between Tc and S
    [14:34:29] Removed covalent bond between Tc and P
    [14:34:29] Removed covalent bond between Tc and P

    In [11]: Chem.SanitizeMol(om)
    Out[11]: rdkit.Chem.rdmolops.SanitizeFlags.SANITIZE_NONE

    In [12]: Chem.MolToSmiles(om)
    Out[12]: 'CSCC[C@@H](NC(=O)[C@@H](CC(C)C)NC(=O)[C@@H](Cc1cnc[nH]1)NC(=O)CNC(=O)[C@H](NC(=O)[C@@H](C)NC(=O)[C@H](CC(=O)[C@@H](CCC(N)=O)NC(=O)NC(=O)C(CC[SH-]CCC[PH-](CO)CO)[SH-]CCC[PH-](CO)CO)c1cc2c2[nH]1)C(C)C)C(N)=O.[99Tc+9].[Cl-].[O-2].[O-2]'

It's worth noting that this molecule is still a long way from making chemical sense: the +9 charge on the Tc and the [SH-] and [PH-] groups are not sensible. So there's more manual fixing required here.

Best,
-greg
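The simpler suggestion above, dropping anything that has a metal-nonmetal bond, could be sketched roughly as follows. This is only an illustration and not part of rdMolStandardize: the METALS set is an illustrative, non-exhaustive assumption, and bond_symbols() and has_metal_nonmetal_bond() are hypothetical helper names. The helpers only call GetBonds(), GetBeginAtom(), GetEndAtom() and GetSymbol(), so plain tuples are used in the demo lines to keep the sketch runnable without RDKit.

```python
# Illustrative, non-exhaustive set of metal symbols (an assumption; extend as needed).
METALS = frozenset("""
    Al Sc Ti V Cr Mn Fe Co Ni Cu Zn Ga
    Y Zr Nb Mo Tc Ru Rh Pd Ag Cd In Sn
    Hf Ta W Re Os Ir Pt Au Hg Tl Pb Bi
""".split())

def bond_symbols(mol):
    """Yield (begin, end) element symbols for each bond of an RDKit-style molecule."""
    for bond in mol.GetBonds():
        yield bond.GetBeginAtom().GetSymbol(), bond.GetEndAtom().GetSymbol()

def has_metal_nonmetal_bond(symbol_pairs, metals=METALS):
    """True if any bond joins exactly one metal atom to a non-metal atom."""
    return any((a in metals) != (b in metals) for a, b in symbol_pairs)

# Plain tuples stand in for bonds here so the sketch runs without RDKit.
ethanol = [("C", "C"), ("C", "O")]
tc_complex = [("Tc", "O"), ("Tc", "S"), ("C", "S")]
flags = [has_metal_nonmetal_bond(ethanol), has_metal_nonmetal_bond(tc_complex)]
```

With RDKit, a molecule read with sanitize=False (as in the SDMolSupplier call above) would be tested with has_metal_nonmetal_bond(bond_symbols(mol)). The check stops at the first offending bond, so it is a single pass over the bonds at worst and is likely cheap next to parsing the record in the first place.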
[Rdkit-discuss] how to handle metallocenes?
Hello RDKit experts!

Is there a function to handle metallocenes in the standardizer?

I've enclosed some examples of compounds.

Thanks,
mike

RDKit_input.sdf
Description: chemical/mdl-sdfile
Re: [Rdkit-discuss] Saving chains from PDB file
If splitting is the only thing you would like to do, then turning off the sanitization should make it super fast in RDKit too.

On Mon, 7 Oct 2019, 10:31, Téletchéa Stéphane <stephane.teletc...@univ-nantes.fr> wrote:

> Rdkit needs to wrap a lot of atom definitions to load the pdb file
> properly, and it takes time (minutes on my machine, which is a decent
> workstation :-).
> It will be lightning fast using Bio.PDB or prody, compared to rdkit.
Re: [Rdkit-discuss] Saving chains from PDB file
On 05/10/2019 at 12:46, Chris Swain via Rdkit-discuss wrote:

> Hi,
>
> I have a number of PDB files (foo.pdb.gz) and I want to separate each chain in each file out into a separate file. So if a file contains 4 chains it will generate 4 separate files.
>
> Can I do this using RDKit, if so how?
>
> Cheers
>
> Chris

Dear Chris,

Even though this could be performed in RDKit, I would recommend doing it with an external tool, for instance Biopython and its Bio.PDB module (https://biopython.org/wiki/The_Biopython_Structural_Bioinformatics_FAQ), or even ProDy (http://prody.csb.pitt.edu/).

RDKit needs to wrap a lot of atom definitions to load the PDB file properly, and it takes time (minutes on my machine, which is a decent workstation :-).
It will be lightning fast using Bio.PDB or ProDy, compared to RDKit.

If you still want to use RDKit only, and need to reuse the RDKit representation of the PDB file, then pickle it (cPickle on Python 2):

    try:
        import cPickle as pickle   # Python 2
    except ImportError:
        import pickle              # Python 3
    from rdkit import Chem

    def processReceptor(r):
        try:
            # reuse the cached molecule if a pickle already exists
            with open('receptor.pkl', 'rb') as h:
                receptor = pickle.load(h)
        except Exception:
            # otherwise parse the PDB file and cache the result
            receptor = Chem.MolFromPDBFile(r)
            with open('receptor.pkl', 'wb') as f:
                pickle.dump(receptor, f)
        return receptor

HTH,

Stéphane

--
Assistant Professor in BioInformatics, UFIP, UMR 6286 CNRS, Team Protein Design In Silico
UFR Sciences et Techniques, 2, rue de la Houssinière, Bât. 25, 44322 Nantes cedex 03, France
Tél : +33 251 125 636 / Fax : +33 251 125 632
http://www.ufip.univ-nantes.fr/ - http://www.steletch.org
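For the original question (one output file per chain from foo.pdb.gz), a plain-text split is another option. To be clear, this is a swapped-in technique and uses neither RDKit nor Bio.PDB: it only relies on the fixed-column PDB layout, in which the chain ID of ATOM and HETATM records sits in column 22. The function names and the <stem>_<chain>.pdb naming scheme are assumptions made for this sketch, and other record types (TER, CONECT, header lines) are simply dropped.

```python
import gzip
from collections import defaultdict
from pathlib import Path

def split_by_chain(lines):
    """Group ATOM/HETATM records by the chain ID in column 22 of the PDB format."""
    chains = defaultdict(list)
    for line in lines:
        if line.startswith(("ATOM", "HETATM")) and len(line) > 21:
            chains[line[21]].append(line)   # column 22 is index 21
    return dict(chains)

def write_chains(pdb_gz, out_dir="."):
    """Write one <stem>_<chain>.pdb file per chain found in a gzipped PDB file."""
    stem = Path(pdb_gz).name.split(".")[0]  # 'foo.pdb.gz' -> 'foo'
    with gzip.open(pdb_gz, "rt") as handle:
        chains = split_by_chain(handle)
    written = []
    for chain_id, records in sorted(chains.items()):
        out_path = Path(out_dir) / f"{stem}_{chain_id}.pdb"
        out_path.write_text("".join(records) + "END\n")
        written.append(out_path)
    return written
```

Calling write_chains("foo.pdb.gz") on a file with chains A to D would then produce foo_A.pdb through foo_D.pdb in the current directory. Files with a blank chain ID or with several MODEL blocks would need extra handling that this sketch does not attempt.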