Re: [Rdkit-discuss] Get full matrix from GetEuclideanDistMat

2019-10-07 Thread Greg Landrum
Hi Lorenzo,

As you've discovered, GetEuclideanDistMat() just returns one triangle of
the matrix, flattened into a 1D vector.
I haven't tried to convert this back into an actual symmetric matrix (at
least I don't think I have), but it does look like using np.tri works. That
only sets the lower triangle, so you also need to add on the transpose.
Maybe try something like this (I've also simplified the calculation of n):

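import numpy as np
from rdkit.DataManip.Metric.rdMetricMatrixCalc import GetEuclideanDistMat

lower = GetEuclideanDistMat(descriptors.values)
n = len(descriptors.values)
# True strictly below the diagonal
mask = np.tri(n, dtype=bool, k=-1)
distances = np.zeros((n, n), dtype=float)
distances[mask] = lower
# mirror the lower triangle into the upper one
distances += distances.transpose()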


Note that if you have scikit-learn installed, it's *much* easier to use:
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.euclidean_distances.html
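
For anyone who wants to see the index bookkeeping spelled out, here is a minimal plain-Python sketch of the same expansion. It assumes the 1D vector is ordered row by row, d(1,0), d(2,0), d(2,1), d(3,0), ..., which is the order in which a boolean np.tri mask consumes it. The function names and the toy points are made up for illustration and are not part of RDKit.

```python
import math


def condensed_lower_triangle(points):
    """Pairwise Euclidean distances in row-major lower-triangle order:
    d(1,0), d(2,0), d(2,1), d(3,0), ..."""
    return [math.dist(points[i], points[j])
            for i in range(1, len(points))
            for j in range(i)]


def to_square(lower, n):
    """Expand a condensed lower triangle into a full symmetric n x n matrix."""
    if len(lower) != n * (n - 1) // 2:
        raise ValueError("expected n*(n-1)/2 entries")
    full = [[0.0] * n for _ in range(n)]
    k = 0
    for i in range(1, n):
        for j in range(i):
            # fill both halves so the result is symmetric
            full[i][j] = full[j][i] = lower[k]
            k += 1
    return full


# Toy data: three points on a line, 5 units apart.
points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
lower = condensed_lower_triangle(points)
square = to_square(lower, len(points))
```

A matrix with only the lower half filled would show a distance of zero for every pair read from the upper half, which is one way very small distances can appear for very different compounds.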

-greg


On Fri, Oct 4, 2019 at 5:14 PM Lorenzo Fabbri via Rdkit-discuss <
rdkit-discuss@lists.sourceforge.net> wrote:

> I have a matrix of descriptors and I want to use GetEuclideanDistMat to
> get the pairwise Euclidean distances. Once I compute it, I need to create a
> full matrix (number of compounds x number of compounds) from the 1D vector.
> I’m currently using
>
> lower = GetEuclideanDistMat(descriptors.values)
> n = int(np.sqrt(len(lower)*2)) + 1
> mask = np.tri(n, dtype=bool, k=-1)
> distances = np.zeros((n, n), dtype=float)
> distances[mask] = lower
>
> It seems to be working but I’m getting some weird results (very small
> distances for very different compounds), so I’m guessing I’m doing
> something wrong with the code above. Any suggestion?
> ___
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] how to handle metallocenes?

2019-10-07 Thread mikem
Hi Michal/Greg,

Many thanks for your thoughts. The compounds are from PubChem's Substances. I'm
inclined to filter out these types of molecules, but this may be hard to do
with billions of compounds...?

What would be an efficient way to parse drug-like compounds and reject
organometallics? Clearly, checking every atom/bond is too expensive.

Best

Mike






Re: [Rdkit-discuss] how to handle metallocenes?

2019-10-07 Thread Michal Krompiec
Dear Mike,
Try changing all metal-ligand bonds to "dative" or "ionic", and
standardize afterwards (but disable adjusting of implicit Hs). This
way, I was able to process (in KNIME) >99% of organometallics (incl.
metallocenes) downloaded from Reaxys.
Example snippet (which doesn't check the "directionality" of the bond, though):

from rdkit import Chem

# Illustrative list of metals; extend as needed.
metals = ['Ti', 'Al', 'Mo', 'Ru', 'Co', 'Rh', 'Ir', 'Ni', 'Zr', 'Hf', 'W']
# Run every sanitization step except the implicit-H adjustment.
ops = Chem.SanitizeFlags.SANITIZE_ALL ^ Chem.SanitizeFlags.SANITIZE_ADJUSTHS

# input_table is the table provided by the KNIME Python node.
mols = input_table['Molecule']
for mol in mols:
    for bond in mol.GetBonds():
        if (bond.GetEndAtom().GetSymbol() in metals or
                bond.GetBeginAtom().GetSymbol() in metals):
            print("found metal-ligand bond")
            print("original type: " + str(bond.GetBondType()))
            # The direction of the dative bond is not checked here.
            bond.SetBondType(Chem.rdchem.BondType.DATIVE)
            print("changed to: " + str(bond.GetBondType()))
    try:
        Chem.SanitizeMol(mol, sanitizeOps=ops)
    except ValueError as ve:
        print("Sanitization failed")
        print(ve)
output_table = input_table.copy()

Best,
Michal





Re: [Rdkit-discuss] how to handle metallocenes?

2019-10-07 Thread Greg Landrum
Hi Mike,

I think you mean "organometallics", not "metallocenes" (the first two
molecules in that SDF are coordination complexes, but neither is a
metallocene; I stopped looking after that). The compounds are also drawn in such a way
that they are chemically unreasonable. This is pretty typical for
organometallics in V2000 mol files.

Unless you have a reliable source of input molecules and/or are willing to
look at every one, I would just filter anything that has a metal-nonmetal
bond out of the dataset.
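
If the records are available as SMILES, a cheap text-level prefilter is one way to do this without building molecules at all. The sketch below is an assumption-laden illustration, not an RDKit feature: the metal list is deliberately incomplete, the function name is made up, and it treats any dot-separated fragment that contains a metal plus other atoms as "metal bonded to something". Lone ions such as [Na+] are not flagged.

```python
import re

# Illustrative, incomplete list of metal symbols (an assumption; extend as needed).
METALS = ["Li", "Na", "K", "Mg", "Ca", "Al", "Ti", "V", "Cr", "Mn", "Fe", "Co",
          "Ni", "Cu", "Zn", "Zr", "Mo", "Tc", "Ru", "Rh", "Pd", "Ag", "Sn",
          "Hf", "W", "Re", "Os", "Ir", "Pt", "Au", "Hg"]

# A metal inside a bracket atom: '[', optional isotope, the symbol, and no
# lowercase letter right after it (so 'K' does not match '[Kr]').
METAL_ATOM = re.compile(r"\[\d*(?:%s)(?![a-z])" % "|".join(METALS))
# A fragment that consists of exactly one bracket atom, e.g. '[Na+]'.
LONE_ATOM = re.compile(r"\[[^\]]*\]")


def has_bonded_metal(smiles):
    """True if any dot-separated fragment holds a metal together with other atoms."""
    for fragment in smiles.split("."):
        if METAL_ATOM.search(fragment) and not LONE_ATOM.fullmatch(fragment):
            return True
    return False


candidates = ["CC(=O)[O-].[Na+]", "C[Hg]C", "Cl[Pt](Cl)(N)N"]
kept = [smi for smi in candidates if not has_bonded_metal(smi)]
```

Because it never parses the structure, this only catches metals that are written as bonded in the input; complexes drawn as separate ions pass through, and anything that survives still needs the usual RDKit parsing and sanitization.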

If you really want to do something with the molecules:
The rdMolStandardize code, which is derived from MolVS, currently has one
approach for dealing with this type of complex: breaking all the covalent
bonds to the metal (this is also what InChI does). Given what a mess these
compounds are when they show up in most standard file formats, this seems
like a reasonable thing to do:

In [4]: from rdkit import Chem

In [5]: from rdkit.Chem.MolStandardize import rdMolStandardize

In [6]: dcon = rdMolStandardize.MetalDisconnector()
[14:34:03] Initializing MetalDisconnector

In [8]: suppl = Chem.SDMolSupplier('/home/glandrum/Downloads/RDKit_input.sdf',sanitize=False,removeHs=False)

In [9]: m = suppl[0]

In [10]: om = dcon.Disconnect(m)
[14:34:29] Running MetalDisconnector
[14:34:29] Removed covalent bond between Tc and O
[14:34:29] Removed covalent bond between Tc and O
[14:34:29] Removed covalent bond between Tc and S
[14:34:29] Removed covalent bond between Tc and S
[14:34:29] Removed covalent bond between Tc and P
[14:34:29] Removed covalent bond between Tc and P

In [11]: Chem.SanitizeMol(om)
Out[11]: rdkit.Chem.rdmolops.SanitizeFlags.SANITIZE_NONE

In [12]: Chem.MolToSmiles(om)
Out[12]: 'CSCC[C@@H](NC(=O)[C@@H](CC(C)C)NC(=O)[C@@H](Cc1cnc[nH]1)NC(=O)CNC(=O)[C@H](NC(=O)[C@@H](C)NC(=O)[C@H](CC(=O)[C@@H](CCC(N)=O)NC(=O)NC(=O)C(CC[SH-]CCC[PH-](CO)CO)[SH-]CCC[PH-](CO)CO)c1cc2c2[nH]1)C(C)C)C(N)=O.[99Tc+9].[Cl-].[O-2].[O-2]'


It's worth noting that this molecule is still a long way from making
chemical sense: the +9 charge on the Tc and the [SH-] and [PH-] groups are
not sensible. So there's more manual fixing required here.
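
As a quick check on those charges, the formal charges written in the SMILES can be tallied with a regular expression. This is a plain-text sketch rather than an RDKit call: the function name is made up, and the input is only the charged part of the output above (the neutral peptide portion is left out).

```python
import re

BRACKET_ATOM = re.compile(r"\[([^\]]+)\]")
CHARGE = re.compile(r"([+-]+)(\d*)")


def formal_charges(smiles):
    """Return (bracket atom, formal charge) for every charged bracket atom."""
    charges = []
    for body in BRACKET_ATOM.findall(smiles):
        match = CHARGE.search(body)
        if match is None:
            continue  # neutral bracket atom such as [C@@H] or [nH]
        signs, digits = match.groups()
        sign = 1 if signs[0] == "+" else -1
        # '+9' carries an explicit count; '++' is counted by repetition
        magnitude = int(digits) if digits else len(signs)
        charges.append((body, sign * magnitude))
    return charges


# Charged atoms taken from the disconnected SMILES above.
excerpt = ("C(CC[SH-]CCC[PH-](CO)CO)[SH-]CCC[PH-](CO)CO"
           ".[99Tc+9].[Cl-].[O-2].[O-2]")
charges = formal_charges(excerpt)
net_charge = sum(value for _, value in charges)
```

The tally comes out at zero: the +9 on Tc exactly offsets the negative charges on the S, P, O and Cl atoms. That is consistent with the charges being bookkeeping left over from breaking the bonds rather than a meaningful oxidation state, though that reading is only an inference from the arithmetic.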


Best,
-greg


On Mon, Oct 7, 2019 at 12:06 PM Mike Mazanetz wrote:

> Hello RDKit experts!
>
> Is there a function to handle metallocenes in the standardizer?
>
> I've enclosed some examples of compounds.
>
> Thanks,
>
> mike
>


[Rdkit-discuss] how to handle metallocenes?

2019-10-07 Thread Mike Mazanetz
Hello RDKit experts!

Is there a function to handle metallocenes in the standardizer?

I've enclosed some examples of compounds.

Thanks,

mike

RDKit_input.sdf
Description: chemical/mdl-sdfile


Re: [Rdkit-discuss] Saving chains from PDB file

2019-10-07 Thread Maciek Wójcikowski
If you turn off sanitization, the splitting should be super fast in RDKit
too, if that is the only thing you want to do.



Re: [Rdkit-discuss] Saving chains from PDB file

2019-10-07 Thread Téletchéa Stéphane

On 05/10/2019 at 12:46, Chris Swain via Rdkit-discuss wrote:

> Hi,
>
> I have a number of PDB files (foo.pdb.gz) and I want to separate each chain in
> each file out into a separate file. So if a file contains 4 chains it will
> generate 4 separate files.
>
> Can I do this using RDKit, if so how?
>
> Cheers
>
> Chris


Dear Chris,

Even though this could be done in RDKit, I would recommend using an
external tool, for instance Biopython and its Bio.PDB module
(https://biopython.org/wiki/The_Biopython_Structural_Bioinformatics_FAQ),
or ProDy (http://prody.csb.pitt.edu/).

RDKit needs to process a lot of atom definitions to load the PDB file
properly, and that takes time (minutes on my machine, which is a decent
workstation :-).

It will be lightning fast using Bio.PDB or ProDy, compared to RDKit.

If you still want to use RDKit only, and need to reuse the RDKit
representation of the PDB file, then pickle it (Python 3 shown; on
Python 2 use cPickle):

import pickle
from rdkit import Chem

def processReceptor(r):
    # Reuse the cached molecule if it exists; otherwise parse the PDB
    # file once and cache the result for the next run.
    try:
        with open('receptor.pkl', 'rb') as h:
            receptor = pickle.load(h)
    except Exception:
        receptor = Chem.MolFromPDBFile(r)
        with open('receptor.pkl', 'wb') as f:
            pickle.dump(receptor, f)

    return receptor
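
If all that is needed is one file per chain, the split can also be done on the text itself, without RDKit, Biopython or ProDy. The sketch below relies only on the fixed-column PDB layout, where the chain identifier of an ATOM/HETATM record sits in column 22. The function names, the output naming scheme and the demo records are assumptions made up for illustration; header, TER and CONECT records are simply dropped.

```python
import gzip
from collections import defaultdict


def split_chains(lines):
    """Group ATOM/HETATM records by the chain ID found in column 22."""
    chains = defaultdict(list)
    for line in lines:
        if line.startswith(("ATOM", "HETATM")) and len(line) > 21:
            chains[line[21]].append(line)
    return dict(chains)


def split_pdb_file(path, prefix):
    """Write <prefix>_<chain>.pdb for each chain in a .pdb or .pdb.gz file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        chains = split_chains(handle)
    for chain_id, records in chains.items():
        label = chain_id.strip() or "_"  # a blank chain ID gets a placeholder
        with open(f"{prefix}_{label}.pdb", "w") as out:
            out.writelines(records)
            out.write("END\n")
    return sorted(chains)


# In-memory demo records (made up): one atom in chain A, two records in chain B.
demo = [
    "REMARK   demo records\n",
    "ATOM      1  N   MET A   1      27.340  24.430   2.614  1.00  9.67\n",
    "ATOM      2  CA  GLY B   1      26.266  25.413   2.842  1.00 10.38\n",
    "HETATM  100  O   HOH B 201      10.000  11.000  12.000  1.00 20.00\n",
]
by_chain = split_chains(demo)
```

Calling split_pdb_file('foo.pdb.gz', 'foo') would then write foo_A.pdb, foo_B.pdb and so on for the chains present in that file.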

HTH,

Stéphane

--
Assistant Professor in BioInformatics, UFIP, UMR 6286 CNRS, Team Protein 
Design In Silico
UFR Sciences et Techniques, 2, rue de la Houssinière, Bât. 25, 44322 
Nantes cedex 03, France

Tél : +33 251 125 636 / Fax : +33 251 125 632
http://www.ufip.univ-nantes.fr/ - http://www.steletch.org

