Re: [Rdkit-discuss] Applying Lipinsky filter on ligand data set

Gustavo Seabra Wed, 02 Dec 2020 06:45:54 -0800

Great,  I'm glad it works for you now.

As for the fikes that don't work, you could try loading them individually to 
look into them,  or save the molecules again.


If you could share the molecules here, maybe someone could find what is the 
problem. (I'd recommend starting a new thread for it)

All the best,
Gustavo.

--
Gustavo Seabra

________________________________
From: Jeff Saxon <[email protected]>
Sent: Wednesday, December 2, 2020 9:37:01 AM
To: Gustavo Seabra <[email protected]>; 
[email protected] <[email protected]>
Subject: Re: [Rdkit-discuss] Applying Lipinsky filter on ligand data set

Thank you again, Gustato!

Here is how I adopted your script for multi-SDF filles:
Note that I added directly to the script, a new datafile called 'All',
into which I append each of the datafiles produced by your function
using FOR loop ..
Also I added TRY statement within FOR loop to ignore these two SDF
caused a problem. However, I have no idea why they don't work (there
are 2 filles from 1000, which in Pymol looks fine!)


import subprocess, os, glob, shutil, sys
import pandas as pd

from rdkit import Chem, DataStructs
from rdkit.Chem import Draw, PandasTools, Descriptors, rdMolDescriptors, AllChem
from IPython.display import HTML

# the main function
def load_sdf_file(file, key):
"""
Reads molecules from an SDF file keeping only molecules
with valid SMILES, and assign a source field
"""
df = PandasTools.LoadSDF(file)
df['LIGAND'] = key
#df['SMILES'] = df['ROMol'].apply(Chem.MolToSmiles)
df['LogP'] = df['ROMol'].apply(Chem.Descriptors.MolLogP)
df['MolWt'] = df['ROMol'].apply(Chem.Descriptors.MolWt)
df['HBA'] = df['ROMol'].apply(Chem.rdMolDescriptors.CalcNumLipinskiHBA)
df['HBD'] = df['ROMol'].apply(Chem.rdMolDescriptors.CalcNumLipinskiHBD)
df = df[['LIGAND','LogP','MolWt','HBA','HBD']]
return df


pwd = os.getcwd()
filles='sdf'
results='results'
#set directory to analyse
data = os.path.join(pwd,filles)
#set directory with outputs
results = os.path.join(pwd,results)

os.chdir(data)

all = pd.DataFrame()
for sdf in dirlist:
try:
sdf_name=sdf.rsplit( ".", 1 )[ 0 ]
key = f'{sdf_name}'
df = load_sdf_file(sdf,key)
all = all.append(df,ignore_index = True)
print(f'{sdf_name}.sdf has been processed')
except:
print(f'{sdf_name}.sdf has not been processed')
# make a log of broken sdf filles
with open(results+"/log.txt", "a") as log:
log.write("%s has not been processed\n" %(key))

ср, 2 дек. 2020 г. в 13:55, Gustavo Seabra <[email protected]>:
>
> Yes, the way it is written it will only keep the last sdf file read. I can 
> think of 2 options:
>
> 1. You can concatenate all sdfs into one,  multi-molecule file:
> $ cat *.sdf > multi.sdf
>
> And read this one.
>
> 2. Alternatively,  instead of overwriting the final pandas dataframe every 
> time, you can create one initial df then only concatenate it with the results 
> of the function (see 
> https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html)
>
> data = 
> pd.DataFrame(columns=['Source','LogP','MolWt','LipinskyHBA','LipinskyHBD])
>
> Then, for each file:
> data = data.append(load_sdf_file(sdf,key))
>
> If possible, I believe option (1) should be faster.
>
>
> As for the error you are seeing,  sometimes RDKit cannot read a molecule, so 
> it returns no 'ROMol' object. It usually happens when the molecule is 
> ill-defined. If you really need to read the molecules one-by-one, then you 
> will need to treat this situation maybe with an 'if' statement in the 
> function. If you read a multi-molecule sdf, it just ignores the molecules it 
> can't read and keeps going.
>
> Ah, I dont think there is a function to use pdb files with Pandas. SDF is a 
> better format for small molecules,  anyway.
>
> All the best,
>
> --
> Gustavo Seabra
>
> ________________________________
> From: Jeff Saxon <[email protected]>
> Sent: Wednesday, December 2, 2020 4:53:05 AM
> To: Gustavo Seabra <[email protected]>; 
> [email protected] <[email protected]>
> Subject: Re: [Rdkit-discuss] Applying Lipinsky filter on ligand data set
>
> Hey Gustavo,
>
> Thank you very much for your script!
> I need to specify that I am working with many SDF filles, each of
> which consist of one 3D structure of the ligand ( I don't see any
> difference here between pdb, so if I can apply it on PDB directly it
> would be rather better!!)
>  Anyway I've just tried to adapt you script for my case
>
> # I simplify the function to take only 4 properties required for
> lipinsky calculations,
> # I also substitute Source on the name of the particular SDF file (See below)
> def load_sdf_file(file, key):
> """
> Reads molecules from an SDF file keeping only molecules
> with valid SMILES, and assign a source field
> """
> df = PandasTools.LoadSDF(file)
> df['Source'] = key
> df['LogP'] = df['ROMol'].apply(Chem.Descriptors.MolLogP)
> df['MolWt'] = df['ROMol'].apply(Chem.Descriptors.MolWt)
> df['LipinskyHBA'] = 
> df['ROMol'].apply(Chem.rdMolDescriptors.CalcNumLipinskiHBA)
> df['LipinskyHBD'] = 
> df['ROMol'].apply(Chem.rdMolDescriptors.CalcNumLipinskiHBD)
> df = df[['Source','LogP','MolWt','LipinskyHBA','LipinskyHBD']]
> return df
>
>
> pwd = os.getcwd()
> filles='sdf'
> results='results'
> #set directory to analyse
> data = os.path.join(pwd,filles)
> #set directory with outputs
> results = os.path.join(pwd,results)
>
> # go to the folder with all SDF filles
> os.chdir(data)
>
> # loop each SDF and use it with the function
> for sdf in dirlist:
> sdf_name=sdf.rsplit( ".", 1 )[ 0 ]
> key = f'{sdf_name}'
> df = load_sdf_file(sdf,key)
> print(f'{sdf_name}.sdf has been processed')
>
> The problem is that it always stores the last line within DF, while I
> need rather to append each processed SDF file. Also I've got an error
> on one of the sdf file which interrupted the script:
>
> Traceback (most recent call last):
>
>   File "./lipinski2.py", line 67, in <module>
>
>     df = load_sdf_file(sdf,key)
>
>   File "./lipinski2.py", line 26, in load_sdf_file
>
>     df['LogP']   = df['ROMol'].apply(Chem.Descriptors.MolLogP)
>
>   File 
> "/Users/gleb/opt/miniconda3/envs/my-rdkit-env/lib/python3.7/site-packages/pandas/core/frame.py",
> line 2906, in __getitem__
>
>     indexer = self.columns.get_loc(key)
>
>   File 
> "/Users/gleb/opt/miniconda3/envs/my-rdkit-env/lib/python3.7/site-packages/pandas/core/indexes/base.py",
> line 2897, in get_loc
>
>     raise KeyError(key) from err
>
> KeyError: 'ROMol'
>
> Probably some additional IF statement is required to ignore the file
> in the case of "broken" SDF...
>
> вт, 1 дек. 2020 г. в 19:07, Gustavo Seabra <[email protected]>:
> >
> > Hi Jeff,
> >
> >
> >
> > There's a lot f people here with way more experience than me, so this may 
> > not be the optimal solution... But here is what I would do in this case:
> >
> >
> >
> > from rdkit import Chem, DataStructs
> >
> > from rdkit.Chem import Draw, PandasTools, Descriptors, rdMolDescriptors
> >
> > from IPython.display import HTML
> >
> >
> >
> > def load_sdf_file(file,source,id_column):
> >
> >     """
> >
> >     Reads molecules from an SDF file keeping only molecules
> >
> >     with valid SMILES, and assign a source field
> >
> >     """
> >
> >     df = PandasTools.LoadSDF(file)
> >
> >     df['Source'] = source
> >
> >     df['ID'] = df[id_column]
> >
> >     df['SMILES'] = df['ROMol'].apply(Chem.MolToSmiles)
> >
> >     df['LogP']   = df['ROMol'].apply(Chem.Descriptors.MolLogP)
> >
> >     df['MolWt']  = df['ROMol'].apply(Chem.Descriptors.MolWt)
> >
> >     df['LipinskyHBA'] = 
> > df['ROMol'].apply(Chem.rdMolDescriptors.CalcNumLipinskiHBA)
> >
> >     df['LipinskyHBD'] = 
> > df['ROMol'].apply(Chem.rdMolDescriptors.CalcNumLipinskiHBD)
> >
> >
> >
> >     df = 
> > df[['Source','ID','SMILES','LogP','MolWt','LipinskyHBA','LipinskyHBD','ROMol']]
> >
> >     return df
> >
> >
> >
> > df = load_sdf_file("chembl-26_phase-1.sdf","ChEMBL_Phase-1","ID")
> >
> > df.head() #Should show the top of the DataFrame, with the properties and 
> > the structures.
> >
> >
> >
> >
> >
> > All the best,
> >
> > --
> >
> > Gustavo Seabra
> >
> >
> >
> > -----Original Message-----
> > From: Jeff Saxon <[email protected]>
> > Sent: Tuesday, December 1, 2020 7:35 AM
> > To: [email protected]
> > Subject: [Rdkit-discuss] Applying Lipinsky filter on ligand data set
> >
> >
> >
> > Dear All,
> >
> >
> >
> > I've just started working with RDKIT focusing on the application of the 
> > Lipinsky rule on the set of my ligands. Basically I take a 3D coordinates 
> > of each ligand file (in SDF format) and then calculate for it required 4 
> > properties Here is my code:
> >
> > # make a list of all .sdf filles present in data folder:
> >
> >     dirlist = [os.path.basename(p) for p in glob.glob('data' + '/*.sdf')]
> >
> >
> >
> >     # create empty data file with 5 columns:
> >
> >     # name of the file,  value of variable p, value of ac, value of don, 
> > value of wt
> >
> >     df = pd.DataFrame(columns=["key", "p", "ac", "don", "wt"])
> >
> >
> >
> >     # for each sdf file get its name and calculate 4 different
> >
> > properties: p, ac, don, wt
> >
> > for sdf in dirlist:
> >
> > sdf_name=sdf.rsplit( ".", 1 )[ 0 ]
> >
> > key = f'{sdf_name}'
> >
> > mol = open(sdf,'rb')
> >
> > m = Chem.ForwardSDMolSupplier(mol)
> >
> > for conf in m:
> >
> > if conf is None: continue
> >
> > p = MolLogP(conf) # coeff conc-perm
> >
> > ac = CalcNumLipinskiHBA(conf)#
> >
> > don = CalcNumLipinskiHBD(conf)
> >
> > wt = MolWt(conf)
> >
> > #two=AllChem.Compute2DCoords(conf)
> >
> > Draw.MolToFile(conf,results+f'/{key}.png')
> >
> > #df[key] = [p, ac, don, wt]
> >
> >
> >
> > Could you suggest how can I summarize the calculation of each ligand in 
> > pandas-like DF and to then apply lipinsky filter on it?
> >
> > Is it possible to convert 3D coordinates to 2D in order that I could draw 
> > it (presently it makes a sketch based on 3d coordinates directly from SDF)?
> >
> >
> >
> >
> >
> > _______________________________________________
> >
> > Rdkit-discuss mailing list
> >
> > [email protected]
> >
> > https://lists.sourceforge.net/lists/listinfo/rdkit-discuss

_______________________________________________
Rdkit-discuss mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss

Re: [Rdkit-discuss] Applying Lipinsky filter on ligand data set

Reply via email to