Re: [Rdkit-discuss] Applying Lipinsky filter on ligand data set

Jeff Saxon Wed, 02 Dec 2020 07:29:40 -0800

OK, thank you!
One question, which I also raised in my first message: can your function
also be used to draw a 2D sketch of each ligand? I am not sure whether I
can take anything from the data file, since it was created to store the
values for the Lipinski calculations. Do I need to use the SMILES column
for it?
This is how I did it in my first script (the problem was that it
used the 3D coordinates directly, while I was looking for a way to
convert them to 2D):


for conf in m:
    if conf is None:
        continue
    # draw the ligand, taking the conformation directly from the 3D file
    Draw.MolToFile(conf, results + f'/{key}.png')
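
For reference, a minimal sketch of one way to get a 2D depiction from a 3D
structure without going through the SMILES column. It assumes the molecule
object itself is still available (for example from the supplier loop above,
or from the 'ROMol' column if that column is kept in the DataFrame). The
helper names to_2d and draw_2d are hypothetical; Compute2DCoords and
Draw.MolToFile are the RDKit calls doing the work.

```python
from rdkit import Chem
from rdkit.Chem import AllChem, Draw


def to_2d(mol):
    """Return a copy of `mol` whose only conformer holds 2D coordinates."""
    flat = Chem.Mol(mol)            # copy, so the original 3D conformer is kept
    AllChem.Compute2DCoords(flat)   # by default replaces existing conformers
    return flat


def draw_2d(mol, png_path):
    """Write a 2D depiction of `mol` to `png_path` (hypothetical helper)."""
    Draw.MolToFile(to_2d(mol), png_path)
```

Inside the loop this would be called as draw_2d(conf, results + f'/{key}.png')
in place of the direct Draw.MolToFile call. Working on a copy is a design
choice: the 3D coordinates stay untouched in case they are needed later.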

On Wed, 2 Dec 2020 at 15:44, Gustavo Seabra <gustavo.sea...@gmail.com> wrote:
>
> Great, I'm glad it works for you now.
>
> As for the files that don't work, you could try loading them individually to
> look into them, or save the molecules again.
>
> If you could share the molecules here, maybe someone could find what the
> problem is. (I'd recommend starting a new thread for it.)
>
> All the best,
> Gustavo.
>
> --
> Gustavo Seabra
>
> ________________________________
> From: Jeff Saxon <jmsstarli...@gmail.com>
> Sent: Wednesday, December 2, 2020 9:37:01 AM
> To: Gustavo Seabra <gustavo.sea...@gmail.com>; 
> rdkit-discuss@lists.sourceforge.net <rdkit-discuss@lists.sourceforge.net>
> Subject: Re: [Rdkit-discuss] Applying Lipinsky filter on ligand data set
>
> Thank you again, Gustavo!
>
> Here is how I adapted your script for multiple SDF files.
> Note that I added a new DataFrame, 'all_data', directly to the script;
> it combines the DataFrames produced by your function for each file
> in the for loop.
> I also added a try statement within the for loop to skip the two SDF
> files that caused a problem. However, I have no idea why they don't
> work (there are 2 such files out of 1000, and they look fine in PyMOL!)
>
>
> import subprocess, os, glob, shutil, sys
> import pandas as pd
>
> from rdkit import Chem, DataStructs
> from rdkit.Chem import Draw, PandasTools, Descriptors, rdMolDescriptors, AllChem
> from IPython.display import HTML
>
> # the main function
> def load_sdf_file(file, key):
>     """
>     Reads molecules from an SDF file keeping only molecules
>     with valid SMILES, and assign a source field
>     """
>     df = PandasTools.LoadSDF(file)
>     df['LIGAND'] = key
>     #df['SMILES'] = df['ROMol'].apply(Chem.MolToSmiles)
>     df['LogP'] = df['ROMol'].apply(Descriptors.MolLogP)
>     df['MolWt'] = df['ROMol'].apply(Descriptors.MolWt)
>     df['HBA'] = df['ROMol'].apply(rdMolDescriptors.CalcNumLipinskiHBA)
>     df['HBD'] = df['ROMol'].apply(rdMolDescriptors.CalcNumLipinskiHBD)
>     df = df[['LIGAND','LogP','MolWt','HBA','HBD']]
>     return df
>
>
> pwd = os.getcwd()
> filles = 'sdf'
> results = 'results'
> # set directory to analyse
> data = os.path.join(pwd, filles)
> # set directory with outputs
> results = os.path.join(pwd, results)
>
> os.chdir(data)
> # list of SDF files to process (built as in my first message)
> dirlist = sorted(glob.glob('*.sdf'))
>
> frames = []
> for sdf in dirlist:
>     sdf_name = sdf.rsplit(".", 1)[0]
>     key = f'{sdf_name}'
>     try:
>         frames.append(load_sdf_file(sdf, key))
>         print(f'{sdf_name}.sdf has been processed')
>     except Exception:
>         print(f'{sdf_name}.sdf has not been processed')
>         # make a log of broken sdf files
>         with open(os.path.join(results, "log.txt"), "a") as log:
>             log.write("%s has not been processed\n" % key)
>
> # combine the per-file results (DataFrame.append was removed in pandas 2.0)
> all_data = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
>
>
> On Wed, 2 Dec 2020 at 13:55, Gustavo Seabra <gustavo.sea...@gmail.com> wrote:
> >
> > Yes, the way it is written it will only keep the last sdf file read. I can 
> > think of 2 options:
> >
> > 1. You can concatenate all SDFs into one multi-molecule file:
> > $ cat *.sdf > multi.sdf
> >
> > And read this one.
> >
> > 2. Alternatively, instead of overwriting the final pandas DataFrame every
> > time, you can create one initial df and then concatenate it with the
> > results of the function (see
> > https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html)
> >
> > data = pd.DataFrame(columns=['Source','LogP','MolWt','LipinskyHBA','LipinskyHBD'])
> >
> > Then, for each file:
> > data = pd.concat([data, load_sdf_file(sdf, key)], ignore_index=True)
> >
> > If possible, I believe option (1) should be faster.
> >
> >
> > As for the error you are seeing, sometimes RDKit cannot read a molecule,
> > so no 'ROMol' object is returned. This usually happens when the molecule is
> > ill-defined. If you really need to read the molecules one by one, then you
> > will need to handle this situation, maybe with an 'if' statement in the
> > function. If you read a multi-molecule SDF, it just ignores the molecules
> > it can't read and keeps going.
> >
> > Ah, I don't think there is a function to use PDB files with pandas. SDF is a
> > better format for small molecules anyway.
> >
> > All the best,
> >
> > --
> > Gustavo Seabra
> >
> > ________________________________
> > From: Jeff Saxon <jmsstarli...@gmail.com>
> > Sent: Wednesday, December 2, 2020 4:53:05 AM
> > To: Gustavo Seabra <gustavo.sea...@gmail.com>; 
> > rdkit-discuss@lists.sourceforge.net <rdkit-discuss@lists.sourceforge.net>
> > Subject: Re: [Rdkit-discuss] Applying Lipinsky filter on ligand data set
> >
> > Hey Gustavo,
> >
> > Thank you very much for your script!
> > I should mention that I am working with many SDF files, each of
> > which consists of one 3D structure of a ligand (I don't see any
> > difference from PDB here, so if I could apply it to PDB files directly
> > that would be even better!)
> > Anyway, I've just tried to adapt your script for my case:
> >
> > # I simplified the function to take only the 4 properties required for
> > # the Lipinski calculations.
> > # I also replaced Source with the name of the particular SDF file (see below)
> > def load_sdf_file(file, key):
> >     """
> >     Reads molecules from an SDF file keeping only molecules
> >     with valid SMILES, and assign a source field
> >     """
> >     df = PandasTools.LoadSDF(file)
> >     df['Source'] = key
> >     df['LogP'] = df['ROMol'].apply(Chem.Descriptors.MolLogP)
> >     df['MolWt'] = df['ROMol'].apply(Chem.Descriptors.MolWt)
> >     df['LipinskyHBA'] = df['ROMol'].apply(Chem.rdMolDescriptors.CalcNumLipinskiHBA)
> >     df['LipinskyHBD'] = df['ROMol'].apply(Chem.rdMolDescriptors.CalcNumLipinskiHBD)
> >     df = df[['Source','LogP','MolWt','LipinskyHBA','LipinskyHBD']]
> >     return df
> >
> >
> > pwd = os.getcwd()
> > filles='sdf'
> > results='results'
> > # set directory to analyse
> > data = os.path.join(pwd,filles)
> > # set directory with outputs
> > results = os.path.join(pwd,results)
> >
> > # go to the folder with all SDF files
> > os.chdir(data)
> >
> > # loop over each SDF and pass it to the function
> > for sdf in dirlist:
> >     sdf_name=sdf.rsplit( ".", 1 )[ 0 ]
> >     key = f'{sdf_name}'
> >     df = load_sdf_file(sdf,key)
> >     print(f'{sdf_name}.sdf has been processed')
> >
> > The problem is that df always holds only the last processed file, whereas I
> > need to append the result of each processed SDF file. I also got an error
> > on one of the SDF files, which interrupted the script:
> >
> > Traceback (most recent call last):
> >   File "./lipinski2.py", line 67, in <module>
> >     df = load_sdf_file(sdf,key)
> >   File "./lipinski2.py", line 26, in load_sdf_file
> >     df['LogP']   = df['ROMol'].apply(Chem.Descriptors.MolLogP)
> >   File "/Users/gleb/opt/miniconda3/envs/my-rdkit-env/lib/python3.7/site-packages/pandas/core/frame.py", line 2906, in __getitem__
> >     indexer = self.columns.get_loc(key)
> >   File "/Users/gleb/opt/miniconda3/envs/my-rdkit-env/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2897, in get_loc
> >     raise KeyError(key) from err
> > KeyError: 'ROMol'
> >
> > Probably some additional IF statement is required to ignore the file
> > in the case of "broken" SDF...
> >
> > On Tue, 1 Dec 2020 at 19:07, Gustavo Seabra <gustavo.sea...@gmail.com> wrote:
> > >
> > > Hi Jeff,
> > >
> > >
> > >
> > > There are a lot of people here with way more experience than me, so this may
> > > not be the optimal solution... But here is what I would do in this case:
> > >
> > >
> > >
> > > from rdkit import Chem, DataStructs
> > > from rdkit.Chem import Draw, PandasTools, Descriptors, rdMolDescriptors
> > > from IPython.display import HTML
> > >
> > > def load_sdf_file(file,source,id_column):
> > >     """
> > >     Reads molecules from an SDF file keeping only molecules
> > >     with valid SMILES, and assign a source field
> > >     """
> > >     df = PandasTools.LoadSDF(file)
> > >     df['Source'] = source
> > >     df['ID'] = df[id_column]
> > >     df['SMILES'] = df['ROMol'].apply(Chem.MolToSmiles)
> > >     df['LogP']   = df['ROMol'].apply(Chem.Descriptors.MolLogP)
> > >     df['MolWt']  = df['ROMol'].apply(Chem.Descriptors.MolWt)
> > >     df['LipinskyHBA'] = df['ROMol'].apply(Chem.rdMolDescriptors.CalcNumLipinskiHBA)
> > >     df['LipinskyHBD'] = df['ROMol'].apply(Chem.rdMolDescriptors.CalcNumLipinskiHBD)
> > >
> > >     df = df[['Source','ID','SMILES','LogP','MolWt','LipinskyHBA','LipinskyHBD','ROMol']]
> > >     return df
> > >
> > > df = load_sdf_file("chembl-26_phase-1.sdf","ChEMBL_Phase-1","ID")
> > > df.head()  # Should show the top of the DataFrame, with the properties and the structures.
> > >
> > >
> > >
> > >
> > >
> > > All the best,
> > >
> > > --
> > >
> > > Gustavo Seabra
> > >
> > >
> > >
> > > -----Original Message-----
> > > From: Jeff Saxon <jmsstarli...@gmail.com>
> > > Sent: Tuesday, December 1, 2020 7:35 AM
> > > To: rdkit-discuss@lists.sourceforge.net
> > > Subject: [Rdkit-discuss] Applying Lipinsky filter on ligand data set
> > >
> > >
> > >
> > > Dear All,
> > >
> > >
> > >
> > > > I've just started working with RDKit, focusing on applying the
> > > > Lipinski rule to my set of ligands. Basically, I take the 3D coordinates
> > > > of each ligand file (in SDF format) and then calculate the 4 required
> > > > properties for it. Here is my code:
> > >
> > > > # make a list of all .sdf files present in data folder:
> > > > dirlist = [os.path.basename(p) for p in glob.glob('data' + '/*.sdf')]
> > > >
> > > > # create empty DataFrame with 5 columns:
> > > > # name of the file, value of variable p, value of ac, value of don, value of wt
> > > > df = pd.DataFrame(columns=["key", "p", "ac", "don", "wt"])
> > > >
> > > > # for each sdf file get its name and calculate 4 different
> > > > # properties: p, ac, don, wt
> > > > for sdf in dirlist:
> > > >     sdf_name=sdf.rsplit( ".", 1 )[ 0 ]
> > > >     key = f'{sdf_name}'
> > > >     mol = open(sdf,'rb')
> > > >     m = Chem.ForwardSDMolSupplier(mol)
> > > >     for conf in m:
> > > >         if conf is None:
> > > >             continue
> > > >         p = MolLogP(conf) # coeff conc-perm
> > > >         ac = CalcNumLipinskiHBA(conf)
> > > >         don = CalcNumLipinskiHBD(conf)
> > > >         wt = MolWt(conf)
> > > >         #two=AllChem.Compute2DCoords(conf)
> > > >         Draw.MolToFile(conf,results+f'/{key}.png')
> > > >         #df[key] = [p, ac, don, wt]
> > >
> > >
> > >
> > > > Could you suggest how I can summarize the calculations for each ligand in a
> > > > pandas DataFrame and then apply the Lipinski filter to it?
> > > >
> > > > Is it possible to convert the 3D coordinates to 2D so that I can draw
> > > > the ligand (at present it makes a sketch based directly on the 3D
> > > > coordinates from the SDF)?
> > >
> > >
> > >
> > >
> > >
> > > _______________________________________________
> > >
> > > Rdkit-discuss mailing list
> > >
> > > Rdkit-discuss@lists.sourceforge.net
> > >
> > > https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
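
The thread collects the four descriptors but never shows the filtering step
itself, so here is a rough sketch of how the pieces discussed above could fit
together. It is only one possible arrangement, not the method from any of
the messages: the function names (lipinski_row, collect_descriptors,
lipinski_filter) and the max_violations parameter are hypothetical, and the
thresholds are the usual rule-of-five cut-offs (MolWt <= 500, LogP <= 5,
HBA <= 10, HBD <= 5). Unreadable entries are skipped with an 'if' check, as
suggested earlier, because Chem.SDMolSupplier yields None for them.

```python
import glob
import os

import pandas as pd
from rdkit import Chem
from rdkit.Chem import Descriptors, rdMolDescriptors

COLUMNS = ["LIGAND", "LogP", "MolWt", "HBA", "HBD"]


def lipinski_row(mol, name):
    """Compute the four rule-of-five descriptors for one molecule."""
    return {
        "LIGAND": name,
        "LogP": Descriptors.MolLogP(mol),
        "MolWt": Descriptors.MolWt(mol),
        "HBA": rdMolDescriptors.CalcNumLipinskiHBA(mol),
        "HBD": rdMolDescriptors.CalcNumLipinskiHBD(mol),
    }


def collect_descriptors(sdf_dir):
    """Read every *.sdf in `sdf_dir`; return (table, names of unreadable files)."""
    rows, failed = [], []
    for path in sorted(glob.glob(os.path.join(sdf_dir, "*.sdf"))):
        name = os.path.splitext(os.path.basename(path))[0]
        # the supplier yields None for entries RDKit cannot parse
        mols = [m for m in Chem.SDMolSupplier(path) if m is not None]
        if not mols:
            failed.append(name)
            continue
        rows.extend(lipinski_row(m, name) for m in mols)
    return pd.DataFrame(rows, columns=COLUMNS), failed


def lipinski_filter(table, max_violations=0):
    """Keep rows breaking at most `max_violations` of the four rules."""
    violations = (
        (table["MolWt"] > 500).astype(int)
        + (table["LogP"] > 5).astype(int)
        + (table["HBA"] > 10).astype(int)
        + (table["HBD"] > 5).astype(int)
    )
    return table[violations <= max_violations]
```

Typical use would be table, failed = collect_descriptors('sdf') followed by
passed = lipinski_filter(table). Building a list of rows and creating the
DataFrame once avoids growing a DataFrame inside the loop, and setting
max_violations=1 gives the more lenient reading of the rule that tolerates
a single violation.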


