Re: [Rdkit-discuss] Applying Lipinsky filter on ligand data set

Gustavo Seabra Wed, 02 Dec 2020 04:57:23 -0800

Yes, the way it is written it will only keep the last sdf file read. I can 
think of 2 options:


1. You can concatenate all sdfs into one,  multi-molecule file:
$ cat *.sdf > multi.sdf

And read this one.

2. Alternatively,  instead of overwriting the final pandas dataframe every 
time, you can create one initial df then only concatenate it with the results 
of the function (see 
https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html)

data = 
pd.DataFrame(columns=['Source','LogP','MolWt','LipinskyHBA','LipinskyHBD])

Then, for each file:
data = data.append(load_sdf_file(sdf,key))

If possible, I believe option (1) should be faster.


As for the error you are seeing,  sometimes RDKit cannot read a molecule, so it 
returns no 'ROMol' object. It usually happens when the molecule is ill-defined. 
If you really need to read the molecules one-by-one, then you will need to 
treat this situation maybe with an 'if' statement in the function. If you read 
a multi-molecule sdf, it just ignores the molecules it can't read and keeps 
going.

Ah, I dont think there is a function to use pdb files with Pandas. SDF is a 
better format for small molecules,  anyway.

All the best,

--
Gustavo Seabra

________________________________
From: Jeff Saxon <jmsstarli...@gmail.com>
Sent: Wednesday, December 2, 2020 4:53:05 AM
To: Gustavo Seabra <gustavo.sea...@gmail.com>; 
rdkit-discuss@lists.sourceforge.net <rdkit-discuss@lists.sourceforge.net>
Subject: Re: [Rdkit-discuss] Applying Lipinsky filter on ligand data set

Hey Gustavo,

Thank you very much for your script!
I need to specify that I am working with many SDF filles, each of
which consist of one 3D structure of the ligand ( I don't see any
difference here between pdb, so if I can apply it on PDB directly it
would be rather better!!)
 Anyway I've just tried to adapt you script for my case

# I simplify the function to take only 4 properties required for
lipinsky calculations,
# I also substitute Source on the name of the particular SDF file (See below)
def load_sdf_file(file, key):
"""
Reads molecules from an SDF file keeping only molecules
with valid SMILES, and assign a source field
"""
df = PandasTools.LoadSDF(file)
df['Source'] = key
df['LogP'] = df['ROMol'].apply(Chem.Descriptors.MolLogP)
df['MolWt'] = df['ROMol'].apply(Chem.Descriptors.MolWt)
df['LipinskyHBA'] = df['ROMol'].apply(Chem.rdMolDescriptors.CalcNumLipinskiHBA)
df['LipinskyHBD'] = df['ROMol'].apply(Chem.rdMolDescriptors.CalcNumLipinskiHBD)
df = df[['Source','LogP','MolWt','LipinskyHBA','LipinskyHBD']]
return df


pwd = os.getcwd()
filles='sdf'
results='results'
#set directory to analyse
data = os.path.join(pwd,filles)
#set directory with outputs
results = os.path.join(pwd,results)

# go to the folder with all SDF filles
os.chdir(data)

# loop each SDF and use it with the function
for sdf in dirlist:
sdf_name=sdf.rsplit( ".", 1 )[ 0 ]
key = f'{sdf_name}'
df = load_sdf_file(sdf,key)
print(f'{sdf_name}.sdf has been processed')

The problem is that it always stores the last line within DF, while I
need rather to append each processed SDF file. Also I've got an error
on one of the sdf file which interrupted the script:

Traceback (most recent call last):

  File "./lipinski2.py", line 67, in <module>

    df = load_sdf_file(sdf,key)

  File "./lipinski2.py", line 26, in load_sdf_file

    df['LogP']   = df['ROMol'].apply(Chem.Descriptors.MolLogP)

  File 
"/Users/gleb/opt/miniconda3/envs/my-rdkit-env/lib/python3.7/site-packages/pandas/core/frame.py",
line 2906, in __getitem__

    indexer = self.columns.get_loc(key)

  File 
"/Users/gleb/opt/miniconda3/envs/my-rdkit-env/lib/python3.7/site-packages/pandas/core/indexes/base.py",
line 2897, in get_loc

    raise KeyError(key) from err

KeyError: 'ROMol'

Probably some additional IF statement is required to ignore the file
in the case of "broken" SDF...

вт, 1 дек. 2020 г. в 19:07, Gustavo Seabra <gustavo.sea...@gmail.com>:
>
> Hi Jeff,
>
>
>
> There's a lot f people here with way more experience than me, so this may not 
> be the optimal solution... But here is what I would do in this case:
>
>
>
> from rdkit import Chem, DataStructs
>
> from rdkit.Chem import Draw, PandasTools, Descriptors, rdMolDescriptors
>
> from IPython.display import HTML
>
>
>
> def load_sdf_file(file,source,id_column):
>
>     """
>
>     Reads molecules from an SDF file keeping only molecules
>
>     with valid SMILES, and assign a source field
>
>     """
>
>     df = PandasTools.LoadSDF(file)
>
>     df['Source'] = source
>
>     df['ID'] = df[id_column]
>
>     df['SMILES'] = df['ROMol'].apply(Chem.MolToSmiles)
>
>     df['LogP']   = df['ROMol'].apply(Chem.Descriptors.MolLogP)
>
>     df['MolWt']  = df['ROMol'].apply(Chem.Descriptors.MolWt)
>
>     df['LipinskyHBA'] = 
> df['ROMol'].apply(Chem.rdMolDescriptors.CalcNumLipinskiHBA)
>
>     df['LipinskyHBD'] = 
> df['ROMol'].apply(Chem.rdMolDescriptors.CalcNumLipinskiHBD)
>
>
>
>     df = 
> df[['Source','ID','SMILES','LogP','MolWt','LipinskyHBA','LipinskyHBD','ROMol']]
>
>     return df
>
>
>
> df = load_sdf_file("chembl-26_phase-1.sdf","ChEMBL_Phase-1","ID")
>
> df.head() #Should show the top of the DataFrame, with the properties and the 
> structures.
>
>
>
>
>
> All the best,
>
> --
>
> Gustavo Seabra
>
>
>
> -----Original Message-----
> From: Jeff Saxon <jmsstarli...@gmail.com>
> Sent: Tuesday, December 1, 2020 7:35 AM
> To: rdkit-discuss@lists.sourceforge.net
> Subject: [Rdkit-discuss] Applying Lipinsky filter on ligand data set
>
>
>
> Dear All,
>
>
>
> I've just started working with RDKIT focusing on the application of the 
> Lipinsky rule on the set of my ligands. Basically I take a 3D coordinates of 
> each ligand file (in SDF format) and then calculate for it required 4 
> properties Here is my code:
>
> # make a list of all .sdf filles present in data folder:
>
>     dirlist = [os.path.basename(p) for p in glob.glob('data' + '/*.sdf')]
>
>
>
>     # create empty data file with 5 columns:
>
>     # name of the file,  value of variable p, value of ac, value of don, 
> value of wt
>
>     df = pd.DataFrame(columns=["key", "p", "ac", "don", "wt"])
>
>
>
>     # for each sdf file get its name and calculate 4 different
>
> properties: p, ac, don, wt
>
> for sdf in dirlist:
>
> sdf_name=sdf.rsplit( ".", 1 )[ 0 ]
>
> key = f'{sdf_name}'
>
> mol = open(sdf,'rb')
>
> m = Chem.ForwardSDMolSupplier(mol)
>
> for conf in m:
>
> if conf is None: continue
>
> p = MolLogP(conf) # coeff conc-perm
>
> ac = CalcNumLipinskiHBA(conf)#
>
> don = CalcNumLipinskiHBD(conf)
>
> wt = MolWt(conf)
>
> #two=AllChem.Compute2DCoords(conf)
>
> Draw.MolToFile(conf,results+f'/{key}.png')
>
> #df[key] = [p, ac, don, wt]
>
>
>
> Could you suggest how can I summarize the calculation of each ligand in 
> pandas-like DF and to then apply lipinsky filter on it?
>
> Is it possible to convert 3D coordinates to 2D in order that I could draw it 
> (presently it makes a sketch based on 3d coordinates directly from SDF)?
>
>
>
>
>
> _______________________________________________
>
> Rdkit-discuss mailing list
>
> Rdkit-discuss@lists.sourceforge.net
>
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss

_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss

Re: [Rdkit-discuss] Applying Lipinsky filter on ligand data set

Reply via email to