OK, thank you! One question: I have mentioned about it in the first topic: may your function also be used to draw the 2D scratch of each ligand? I am not sure if I can take something from the data fille, that was created to store values for Lipinski calculations. Need I consider SMILES column for it ?? That's how I did it in my first script (the problem was that it consider 3D coordinates directly, while I was looking for the possibility to convert it to 2D):
for conf in m: if conf is None: continue # draw ligand takin conformation directly from 3D file Draw.MolToFile(conf,results+f'/{key}.png') ср, 2 дек. 2020 г. в 15:44, Gustavo Seabra <gustavo.sea...@gmail.com>: > > Great, I'm glad it works for you now. > > As for the fikes that don't work, you could try loading them individually to > look into them, or save the molecules again. > > If you could share the molecules here, maybe someone could find what is the > problem. (I'd recommend starting a new thread for it) > > All the best, > Gustavo. > > -- > Gustavo Seabra > > ________________________________ > From: Jeff Saxon <jmsstarli...@gmail.com> > Sent: Wednesday, December 2, 2020 9:37:01 AM > To: Gustavo Seabra <gustavo.sea...@gmail.com>; > rdkit-discuss@lists.sourceforge.net <rdkit-discuss@lists.sourceforge.net> > Subject: Re: [Rdkit-discuss] Applying Lipinsky filter on ligand data set > > Thank you again, Gustato! > > Here is how I adopted your script for multi-SDF filles: > Note that I added directly to the script, a new datafile called 'All', > into which I append each of the datafiles produced by your function > using FOR loop .. > Also I added TRY statement within FOR loop to ignore these two SDF > caused a problem. However, I have no idea why they don't work (there > are 2 filles from 1000, which in Pymol looks fine!) > > > import subprocess, os, glob, shutil, sys > import pandas as pd > > from rdkit import Chem, DataStructs > from rdkit.Chem import Draw, PandasTools, Descriptors, rdMolDescriptors, > AllChem > from IPython.display import HTML > > # the main function > def load_sdf_file(file, key): > """ > Reads molecules from an SDF file keeping only molecules > with valid SMILES, and assign a source field > """ > df = PandasTools.LoadSDF(file) > df['LIGAND'] = key > #df['SMILES'] = df['ROMol'].apply(Chem.MolToSmiles) > df['LogP'] = df['ROMol'].apply(Chem.Descriptors.MolLogP) > df['MolWt'] = df['ROMol'].apply(Chem.Descriptors.MolWt) > df['HBA'] = df['ROMol'].apply(Chem.rdMolDescriptors.CalcNumLipinskiHBA) > df['HBD'] = df['ROMol'].apply(Chem.rdMolDescriptors.CalcNumLipinskiHBD) > df = df[['LIGAND','LogP','MolWt','HBA','HBD']] > return df > > > pwd = os.getcwd() > filles='sdf' > results='results' > #set directory to analyse > data = os.path.join(pwd,filles) > #set directory with outputs > results = os.path.join(pwd,results) > > os.chdir(data) > > all = pd.DataFrame() > for sdf in dirlist: > try: > sdf_name=sdf.rsplit( ".", 1 )[ 0 ] > key = f'{sdf_name}' > df = load_sdf_file(sdf,key) > all = all.append(df,ignore_index = True) > print(f'{sdf_name}.sdf has been processed') > except: > print(f'{sdf_name}.sdf has not been processed') > # make a log of broken sdf filles > with open(results+"/log.txt", "a") as log: > log.write("%s has not been processed\n" %(key)) > > ср, 2 дек. 2020 г. в 13:55, Gustavo Seabra <gustavo.sea...@gmail.com>: > > > > Yes, the way it is written it will only keep the last sdf file read. I can > > think of 2 options: > > > > 1. You can concatenate all sdfs into one, multi-molecule file: > > $ cat *.sdf > multi.sdf > > > > And read this one. > > > > 2. Alternatively, instead of overwriting the final pandas dataframe every > > time, you can create one initial df then only concatenate it with the > > results of the function (see > > https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html) > > > > data = > > pd.DataFrame(columns=['Source','LogP','MolWt','LipinskyHBA','LipinskyHBD]) > > > > Then, for each file: > > data = data.append(load_sdf_file(sdf,key)) > > > > If possible, I believe option (1) should be faster. > > > > > > As for the error you are seeing, sometimes RDKit cannot read a molecule, > > so it returns no 'ROMol' object. It usually happens when the molecule is > > ill-defined. If you really need to read the molecules one-by-one, then you > > will need to treat this situation maybe with an 'if' statement in the > > function. If you read a multi-molecule sdf, it just ignores the molecules > > it can't read and keeps going. > > > > Ah, I dont think there is a function to use pdb files with Pandas. SDF is a > > better format for small molecules, anyway. > > > > All the best, > > > > -- > > Gustavo Seabra > > > > ________________________________ > > From: Jeff Saxon <jmsstarli...@gmail.com> > > Sent: Wednesday, December 2, 2020 4:53:05 AM > > To: Gustavo Seabra <gustavo.sea...@gmail.com>; > > rdkit-discuss@lists.sourceforge.net <rdkit-discuss@lists.sourceforge.net> > > Subject: Re: [Rdkit-discuss] Applying Lipinsky filter on ligand data set > > > > Hey Gustavo, > > > > Thank you very much for your script! > > I need to specify that I am working with many SDF filles, each of > > which consist of one 3D structure of the ligand ( I don't see any > > difference here between pdb, so if I can apply it on PDB directly it > > would be rather better!!) > > Anyway I've just tried to adapt you script for my case > > > > # I simplify the function to take only 4 properties required for > > lipinsky calculations, > > # I also substitute Source on the name of the particular SDF file (See > > below) > > def load_sdf_file(file, key): > > """ > > Reads molecules from an SDF file keeping only molecules > > with valid SMILES, and assign a source field > > """ > > df = PandasTools.LoadSDF(file) > > df['Source'] = key > > df['LogP'] = df['ROMol'].apply(Chem.Descriptors.MolLogP) > > df['MolWt'] = df['ROMol'].apply(Chem.Descriptors.MolWt) > > df['LipinskyHBA'] = > > df['ROMol'].apply(Chem.rdMolDescriptors.CalcNumLipinskiHBA) > > df['LipinskyHBD'] = > > df['ROMol'].apply(Chem.rdMolDescriptors.CalcNumLipinskiHBD) > > df = df[['Source','LogP','MolWt','LipinskyHBA','LipinskyHBD']] > > return df > > > > > > pwd = os.getcwd() > > filles='sdf' > > results='results' > > #set directory to analyse > > data = os.path.join(pwd,filles) > > #set directory with outputs > > results = os.path.join(pwd,results) > > > > # go to the folder with all SDF filles > > os.chdir(data) > > > > # loop each SDF and use it with the function > > for sdf in dirlist: > > sdf_name=sdf.rsplit( ".", 1 )[ 0 ] > > key = f'{sdf_name}' > > df = load_sdf_file(sdf,key) > > print(f'{sdf_name}.sdf has been processed') > > > > The problem is that it always stores the last line within DF, while I > > need rather to append each processed SDF file. Also I've got an error > > on one of the sdf file which interrupted the script: > > > > Traceback (most recent call last): > > > > File "./lipinski2.py", line 67, in <module> > > > > df = load_sdf_file(sdf,key) > > > > File "./lipinski2.py", line 26, in load_sdf_file > > > > df['LogP'] = df['ROMol'].apply(Chem.Descriptors.MolLogP) > > > > File > > "/Users/gleb/opt/miniconda3/envs/my-rdkit-env/lib/python3.7/site-packages/pandas/core/frame.py", > > line 2906, in __getitem__ > > > > indexer = self.columns.get_loc(key) > > > > File > > "/Users/gleb/opt/miniconda3/envs/my-rdkit-env/lib/python3.7/site-packages/pandas/core/indexes/base.py", > > line 2897, in get_loc > > > > raise KeyError(key) from err > > > > KeyError: 'ROMol' > > > > Probably some additional IF statement is required to ignore the file > > in the case of "broken" SDF... > > > > вт, 1 дек. 2020 г. в 19:07, Gustavo Seabra <gustavo.sea...@gmail.com>: > > > > > > Hi Jeff, > > > > > > > > > > > > There's a lot f people here with way more experience than me, so this may > > > not be the optimal solution... But here is what I would do in this case: > > > > > > > > > > > > from rdkit import Chem, DataStructs > > > > > > from rdkit.Chem import Draw, PandasTools, Descriptors, rdMolDescriptors > > > > > > from IPython.display import HTML > > > > > > > > > > > > def load_sdf_file(file,source,id_column): > > > > > > """ > > > > > > Reads molecules from an SDF file keeping only molecules > > > > > > with valid SMILES, and assign a source field > > > > > > """ > > > > > > df = PandasTools.LoadSDF(file) > > > > > > df['Source'] = source > > > > > > df['ID'] = df[id_column] > > > > > > df['SMILES'] = df['ROMol'].apply(Chem.MolToSmiles) > > > > > > df['LogP'] = df['ROMol'].apply(Chem.Descriptors.MolLogP) > > > > > > df['MolWt'] = df['ROMol'].apply(Chem.Descriptors.MolWt) > > > > > > df['LipinskyHBA'] = > > > df['ROMol'].apply(Chem.rdMolDescriptors.CalcNumLipinskiHBA) > > > > > > df['LipinskyHBD'] = > > > df['ROMol'].apply(Chem.rdMolDescriptors.CalcNumLipinskiHBD) > > > > > > > > > > > > df = > > > df[['Source','ID','SMILES','LogP','MolWt','LipinskyHBA','LipinskyHBD','ROMol']] > > > > > > return df > > > > > > > > > > > > df = load_sdf_file("chembl-26_phase-1.sdf","ChEMBL_Phase-1","ID") > > > > > > df.head() #Should show the top of the DataFrame, with the properties and > > > the structures. > > > > > > > > > > > > > > > > > > All the best, > > > > > > -- > > > > > > Gustavo Seabra > > > > > > > > > > > > -----Original Message----- > > > From: Jeff Saxon <jmsstarli...@gmail.com> > > > Sent: Tuesday, December 1, 2020 7:35 AM > > > To: rdkit-discuss@lists.sourceforge.net > > > Subject: [Rdkit-discuss] Applying Lipinsky filter on ligand data set > > > > > > > > > > > > Dear All, > > > > > > > > > > > > I've just started working with RDKIT focusing on the application of the > > > Lipinsky rule on the set of my ligands. Basically I take a 3D coordinates > > > of each ligand file (in SDF format) and then calculate for it required 4 > > > properties Here is my code: > > > > > > # make a list of all .sdf filles present in data folder: > > > > > > dirlist = [os.path.basename(p) for p in glob.glob('data' + '/*.sdf')] > > > > > > > > > > > > # create empty data file with 5 columns: > > > > > > # name of the file, value of variable p, value of ac, value of don, > > > value of wt > > > > > > df = pd.DataFrame(columns=["key", "p", "ac", "don", "wt"]) > > > > > > > > > > > > # for each sdf file get its name and calculate 4 different > > > > > > properties: p, ac, don, wt > > > > > > for sdf in dirlist: > > > > > > sdf_name=sdf.rsplit( ".", 1 )[ 0 ] > > > > > > key = f'{sdf_name}' > > > > > > mol = open(sdf,'rb') > > > > > > m = Chem.ForwardSDMolSupplier(mol) > > > > > > for conf in m: > > > > > > if conf is None: continue > > > > > > p = MolLogP(conf) # coeff conc-perm > > > > > > ac = CalcNumLipinskiHBA(conf)# > > > > > > don = CalcNumLipinskiHBD(conf) > > > > > > wt = MolWt(conf) > > > > > > #two=AllChem.Compute2DCoords(conf) > > > > > > Draw.MolToFile(conf,results+f'/{key}.png') > > > > > > #df[key] = [p, ac, don, wt] > > > > > > > > > > > > Could you suggest how can I summarize the calculation of each ligand in > > > pandas-like DF and to then apply lipinsky filter on it? > > > > > > Is it possible to convert 3D coordinates to 2D in order that I could draw > > > it (presently it makes a sketch based on 3d coordinates directly from > > > SDF)? > > > > > > > > > > > > > > > > > > _______________________________________________ > > > > > > Rdkit-discuss mailing list > > > > > > Rdkit-discuss@lists.sourceforge.net > > > > > > https://lists.sourceforge.net/lists/listinfo/rdkit-discuss _______________________________________________ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss