Yes, the way it is written it will only keep the last sdf file read. I can think of 2 options:
1. You can concatenate all SDFs into one multi-molecule file:

    $ cat *.sdf > multi.sdf

   and read that one.

2. Alternatively, instead of overwriting the final pandas DataFrame every time, you can create one initial DataFrame and then concatenate it with the results of the function (see https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html):

    data = pd.DataFrame(columns=['Source','LogP','MolWt','LipinskyHBA','LipinskyHBD'])

   Then, for each file:

    data = pd.concat([data, load_sdf_file(sdf, key)], ignore_index=True)

   (pd.concat is used here because DataFrame.append was removed in pandas 2.0.)

If possible, I believe option (1) should be faster.

As for the error you are seeing: sometimes RDKit cannot read a molecule, so the resulting DataFrame has no 'ROMol' column. This usually happens when the molecule is ill-defined. If you really need to read the molecules one by one, you will need to handle this situation, maybe with an 'if' statement in the function. If you read a multi-molecule SDF, it just ignores the molecules it can't read and keeps going.

Ah, I don't think there is a function to use PDB files with Pandas. SDF is a better format for small molecules, anyway.

All the best,
--
Gustavo Seabra

________________________________
From: Jeff Saxon <jmsstarli...@gmail.com>
Sent: Wednesday, December 2, 2020 4:53:05 AM
To: Gustavo Seabra <gustavo.sea...@gmail.com>; rdkit-discuss@lists.sourceforge.net <rdkit-discuss@lists.sourceforge.net>
Subject: Re: [Rdkit-discuss] Applying Lipinsky filter on ligand data set

Hey Gustavo,

Thank you very much for your script! I need to specify that I am working with many SDF files, each of which consists of one 3D structure of a ligand (I don't see any difference from PDB here, so if I could apply it to PDB files directly, that would be even better!).

Anyway, I've just tried to adapt your script for my case:

    # I simplified the function to take only the 4 properties required for the
    # Lipinski calculations. I also replaced Source with the name of the
    # particular SDF file (see below).
    def load_sdf_file(file, key):
        """
        Reads molecules from an SDF file keeping only molecules
        with valid SMILES, and assign a source field
        """
        df = PandasTools.LoadSDF(file)
        df['Source'] = key
        df['LogP'] = df['ROMol'].apply(Chem.Descriptors.MolLogP)
        df['MolWt'] = df['ROMol'].apply(Chem.Descriptors.MolWt)
        df['LipinskyHBA'] = df['ROMol'].apply(Chem.rdMolDescriptors.CalcNumLipinskiHBA)
        df['LipinskyHBD'] = df['ROMol'].apply(Chem.rdMolDescriptors.CalcNumLipinskiHBD)
        df = df[['Source','LogP','MolWt','LipinskyHBA','LipinskyHBD']]
        return df

    pwd = os.getcwd()
    filles = 'sdf'
    results = 'results'
    # set directory to analyse
    data = os.path.join(pwd, filles)
    # set directory with outputs
    results = os.path.join(pwd, results)
    # go to the folder with all SDF files
    os.chdir(data)
    # loop over each SDF and pass it to the function
    for sdf in dirlist:
        sdf_name = sdf.rsplit(".", 1)[0]
        key = f'{sdf_name}'
        df = load_sdf_file(sdf, key)
        print(f'{sdf_name}.sdf has been processed')

The problem is that the DataFrame always ends up holding only the last processed file, while I need to append the result of each processed SDF file.
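
One way to keep every file's result rather than only the last one, as a minimal sketch: collect the per-file DataFrames in a list and concatenate them once at the end. The sketch also skips any file whose loader raises KeyError, which is the exception shown in the traceback in this thread. `collect_frames` and `demo_loader` are hypothetical names introduced for illustration; `demo_loader` is only a stand-in for load_sdf_file so that the example runs without RDKit, and its numbers are made up.

```python
import os
import pandas as pd

COLUMNS = ['Source', 'LogP', 'MolWt', 'LipinskyHBA', 'LipinskyHBD']


def collect_frames(paths, loader):
    """Call loader(path, key) for each path and stack the results.

    Returns (combined DataFrame, list of skipped paths).
    Hypothetical helper, not part of RDKit or pandas.
    """
    frames, skipped = [], []
    for path in paths:
        # Use the file name without its extension as the key
        key = os.path.splitext(os.path.basename(path))[0]
        try:
            frames.append(loader(path, key))
        except KeyError:
            # e.g. KeyError: 'ROMol' when no molecule could be read
            skipped.append(path)
    if not frames:
        return pd.DataFrame(columns=COLUMNS), skipped
    return pd.concat(frames, ignore_index=True), skipped


def demo_loader(path, key):
    """Stand-in for load_sdf_file (assumption: same call signature)."""
    if 'broken' in path:
        raise KeyError('ROMol')
    # Illustrative values only
    return pd.DataFrame([[key, 1.0, 100.0, 2, 1]], columns=COLUMNS)


data, skipped = collect_frames(['lig1.sdf', 'broken.sdf', 'lig2.sdf'], demo_loader)
print(data['Source'].tolist())  # ['lig1', 'lig2']
print(skipped)                  # ['broken.sdf']
```

In the real script, load_sdf_file would be passed in place of demo_loader. Concatenating once at the end avoids copying the growing DataFrame on every iteration.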

Also, I got an error on one of the SDF files, which interrupted the script:

    Traceback (most recent call last):
      File "./lipinski2.py", line 67, in <module>
        df = load_sdf_file(sdf,key)
      File "./lipinski2.py", line 26, in load_sdf_file
        df['LogP'] = df['ROMol'].apply(Chem.Descriptors.MolLogP)
      File "/Users/gleb/opt/miniconda3/envs/my-rdkit-env/lib/python3.7/site-packages/pandas/core/frame.py", line 2906, in __getitem__
        indexer = self.columns.get_loc(key)
      File "/Users/gleb/opt/miniconda3/envs/my-rdkit-env/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2897, in get_loc
        raise KeyError(key) from err
    KeyError: 'ROMol'

Probably an additional if statement is required to skip the file in the case of a "broken" SDF...

On Tue, Dec 1, 2020 at 19:07, Gustavo Seabra <gustavo.sea...@gmail.com> wrote:
> Hi Jeff,
>
> There are a lot of people here with way more experience than me, so this may
> not be the optimal solution... But here is what I would do in this case:
>
> from rdkit import Chem, DataStructs
> from rdkit.Chem import Draw, PandasTools, Descriptors, rdMolDescriptors
> from IPython.display import HTML
>
> def load_sdf_file(file,source,id_column):
>     """
>     Reads molecules from an SDF file keeping only molecules
>     with valid SMILES, and assign a source field
>     """
>     df = PandasTools.LoadSDF(file)
>     df['Source'] = source
>     df['ID'] = df[id_column]
>     df['SMILES'] = df['ROMol'].apply(Chem.MolToSmiles)
>     df['LogP'] = df['ROMol'].apply(Chem.Descriptors.MolLogP)
>     df['MolWt'] = df['ROMol'].apply(Chem.Descriptors.MolWt)
>     df['LipinskyHBA'] = df['ROMol'].apply(Chem.rdMolDescriptors.CalcNumLipinskiHBA)
>     df['LipinskyHBD'] = df['ROMol'].apply(Chem.rdMolDescriptors.CalcNumLipinskiHBD)
>
>     df = df[['Source','ID','SMILES','LogP','MolWt','LipinskyHBA','LipinskyHBD','ROMol']]
>     return df
>
> df = load_sdf_file("chembl-26_phase-1.sdf","ChEMBL_Phase-1","ID")
> df.head()  # Should show the top of the DataFrame, with the properties and the structures.
>
> All the best,
> --
> Gustavo Seabra
>
> -----Original Message-----
> From: Jeff Saxon <jmsstarli...@gmail.com>
> Sent: Tuesday, December 1, 2020 7:35 AM
> To: rdkit-discuss@lists.sourceforge.net
> Subject: [Rdkit-discuss] Applying Lipinsky filter on ligand data set
>
> Dear All,
>
> I've just started working with RDKit, focusing on applying the Lipinski rule
> to my set of ligands. Basically, I take the 3D coordinates of each ligand file
> (in SDF format) and then calculate the 4 required properties for it. Here is
> my code:
>
> # make a list of all .sdf files present in the data folder:
> dirlist = [os.path.basename(p) for p in glob.glob('data' + '/*.sdf')]
>
> # create an empty DataFrame with 5 columns:
> # name of the file, value of variable p, value of ac, value of don, value of wt
> df = pd.DataFrame(columns=["key", "p", "ac", "don", "wt"])
>
> # for each sdf file get its name and calculate 4 different
> # properties: p, ac, don, wt
> for sdf in dirlist:
>     sdf_name = sdf.rsplit(".", 1)[0]
>     key = f'{sdf_name}'
>     mol = open(sdf,'rb')
>     m = Chem.ForwardSDMolSupplier(mol)
>     for conf in m:
>         if conf is None: continue
>         p = MolLogP(conf)  # coeff conc-perm
>         ac = CalcNumLipinskiHBA(conf)
>         don = CalcNumLipinskiHBD(conf)
>         wt = MolWt(conf)
>         #two=AllChem.Compute2DCoords(conf)
>         Draw.MolToFile(conf,results+f'/{key}.png')
>         #df[key] = [p, ac, don, wt]
>
> Could you suggest how I can summarize the calculations for each ligand in a
> pandas-like DataFrame and then apply the Lipinski filter to it?
>
> Is it possible to convert the 3D coordinates to 2D so that I can draw the
> molecule (at present it makes a sketch based on the 3D coordinates taken
> directly from the SDF)?
>
> _______________________________________________
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
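
The original question also asks how to apply the Lipinski filter once the properties are in a DataFrame. A minimal sketch, assuming the column names produced by load_sdf_file in this thread and the commonly cited rule-of-five limits (MolWt <= 500, LogP <= 5, at most 5 H-bond donors, at most 10 H-bond acceptors). `lipinski_pass` and its `max_violations` parameter are hypothetical names introduced here, and the values in the example frame are made up for illustration, not computed from real molecules.

```python
import pandas as pd


def lipinski_pass(df, max_violations=0):
    """Return the rows of df that break at most max_violations limits."""
    # Each comparison yields a boolean Series; summing them as integers
    # counts how many of the four limits a molecule exceeds.
    violations = (
        (df['MolWt'] > 500).astype(int)
        + (df['LogP'] > 5).astype(int)
        + (df['LipinskyHBD'] > 5).astype(int)
        + (df['LipinskyHBA'] > 10).astype(int)
    )
    return df[violations <= max_violations]


# Illustrative values only
data = pd.DataFrame({
    'Source': ['lig1', 'lig2', 'lig3'],
    'LogP': [2.1, 6.3, 4.9],
    'MolWt': [320.4, 612.8, 498.0],
    'LipinskyHBA': [5, 12, 10],
    'LipinskyHBD': [2, 6, 5],
})

print(lipinski_pass(data)['Source'].tolist())  # ['lig1', 'lig3']
```

Setting max_violations=1 gives the more lenient variant that tolerates a single exceeded limit.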