[Rdkit-discuss] I encountered some issues while using RDKit

Wang Jialuo via Rdkit-discuss Wed, 12 Apr 2023 20:22:04 -0700

Hello Greg,
If this is not the first time you have seen this email, I'm sorry; please
ignore it. Due to network issues, I have tried sending it several times. I
found that when I attached files, the email stayed stuck in the "being
delivered" state, so I have moved most of the content into the email body.
I downloaded the Mol2 files of the ingredients from the TCMSP database
(https://old.tcmsp-e.com/tcmsp.php) and used OpenBabel to batch convert them
to .smiles files. Then I used Python to read the SMILES strings and RDKit to
standardize them. The error message is shown in Fig1.
After RDKit reported the error, I was able to open the file "MOL000001.mol2"
using BIOVIA Discovery Studio 2020, as shown in Fig2.
Its information is as follows:
> ID: MOL000001
> Name: anthocyanidin
> CAS: 84082-34-8
The TCMSP database shows the structural formula of the corresponding
molecule (Fig3).
The code that produced this error is shown in Fig4:
import pandas as pd
import numpy as np
import rdkit.Chem as Chem
from rdkit.Chem.MolStandardize import rdMolStandardize

TCMSP_ingredients = pd.read_csv(
    './Network Pharmacology/TCMSP-Spider-main/data/sample_data/ingredients_data.csv',
    encoding='gb18030')
TCMSP_ingredients

def Standardize(ingredient_id):
    mol = Chem.MolFromMol2File('./Network Pharmacology/TCMSP_MOL/{}.mol2'.format(ingredient_id))
    # removeHs, disconnect metal atoms, normalize the molecule, reionize the molecule
    clean_mol = rdMolStandardize.Cleanup(mol)
    # if many fragments, get the "parent" (the actual mol we are interested in)
    parent_clean_mol = rdMolStandardize.FragmentParent(clean_mol)
    # try to neutralize molecule
    uncharger = rdMolStandardize.Uncharger()  # annoying, but necessary as no convenience method exists
    uncharged_parent_clean_mol = uncharger.uncharge(parent_clean_mol)
    # note that no attempt is made at reionization at this step
    # nor at ionization at some pH (rdkit has no pKa calculator)
    # the main aim is to represent all molecules from different sources
    # in a (single) standard way, for use in ML, catalogue, etc.
    te = rdMolStandardize.TautomerEnumerator()  # idem
    taut_uncharged_parent_clean_mol = te.Canonicalize(uncharged_parent_clean_mol)
    return Chem.MolToSmiles(taut_uncharged_parent_clean_mol)

def read_smiles(path, ingredient_id):
    with open(path + '{}.smiles'.format(ingredient_id)) as file:
        smiles = ''
        for line in file:
            line = line.replace('\n', '')
            line_list = line.split('  ')
            smiles = line_list[0]
            # there is only one line in the smiles file, so return right away
            return smiles

def smiles_Standardize(smiles):
    mol = Chem.MolFromSmiles(smiles)
    smiles = Chem.MolToSmiles(mol, isomericSmiles=True, canonical=True)
    return smiles

ingredient_ids = TCMSP_ingredients.iloc[:, 0].tolist()
SMILES = []
i = -1
path = './Network Pharmacology/TCMSP_MOL/'
for ingredient_id in ingredient_ids:
    i += 1
    print(ingredient_id)
    print(i)
    # smiles = Standardize(ingredient_id)
    smiles = read_smiles(path, ingredient_id)
    smiles = smiles_Standardize(smiles)
    SMILES.append(smiles)



In the final code block, I commented out the call to Standardize(ingredient_id)
and used read_smiles(path, ingredient_id) followed by smiles_Standardize(smiles)
instead.
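
For reference, the file-reading step can be sketched without RDKit. This is
only a minimal sketch, assuming each .smiles file holds one record whose first
whitespace-separated field is the SMILES string (the separator may be a tab or
several spaces, depending on how the converter wrote the file). The name
first_smiles_field is hypothetical and is not part of the script in Fig4.

```python
from pathlib import Path


def first_smiles_field(path):
    """Return the first whitespace-separated field of the first non-empty line.

    Assumption: one record per file, SMILES first, optional title after it.
    Returns None when the file contains no non-empty line.
    """
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        fields = line.split()  # splits on tabs and on any run of spaces
        if fields:
            return fields[0]
    return None
```

Splitting on any whitespace avoids depending on the exact separator, and the
explicit None makes an empty file easy to detect before the string is handed
to Chem.MolFromSmiles.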

Then I changed how the same data is read and processed it with RDKit again;
this time the run stopped at "MOL000107". The error message is shown in Fig5.
The code for this error is also the one in Fig4, except that in the final code
block I commented out the read_smiles(path, ingredient_id) and
smiles_Standardize(smiles) calls and used Standardize(ingredient_id) instead.


I can obtain the information for MOL000107 from TCMSP, as shown in Fig6.
Its information is as follows:
> ID: MOL000107
> Name: quercertin,3-o-beta-d-glucopyranoside
> CAS: 482-35-9
I also tried wrapping the call in a bare try/except so that structures raising
an error are skipped. With that in place, RDKit reported errors more than 800
times across the more than 14,000 ingredients indexed in TCMSP.
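
The try/except approach mentioned above can be sketched roughly as follows.
It is a minimal sketch, not the exact script: collect_failures is a
hypothetical helper, and the standardize argument stands for any per-ID
callable such as the Standardize function in Fig4. The stand-in function in
the demo exists only so the sketch runs without RDKit or the data files.

```python
def collect_failures(ingredient_ids, standardize):
    """Apply `standardize` to every ID, keeping results and failures apart."""
    results = {}
    failures = {}
    for ingredient_id in ingredient_ids:
        try:
            results[ingredient_id] = standardize(ingredient_id)
        except Exception as exc:  # narrower than a bare `except:`
            # Keep the exception type and message so they can be reported later.
            failures[ingredient_id] = f"{type(exc).__name__}: {exc}"
    return results, failures


if __name__ == "__main__":
    def stand_in(ingredient_id):
        # Hypothetical stand-in for the real function; fails on purpose for "B".
        if ingredient_id == "B":
            raise ValueError("could not process")
        return ingredient_id.lower()

    ok, bad = collect_failures(["A", "B", "C"], stand_in)
    print(ok)
    print(bad)
```

Keeping the failing IDs together with their messages makes it possible to
report exactly which ingredients fail and why, instead of only counting them.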


I can also obtain the standardized SMILES of MOL000001 from RDKit:
[O-]c1cc2c(O)cc(O)cc2[o+]c1-c1c[c][c][c]c1
The structural formula of MOL000107 is shown in Fig7.
This behavior seems strange; there appears to be some inconsistency between
the two RDKit functions.
I also used the standardization process described above on the chemical
structures in the ChEMBL database (https://www.ebi.ac.uk/chembl/). I
downloaded the database from its official website and standardized the
compounds in the "compound_structures" table one by one. After processing
hundreds of thousands of molecules, an error occurred, as shown in Fig8.
Unfortunately, I did not keep the error message, and running it again would
take a lot of time. From memory, the error was one of the types I had seen
before: 'C++ signature', 'NoneType has no attribute GetAtoms', or
'Molecule is None'.
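
Because the error message from the long ChEMBL run was lost, a logging step
along these lines could preserve it on a future run. This is a minimal sketch
under assumptions: run_with_error_log and the log path are hypothetical, and
process stands for whatever per-molecule function is being applied.

```python
import traceback


def run_with_error_log(items, process, log_path):
    """Call `process` on each item and append any traceback to `log_path`.

    Returns the number of failed items. The log is flushed after every entry
    so that it survives an interrupted run.
    """
    n_failed = 0
    with open(log_path, "a", encoding="utf-8") as log:
        for index, item in enumerate(items):
            try:
                process(item)
            except Exception:
                n_failed += 1
                log.write(f"item {index}: {item!r}\n")
                log.write(traceback.format_exc())
                log.flush()
    return n_failed
```

Writing the full traceback next to the item index means the exact error type
and the offending record can be looked up afterwards without repeating the
whole run.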




Best,
Wang Jialuo





Address:Shenyang Pharmaceutical University, 85 Hongliu Rd., Benxi City, 
Liaoning Province, 117004, P.R.China

_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
