Hello Greg,
If you have already seen this email, I apologize; please ignore this copy.
Because of network issues, I had to try sending it several times. Messages
with attachments stayed stuck in the "being delivered" state, so I have
condensed most of the content into the email text.
My data come from the TCMSP database (https://old.tcmsp-e.com/tcmsp.php). I
downloaded the Mol2 files of the ingredients and used OpenBabel to
batch-convert them to .smiles files. Then I used Python to read the SMILES
strings and RDKit to standardize them. The error message is shown in Fig. 1.
After RDKit reported the error, I was able to open the file "MOL000001.mol2"
in BIOVIA Discovery Studio 2020, as shown in Fig. 2.
Its information is as follows:
>ID:MOL000001
>Name:anthocyanidin
>CAS:84082-34-8
In the TCMSP database, I can obtain the structural formula of the corresponding
molecule, as shown in Figure 3.
The code that produced this error is shown in Fig. 4:
import pandas as pd
import numpy as np
import rdkit.Chem as Chem
from rdkit.Chem.MolStandardize import rdMolStandardize

TCMSP_ingredients=pd.read_csv('./Network Pharmacology/TCMSP-Spider-main/data/sample_data/ingredients_data.csv',encoding='gb18030')
TCMSP_ingredients

def Standardize(ingredient_id):
    mol=Chem.MolFromMol2File('./Network Pharmacology/TCMSP_MOL/{}.mol2'.format(ingredient_id))
    # removeHs, disconnect metal atoms, normalize the molecule, reionize the molecule
    clean_mol = rdMolStandardize.Cleanup(mol)
    # if many fragments, get the "parent" (the actual mol we are interested in)
    parent_clean_mol = rdMolStandardize.FragmentParent(clean_mol)
    # try to neutralize molecule
    uncharger = rdMolStandardize.Uncharger() # annoying, but necessary as no convenience method exists
    uncharged_parent_clean_mol = uncharger.uncharge(parent_clean_mol)
    # note that no attempt is made at reionization at this step
    # nor at ionization at some pH (rdkit has no pKa calculator)
    # the main aim to represent all molecules from different sources
    # in a (single) standard way, for use in ML, catalogue, etc.
    te = rdMolStandardize.TautomerEnumerator() # idem
    taut_uncharged_parent_clean_mol = te.Canonicalize(uncharged_parent_clean_mol)
    return Chem.MolToSmiles(taut_uncharged_parent_clean_mol)

def read_smiles(path,ingredient_id):
    with open(path+'{}.smiles'.format(ingredient_id)) as file:
        smiles=''
        # there is only one line in each .smiles file
        for line in file:
            line=line.replace('\n','')
            line_list=line.split(' ')
            smiles=line_list[0]
            return smiles

def smiles_Standardize(smiles):
    mol=Chem.MolFromSmiles(smiles)
    smiles=Chem.MolToSmiles(mol,isomericSmiles=True,canonical=True)
    return smiles

ingredient_ids=TCMSP_ingredients.iloc[:,0].tolist()
SMILES=[]
i=-1
path='./Network Pharmacology/TCMSP_MOL/'
for ingredient_id in ingredient_ids:
    i+=1
    print(ingredient_id)
    print(i)
    # smiles=Standardize(ingredient_id)
    smiles=read_smiles(path,ingredient_id)
    smiles=smiles_Standardize(smiles)
    SMILES.append(smiles)
In the final loop of this code, I commented out the line
"smiles=Standardize(ingredient_id)" and used these two lines instead:
smiles=read_smiles(path,ingredient_id)
smiles=smiles_Standardize(smiles)
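The read_smiles function in the listing splits each line on a single space. If the converted files separate the SMILES from the molecule title with a tab rather than a space (an assumption worth checking against the actual files), the title would stay attached to the SMILES. A minimal sketch of a more tolerant reader follows; read_first_smiles and the demo record are hypothetical and not part of the original code.

```python
import os
import tempfile


def read_first_smiles(path):
    """Return the first whitespace-delimited field of the first non-empty line.

    Returns None when the file has no usable line, so the caller can
    detect an empty file explicitly.
    """
    with open(path) as handle:
        for line in handle:
            fields = line.split()  # splits on spaces and tabs, drops the newline
            if fields:
                return fields[0]
    return None


# Demonstration on temporary files with a made-up record.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.smiles")
    with open(demo_path, "w") as handle:
        handle.write("CCO\tethanol\n")
    demo_smiles = read_first_smiles(demo_path)

    empty_path = os.path.join(tmp, "empty.smiles")
    open(empty_path, "w").close()
    empty_result = read_first_smiles(empty_path)
```

Returning None for an empty file lets the calling loop skip or log that ingredient instead of passing an unexpected value on to RDKit.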
Then I changed the input format for the same data and processed it with RDKit
again; this time it ran until it stopped at "MOL000107". The error message is
shown in Fig. 5.
The code for this error is also the one in Fig. 4, except that in the final
loop I commented out the lines
smiles=read_smiles(path,ingredient_id)
smiles=smiles_Standardize(smiles)
and used "smiles=Standardize(ingredient_id)" instead.
I can obtain the information for MOL000107 from TCMSP, as shown in Fig. 6.
Its information is as follows:
>ID:MOL000107
>Name:quercertin,3-o-beta-d-glucopyranoside
>CAS:482-35-9
I also tried wrapping the call in a "try: ... except:" block to skip the
structures that raise errors. With this in place, RDKit reported errors more
than 800 times across the more than 14,000 ingredients indexed in TCMSP.
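A try/except that only skips failures makes it hard to see which structures failed and why. A minimal sketch of a loop that records the failures instead, assuming the standardizer is passed in as a function; run_batch, fake_standardize, and the IDs below are hypothetical stand-ins and not part of the original code.

```python
def run_batch(ingredient_ids, standardize):
    """Apply `standardize` to every ID and keep results and failures apart."""
    results = {}
    failures = {}
    for ingredient_id in ingredient_ids:
        try:
            smiles = standardize(ingredient_id)
        except Exception as exc:
            # Keep the error type and text instead of discarding them.
            failures[ingredient_id] = f"{type(exc).__name__}: {exc}"
            continue
        if smiles is None:
            failures[ingredient_id] = "no molecule returned"
        else:
            results[ingredient_id] = smiles
    return results, failures


# Stand-in standardizer, used only to demonstrate the bookkeeping.
def fake_standardize(ingredient_id):
    if ingredient_id == "ID_RAISES":
        raise ValueError("could not parse structure")
    if ingredient_id == "ID_NONE":
        return None
    return "CCO"


results, failures = run_batch(["ID_OK", "ID_RAISES", "ID_NONE"], fake_standardize)
```

In the real script, the Standardize function from the listing could be passed as the second argument, and the failures dictionary written to a file so that the failing entries can be inspected together.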
I can also obtain the standardized SMILES of MOL000001 from RDKit:
[O-]c1cc2c(O)cc(O)cc2[o+]c1-c1c[c][c][c]c1
The structural formula of MOL000107 is shown in Fig. 7.
This behavior seems strange; there appears to be some inconsistency between
the two RDKit functions used to read the structures (Chem.MolFromMol2File and
Chem.MolFromSmiles).
I also used the standardization process described above on the chemical
structures in the ChEMBL database (https://www.ebi.ac.uk/chembl/). I
downloaded the database from the official website and standardized the
compounds in the "compound_structures" table one by one. After processing
hundreds of thousands of molecules, it encountered an error, as shown in
Figure 8. Unfortunately, I did not keep the error message, and running it
again would take a lot of time. As far as I remember, the error type was one
I had seen before: either 'C++ signature', 'NoneType has no attribute
GetAtoms', or 'Molecule is None'.
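The error types recalled above would be consistent with a reader such as Chem.MolFromSmiles or Chem.MolFromMol2File returning None for an input it cannot parse or sanitize, and that None then being passed on to the standardizer. This is an assumption, since the original messages were not kept. A minimal sketch of a guard that reuses the standardization steps from the listing; the name standardize_mol is hypothetical.

```python
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize


def standardize_mol(mol):
    """Return a standardized canonical SMILES, or None if `mol` is None.

    Hypothetical helper; the steps mirror the Standardize function
    in the listing.
    """
    if mol is None:  # the reader could not parse or sanitize the input
        return None
    clean_mol = rdMolStandardize.Cleanup(mol)
    parent = rdMolStandardize.FragmentParent(clean_mol)
    uncharged = rdMolStandardize.Uncharger().uncharge(parent)
    canonical = rdMolStandardize.TautomerEnumerator().Canonicalize(uncharged)
    return Chem.MolToSmiles(canonical)


# An unparsable SMILES makes the reader return None instead of raising.
bad = standardize_mol(Chem.MolFromSmiles("not_a_smiles"))
good = standardize_mol(Chem.MolFromSmiles("CCO"))
```

With a guard like this, a long run over a large table can log the identifiers that come back as None and continue, rather than stopping at the first unreadable structure.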
Best,
Wang Jialuo
Address: Shenyang Pharmaceutical University, 85 Hongliu Rd., Benxi City,
Liaoning Province, 117004, P.R. China
_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss