Hi all,
I'm chasing up a small puzzle on parsing SMILES codes, if anyone's
interested, though it's not directly about RDKit. I was looking at the molecules in
the MUTAG dataset, which is commonly used in graph learning research.
Mostly these are just shared as graphs (i.e. vertices and edges) rather
than SMILES codes, but I did find a list of SMILES at ChemDB:
http://cdb.ics.uci.edu/cgibin/LearningDatasetsWeb.py

RDKit can't parse two of the SMILES codes - indices 82 and 187 using zero
indexing - and you'll note that they are the same:

smiles[82]

'c1ccc2=NC3=CC(=CC=C3=c2c1)[N+](=O)[O-]'


smiles[187]

'c1ccc2=NC3=CC=C(C=C3=c2c1)[N+](=O)[O-]'
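For anyone who wants to reproduce the check, here is a minimal sketch of how the failing entries can be located. The helper names find_unparseable and check_with_rdkit are made up for illustration, and the sketch assumes the SMILES have already been loaded into a Python list; the only RDKit call used is Chem.MolFromSmiles, which returns None when a string cannot be parsed.

```python
from typing import Callable, List, Optional, Tuple


def find_unparseable(
    smiles_list: List[str],
    parse: Callable[[str], Optional[object]],
) -> List[Tuple[int, str]]:
    """Return (zero-based index, SMILES) for every string the parser rejects.

    `parse` is expected to return None on failure, which is how
    RDKit's Chem.MolFromSmiles signals an unparseable string.
    """
    return [(i, smi) for i, smi in enumerate(smiles_list) if parse(smi) is None]


def check_with_rdkit(smiles_list: List[str]) -> List[Tuple[int, str]]:
    """Hypothetical convenience wrapper: run the check with RDKit's parser."""
    # Imported here so the helper above stays usable without RDKit installed.
    from rdkit import Chem

    return find_unparseable(smiles_list, Chem.MolFromSmiles)
```

Calling check_with_rdkit(smiles) on the full list should report the zero-based positions of any rejected strings, which can then be compared against the 1-indexed numbering in the paper by adding one.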


I found these in the original paper* as '83 - 2-nitrocarbazole' and '188 -
3-nitrocarbazole' (1-indexed). So the SMILES codes should be similar but
not exactly the same.

Can anyone please tell me whether those SMILES codes are legit, or has there been
a transcription error? It's easy to replace these with the correct codes
from PubChem, but if you're familiar with the dataset, is it safe to trust
the other codes? Why does the paper have 197 molecules but the dataset only
has 188?

Thanks for your time!


PubChem for 2-nitrocarbazole:
https://pubchem.ncbi.nlm.nih.gov/compound/99612#section=InChI
PubChem for 3-nitrocarbazole:
https://pubchem.ncbi.nlm.nih.gov/compound/3-Nitrocarbazole



*The original paper is "Structure-activity relationship of mutagenic aromatic
and heteroaromatic nitro compounds. Correlation with molecular orbital
energies and hydrophobicity": https://www.ncbi.nlm.nih.gov/pubmed/1995902
_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss