Hi all, I'm chasing up a small puzzle on parsing SMILES codes, if anyone's interested, though it's not directly about RDKit. I was looking at the molecules in the MUTAG dataset, which is commonly used in graph learning research. Mostly these are just shared as graphs (i.e. vertices and edges) rather than SMILES codes, but I did find a list of SMILES at ChemDB: http://cdb.ics.uci.edu/cgibin/LearningDatasetsWeb.py
RDKit can't parse two of the SMILES codes - indices 82 and 187 using zero indexing - and you'll note that they are almost the same:

smiles[82]  'c1ccc2=NC3=CC(=CC=C3=c2c1)[N+](=O)[O-]'
smiles[187] 'c1ccc2=NC3=CC=C(C=C3=c2c1)[N+](=O)[O-]'

I found these in the original paper* as '83 - 2-nitrocarbazole' and '188 - 3-nitrocarbazole' (1-indexed). So the SMILES codes should be similar but not exactly the same.

Can anyone please tell me whether those SMILES codes are legitimate, or has there been a transcription error? It's easy to replace these with the correct codes from PubChem, but if you're familiar with the dataset, is it safe to trust the other codes? And why does the paper have 197 molecules when the dataset only has 188?

Thanks for your time!

PubChem for 2-nitrocarbazole: https://pubchem.ncbi.nlm.nih.gov/compound/99612#section=InChI
PubChem for 3-nitrocarbazole: https://pubchem.ncbi.nlm.nih.gov/compound/3-Nitrocarbazole

*The original paper is "Structure-activity relationship of mutagenic aromatic and heteroaromatic nitro compounds. Correlation with molecular orbital energies and hydrophobicity" https://www.ncbi.nlm.nih.gov/pubmed/1995902
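In case anyone wants to reproduce the check, here is a minimal sketch of how the failing entries can be located. It assumes RDKit is installed and that the ChemDB SMILES have already been loaded into a Python list; `find_unparseable` is just a hypothetical helper name. It relies only on the fact that `Chem.MolFromSmiles` returns `None` when it cannot parse a string.

```python
from typing import Callable, List, Optional, Tuple


def find_unparseable(
    smiles_list: List[str],
    parse: Optional[Callable[[str], object]] = None,
) -> List[Tuple[int, str]]:
    """Return (zero-based index, SMILES) for every string the parser rejects.

    `parse` is any callable that returns None on failure. If it is omitted,
    RDKit's Chem.MolFromSmiles is used (assumption: RDKit is installed).
    """
    if parse is None:
        # Imported lazily so the helper can also be used with another parser.
        from rdkit import Chem

        parse = Chem.MolFromSmiles

    # MolFromSmiles signals a parse or sanitization failure by returning None.
    return [(i, smi) for i, smi in enumerate(smiles_list) if parse(smi) is None]
```

Calling `find_unparseable(smiles)` on the full list should report the same two indices mentioned above, if the list matches mine. Passing `parse=lambda s: Chem.MolFromSmiles(s, sanitize=False)` instead skips sanitization, which may help tell a syntax problem apart from a valence or kekulization problem.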