Note, the location of the first opening parenthesis is different:

>>> 'c1ccc2=NC3=CC(=CC=C3=c2c1)[N+](=O)[O-]'.find('(')
13
>>> 'c1ccc2=NC3=CC=C(C=C3=c2c1)[N+](=O)[O-]'.find('(')
15
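A quick way to confirm that the two strings really differ, and where, is a plain-Python comparison. This is only a minimal sketch: the helper name first_difference is made up for illustration and is not part of RDKit or of the original thread.

```python
# Hypothetical helper (not from the thread or from RDKit):
# find the first character position at which two strings differ.
def first_difference(a: str, b: str) -> int:
    """Return the first index where a and b differ, or -1 if they are identical."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # No mismatch in the shared prefix: either identical, or one is a prefix of the other.
    return -1 if len(a) == len(b) else min(len(a), len(b))


# The two MUTAG SMILES quoted in this thread (zero-based indices 82 and 187).
smiles_82 = 'c1ccc2=NC3=CC(=CC=C3=c2c1)[N+](=O)[O-]'
smiles_187 = 'c1ccc2=NC3=CC=C(C=C3=c2c1)[N+](=O)[O-]'

idx = first_difference(smiles_82, smiles_187)
print(smiles_82 == smiles_187)                # False: the strings are not the same
print(idx, smiles_82[idx], smiles_187[idx])   # position and the two differing characters
```

The strings have the same length and the same leading characters, which is probably why they look identical at a glance; the difference is where the nitro-bearing branch opens.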
So the SMILES are syntactically valid representations of 2- and 3-nitrocarbazole, though semantically odd because they mix Kekulé and aromatic notation. I would expect something in either of these two forms, pulled from the PubChem website:

In [3]: Chem.CanonSmiles('C1=CC=C2C(=C1)C3=C(N2)C=CC(=C3)[N+](=O)[O-]')
Out[3]: 'O=[N+]([O-])c1ccc2[nH]c3ccccc3c2c1'

In [4]: Chem.CanonSmiles('C1=CC=C2C(=C1)C3=C(N2)C=C(C=C3)[N+](=O)[O-]')
Out[4]: 'O=[N+]([O-])c1ccc2c(c1)[nH]c1ccccc12'

The two original SMILES seem to be trying to represent a protonation state with the nitrogen deprotonated, causing RDKit to assign carbon 5 a valence of 5 ('=NC3='):

In [5]: Chem.CanonSmiles('c1ccc2=NC3=CC(=CC=C3=c2c1)[N+](=O)[O-]')
[18:24:50] Explicit valence for atom # 10 C, 5, is greater than permitted

This looks like a disagreement with another toolkit about how to parse this SMILES:

In [6]: from openeye.oechem import *

In [7]: mol = OEMol()

In [8]: OESmilesToMol(mol, 'c1ccc2=NC3=CC(=CC=C3=c2c1)[N+](=O)[O-]')
Out[8]: True

In [9]: OEMolToSmiles(mol)
Out[9]: 'c1ccc2=NC3=CC(=CC=C3=c2c1)[N+](=O)[O-]'

In [10]: for bond in mol.GetAtom(OEHasAtomIdx(5)).GetBonds():
    ...:     print(bond.GetOrder())
    ...:
1
1
2

-Brian

On Wed, Mar 4, 2020 at 6:04 PM Lewis Martin <lewis.marti...@gmail.com> wrote:

> Hi all,
> I'm chasing up a small puzzle on parsing SMILES codes, if anyone's
> interested, but it's not directly about RDKit. I was looking at the
> molecules in the MUTAG dataset, which is commonly used in graph learning
> research. Mostly these are just shared as graphs (i.e. vertices and edges)
> rather than SMILES codes, but I did find a list of SMILES at ChemDB:
> http://cdb.ics.uci.edu/cgibin/LearningDatasetsWeb.py
>
> RDKit can't parse two of the SMILES codes (indices 82 and 187, using zero
> indexing), and you'll note that they are the same:
>
> > smiles[82]
> > 'c1ccc2=NC3=CC(=CC=C3=c2c1)[N+](=O)[O-]'
>
> > smiles[187]
> > 'c1ccc2=NC3=CC=C(C=C3=c2c1)[N+](=O)[O-]'
>
> I found these in the original paper* as '83 - 2-nitrocarbazole' and
> '188 - 3-nitrocarbazole' (1-indexed). So the SMILES codes should be
> similar but not exactly the same.
>
> Can anyone please tell me whether those SMILES codes are legit, or has
> there been a transcription error? It's easy to replace these with the
> correct codes from PubChem, but if you're familiar with the dataset, is it
> safe to trust the other codes? Why does the paper have 197 molecules but
> the dataset only 188?
>
> Thanks for your time!
>
> PubChem for 2-nitrocarbazole:
> https://pubchem.ncbi.nlm.nih.gov/compound/99612#section=InChI
> PubChem for 3-nitrocarbazole:
> https://pubchem.ncbi.nlm.nih.gov/compound/3-Nitrocarbazole
>
> *The original paper is "Structure-activity relationship of mutagenic
> aromatic and heteroaromatic nitro compounds. Correlation with molecular
> orbital energies and hydrophobicity"
> https://www.ncbi.nlm.nih.gov/pubmed/1995902
>
> _______________________________________________
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss