Note that the position of the first opening parenthesis differs between the two SMILES:

>>> 'c1ccc2=NC3=CC(=CC=C3=c2c1)[N+](=O)[O-]'.find('(')
13
>>> 'c1ccc2=NC3=CC=C(C=C3=c2c1)[N+](=O)[O-]'.find('(')
15
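
To see exactly where the two strings diverge, a character-by-character comparison is enough. This is a minimal sketch in plain Python; the helper name `diff_positions` and the variable names are made up for illustration:

```python
def diff_positions(a, b):
    """Return the indices at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have the same length")
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


# The two SMILES from the original message (indices 82 and 187 in the list).
smi_82 = 'c1ccc2=NC3=CC(=CC=C3=c2c1)[N+](=O)[O-]'
smi_187 = 'c1ccc2=NC3=CC=C(C=C3=c2c1)[N+](=O)[O-]'

positions = diff_positions(smi_82, smi_187)
print(positions)  # [13, 14, 15]
```

The strings differ only at positions 13 to 15, which is where the branch opens in each case.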

So the SMILES are syntactically valid representations of 2- and
3-nitrocarbazole, though semantically odd because they mix Kekulé and
aromatic notation. I would expect something in either of the two forms
below: the Kekulé SMILES pulled from the PubChem website, or the
aromatic form RDKit produces from it:


In [3]: Chem.CanonSmiles('C1=CC=C2C(=C1)C3=C(N2)C=CC(=C3)[N+](=O)[O-]')
Out[3]: 'O=[N+]([O-])c1ccc2[nH]c3ccccc3c2c1'

In [4]: Chem.CanonSmiles('C1=CC=C2C(=C1)C3=C(N2)C=C(C=C3)[N+](=O)[O-]')
Out[4]: 'O=[N+]([O-])c1ccc2c(c1)[nH]c1ccccc12'

The original two SMILES seem to be trying to represent a protonation
state with the nitrogen deprotonated ('=NC3='), which leads RDKit to
compute a valence of 5 for one of the carbons (atom 10 in the error
below):


In [5]: Chem.CanonSmiles('c1ccc2=NC3=CC(=CC=C3=c2c1)[N+](=O)[O-]')
[18:24:50] Explicit valence for atom # 10 C, 5, is greater than permitted
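
The valence in that message can be checked by hand by adding up the written bond orders around each non-aromatic carbon. The sketch below is a deliberately simplified, hand-rolled tally, not RDKit's algorithm: it only understands the tokens that occur in the SMILES in this thread (no '.', no '%nn' ring closures, no isotopes), it skips unmarked bonds between aromatic atoms because their order depends on kekulization, and the name `overvalent_carbons` is made up for illustration. Atoms are numbered from 0 in the order they appear in the string.

```python
import re

# Tokens needed for the SMILES in this thread: bracket atoms, organic-subset
# atoms, bond symbols, branches and single-digit ring closures.
TOKEN = re.compile(r"\[[^\]]+\]|Cl|Br|[BCNOPSFIbcnops]|[-=#]|[()]|\d")
BOND_ORDER = {"-": 1, "=": 2, "#": 3}


def overvalent_carbons(smiles):
    """Return indices of non-aromatic C atoms whose bond orders sum to > 4."""
    symbols = []       # atom tokens in order of appearance
    aromatic = []      # True for lowercase (aromatic) atoms
    totals = []        # summed order of the non-aromatic bonds per atom
    branch_stack = []
    ring_open = {}     # ring digit -> (atom index, bond symbol or None)
    prev = None        # atom the next bond attaches to
    pending = None     # bond symbol seen since the last atom, if any

    def add_bond(a, b, symbol):
        # Unmarked bonds between two aromatic atoms are not counted.
        if symbol is None and aromatic[a] and aromatic[b]:
            return
        order = BOND_ORDER.get(symbol, 1)  # no symbol means a single bond
        totals[a] += order
        totals[b] += order

    for tok in TOKEN.findall(smiles):
        if tok == "(":
            branch_stack.append(prev)
        elif tok == ")":
            prev = branch_stack.pop()
        elif tok in BOND_ORDER:
            pending = tok
        elif tok.isdigit():
            if tok in ring_open:
                other, symbol = ring_open.pop(tok)
                add_bond(other, prev, pending or symbol)
            else:
                ring_open[tok] = (prev, pending)
            pending = None
        else:
            symbols.append(tok)
            aromatic.append(tok.lstrip("[")[0].islower())
            totals.append(0)
            if prev is not None:
                add_bond(prev, len(symbols) - 1, pending)
            pending = None
            prev = len(symbols) - 1
    return [i for i, s in enumerate(symbols) if s == "C" and totals[i] > 4]


print(overvalent_carbons('c1ccc2=NC3=CC(=CC=C3=c2c1)[N+](=O)[O-]'))        # [10]
print(overvalent_carbons('C1=CC=C2C(=C1)C3=C(N2)C=CC(=C3)[N+](=O)[O-]'))  # []
```

Atom 10 is the carbon written as '=C3=': it carries two double bonds plus the single ring-closure bond back to atom 5, which adds up to 5. The PubChem Kekulé form has no such atom.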


OpenEye's OEChem, by contrast, accepts the same SMILES, so this looks
like a disagreement between toolkits about how to parse it:


In [6]: from openeye.oechem import *

In [7]: mol = OEMol()

In [8]: OESmilesToMol(mol, 'c1ccc2=NC3=CC(=CC=C3=c2c1)[N+](=O)[O-]')
Out[8]: True

In [9]: OEMolToSmiles(mol)
Out[9]: 'c1ccc2=NC3=CC(=CC=C3=c2c1)[N+](=O)[O-]'

In [10]: for bond in mol.GetAtom(OEHasAtomIdx(5)).GetBonds():
    ...:     print(bond.GetOrder())
    ...:
1
1
2

-Brian


On Wed, Mar 4, 2020 at 6:04 PM Lewis Martin <lewis.marti...@gmail.com>
wrote:

> Hi all,
> I'm chasing up a small puzzle on parsing SMILES codes, if anyone's
> interested, though it's not directly about RDKit. I was looking at the molecules in
> the MUTAG dataset, which is commonly used in graph learning research.
> Mostly these are just shared as graphs (i.e. vertices and edges) rather
> than SMILES codes, but I did find a list of SMILES at ChemDB:
> http://cdb.ics.uci.edu/cgibin/LearningDatasetsWeb.py
>
> RDKit can't parse two of the SMILES codes - indices 82 and 187 using zero
> indexing - and you'll note that they are the same:
>
> smiles[82]
>
> 'c1ccc2=NC3=CC(=CC=C3=c2c1)[N+](=O)[O-]'
>
>
> smiles[187]
>
> 'c1ccc2=NC3=CC=C(C=C3=c2c1)[N+](=O)[O-]'
>
>
> I found these in the original paper* as '83 - 2-nitrocarbazole' and '188
> - 3-nitrocarbazole' (1-indexed). So the SMILES codes should be similar
> but not exactly the same.
>
> Can anyone please tell me if those smiles codes are legit or has there
> been a transcription error? It's easy to replace these with the correct
> codes from pubchem, but if you're familiar with the dataset is it safe to
> trust the other codes? Why does the paper have 197 molecules but the
> dataset only has 188?
>
> Thanks for your time!
>
>
> PubChem for 2-nitrocarbazole:
> https://pubchem.ncbi.nlm.nih.gov/compound/99612#section=InChI
> PubChem for 3-nitrocarbazole:
> https://pubchem.ncbi.nlm.nih.gov/compound/3-Nitrocarbazole
>
>
>
> *original paper is "Structure-activity relationship of mutagenic aromatic
> and heteroaromatic nitro compounds. Correlation with molecular orbital
> energies and hydrophobicity" https://www.ncbi.nlm.nih.gov/pubmed/1995902
>
>
> _______________________________________________
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>
