Hi there,

I typically use the Python standardiser package
(https://pypi.python.org/pypi/standardiser) when preparing molecules for
machine learning, and I have found GetHashedAtomPairFingerprintAsBitVect to be
a very good source of input for support vector machines and neural networks.


However, this fingerprint function can fail if the input molecule has been standardised.


# An example:

from rdkit import Chem
from rdkit.Chem import rdMolDescriptors
from standardiser import rules  # module providing rules.run

m = Chem.MolFromSmiles("C(=O)(c1ccc(cc1)O)O")
fp = rdMolDescriptors.GetHashedAtomPairFingerprintAsBitVect(m)  # works

# standardise the molecule, using standardiser v0.1.9
# (https://pypi.python.org/pypi/standardiser)
std_m = rules.run(m)
fp = rdMolDescriptors.GetHashedAtomPairFingerprintAsBitVect(std_m)  # fails

The last line produces the following error:

RuntimeError                              Traceback (most recent call last)
<ipython-input-221-b73abfdb31ef> in <module>()
      4 #standardise the molecule, using standardiser v0.1.9 https://pypi.python.org/pypi/standardiser
      5 std_m = rules.run(m)
----> 6 fp = rdMolDescriptors.GetHashedAtomPairFingerprintAsBitVect(std_m)

RuntimeError: Invariant Violation
        explicit valence exceeds atom degree
        Violation occurred on line 32 in file Code/GraphMol/Fingerprints/AtomPairs.cpp
        Failed Expression: val >= atom->getDegree()
        RDKIT: 2017.09.3
        BOOST: 1_63
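
When fingerprinting many molecules, it may help to trap this failure per molecule rather than let it abort the whole run. The following is only a minimal sketch: `try_fingerprint` is a hypothetical helper name, not part of RDKit or standardiser, and it relies solely on the fact shown in the traceback above that the invariant violation surfaces in Python as a RuntimeError.

```python
def try_fingerprint(mol, fingerprint_func):
    """Return fingerprint_func(mol), or None if the call raises RuntimeError.

    Hypothetical helper: fingerprint_func is any callable taking a molecule,
    e.g. rdMolDescriptors.GetHashedAtomPairFingerprintAsBitVect.
    """
    try:
        return fingerprint_func(mol)
    except RuntimeError as err:
        # The RDKit invariant violation is reported as a RuntimeError.
        print("fingerprint failed: {}".format(err))
        return None
```

It would be called as `try_fingerprint(std_m, rdMolDescriptors.GetHashedAtomPairFingerprintAsBitVect)`, with molecules that return None set aside for inspection.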

After a few hours of digging, I found the cause in the RDKit source code
(https://github.com/rdkit/rdkit/blob/master/Code/GraphMol/Fingerprints/AtomPairs.cpp):

unsigned int res = 0;
if (atom->getIsAromatic()) {
  res = 1;
} else if (atom->getHybridization() != Atom::SP3) {
  unsigned int val = static_cast<unsigned int>(atom->getExplicitValence());
  val -= atom->getNumExplicitHs();
  CHECK_INVARIANT(val >= atom->getDegree(),
                  "explicit valence exceeds atom degree");
  res = val - atom->getDegree();
}

From what I can gather, standardisation adds explicit hydrogens (and there
seems to be no way to turn this off), whereas default sanitisation does not.


When iterating over the atoms of the standardised molecule (mol.GetAtoms())
and checking the explicit valence (a.GetExplicitValence()), the number of
explicit hydrogens (a.GetNumExplicitHs()) and the degree (a.GetDegree()), it
is easy to see how the error is caused:

Standardised molecule
Atoms: ['C', 'O', 'C', 'C', 'C', 'C', 'C', 'C', 'O', 'O']
Explicit Valence: [4, 2, 4, 3, 3, 4, 3, 3, 1, 1]
Explicit Hs: [0, 0, 0, 1, 1, 0, 1, 1, 1, 1]
Valence (Explicit Valence - Explicit Hs): [4, 2, 4, 2, 2, 4, 2, 2, 0, 0]
Degrees: [3, 1, 3, 2, 2, 3, 2, 2, 1, 1]
Valence >= Degrees: [True, True, True, True, True, True, True, True, False, False]
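
The comparison in the table above can be replayed in a few lines of Python. This is a sketch only: the lists are copied from the table, and `failing_atoms` and `collect` are hypothetical helper names. `collect` uses only the atom methods mentioned above, so it should work on any RDKit molecule such as std_m. Note that the table compares every atom, whereas the C++ check only runs for atoms that are neither aromatic nor SP3.

```python
# Values copied from the table above (standardised molecule).
atoms = ['C', 'O', 'C', 'C', 'C', 'C', 'C', 'C', 'O', 'O']
explicit_valence = [4, 2, 4, 3, 3, 4, 3, 3, 1, 1]
explicit_hs = [0, 0, 0, 1, 1, 0, 1, 1, 1, 1]
degrees = [3, 1, 3, 2, 2, 3, 2, 2, 1, 1]


def failing_atoms(explicit_valence, explicit_hs, degrees):
    """Return indices where (explicit valence - explicit Hs) < degree."""
    rows = zip(explicit_valence, explicit_hs, degrees)
    return [i for i, (ev, hs, deg) in enumerate(rows) if ev - hs < deg]


def collect(mol):
    """Gather the same three per-atom lists from an RDKit molecule."""
    ev = [a.GetExplicitValence() for a in mol.GetAtoms()]
    hs = [a.GetNumExplicitHs() for a in mol.GetAtoms()]
    deg = [a.GetDegree() for a in mol.GetAtoms()]
    return ev, hs, deg


bad = failing_atoms(explicit_valence, explicit_hs, degrees)
print(bad)                      # indices of atoms that fail the comparison
print([atoms[i] for i in bad])  # their element symbols
```

For a live molecule, `failing_atoms(*collect(std_m))` would give the same kind of list without copying values by hand.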


Should GetHashedAtomPairFingerprint be changed so that explicit hydrogens are
added before this check is performed, so that standardised molecules no longer
cause this error?

System info:
Python 3.6.3
RDKit 2017.09.3
standardiser 0.1.9

Kind Regards,


Rebecca Mackenzie

