Hi there,

I typically use Python's standardiser (https://pypi.python.org/pypi/standardiser) when preparing molecules for machine learning, and I have found GetHashedAtomPairFingerprintAsBitVect to be a very good source of input for support vector machines and neural networks. However, this fingerprint function can fail if the input has been standardised.

An example:

    from rdkit import Chem
    from rdkit.Chem import rdMolDescriptors
    from standardiser import rules  # assumed import path for standardiser v0.1.9

    m = Chem.MolFromSmiles("C(=O)(c1ccc(cc1)O)O")
    fp = rdMolDescriptors.GetHashedAtomPairFingerprintAsBitVect(m)  # works

    # standardise the molecule, using standardiser v0.1.9 https://pypi.python.org/pypi/standardiser
    std_m = rules.run(m)
    fp = rdMolDescriptors.GetHashedAtomPairFingerprintAsBitVect(std_m)  # fails

This produces the following error:

    RuntimeError                              Traceback (most recent call last)
    <ipython-input-221-b73abfdb31ef> in <module>()
          4 #standardise the molecule, using standardiser v0.1.9 https://pypi.python.org/pypi/standardiser
          5 std_m = rules.run(m)
    ----> 6 fp = rdMolDescriptors.GetHashedAtomPairFingerprintAsBitVect(std_m)

    RuntimeError: Invariant Violation
        explicit valence exceeds atom degree
        Violation occurred on line 32 in file Code/GraphMol/Fingerprints/AtomPairs.cpp
        Failed Expression: val >= atom->getDegree()
        RDKIT: 2017.09.3
        BOOST: 1_63

After a few hours of digging, I have found why this happens. The relevant source code (https://github.com/rdkit/rdkit/blob/master/Code/GraphMol/Fingerprints/AtomPairs.cpp) is:

    unsigned int res = 0;
    if (atom->getIsAromatic()) {
      res = 1;
    } else if (atom->getHybridization() != Atom::SP3) {
      unsigned int val = static_cast<unsigned int>(atom->getExplicitValence());
      val -= atom->getNumExplicitHs();
      CHECK_INVARIANT(val >= atom->getDegree(),
                      "explicit valence exceeds atom degree");
      res = val - atom->getDegree();

From what I can gather, standardisation adds explicit hydrogens (and there seems to be no way to turn this off), whereas default sanitisation does not.
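
To make the failing check concrete, here is a minimal pure-Python paraphrase of the branch quoted above. It is only a sketch: the function name `pi_electrons_sketch`, its boolean arguments and the example numbers are illustrative assumptions, not part of the RDKit API, and it does not need RDKit to run.

```python
def pi_electrons_sketch(is_aromatic, is_sp3, explicit_valence,
                        num_explicit_hs, degree):
    """Paraphrase of the quoted AtomPairs.cpp branch (hypothetical helper)."""
    if is_aromatic:
        return 1
    if is_sp3:
        # The quoted code leaves res at 0 for SP3 atoms.
        return 0
    # Note: C++ uses unsigned int here; Python ints do not wrap around,
    # which makes no difference for the inputs used below.
    val = explicit_valence - num_explicit_hs
    if val < degree:
        # Mirrors CHECK_INVARIANT(val >= atom->getDegree(), ...)
        raise RuntimeError(
            "Invariant Violation: explicit valence exceeds atom degree")
    return val - degree


# Illustrative input: explicit valence 4, no explicit Hs, degree 3.
print(pi_electrons_sketch(False, False, 4, 0, 3))  # prints 1

# Illustrative input: explicit valence 1, one explicit H, degree 1.
# 1 - 1 = 0, which is smaller than the degree, so the check fails.
try:
    pi_electrons_sketch(False, False, 1, 1, 1)
except RuntimeError as err:
    print(err)
```

Aromatic and SP3 atoms return before the check is reached, so only non-aromatic, non-SP3 atoms can trigger the error.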

When iterating through each atom in the standardised molecule (mol.GetAtoms()) and checking the explicit valence (a.GetExplicitValence()), the number of explicit hydrogens (a.GetNumExplicitHs()) and the degree (a.GetDegree()), it is easy to see how the error is caused:

    Standardised molecule
    Atoms:                                    ['C', 'O', 'C', 'C', 'C', 'C', 'C', 'C', 'O', 'O']
    Explicit Valence:                         [4, 2, 4, 3, 3, 4, 3, 3, 1, 1]
    Explicit Hs:                              [0, 0, 0, 1, 1, 0, 1, 1, 1, 1]
    Valence (Explicit Valence - Explicit Hs): [4, 2, 4, 2, 2, 4, 2, 2, 0, 0]
    Degrees:                                  [3, 1, 3, 2, 2, 3, 2, 2, 1, 1]
    Valence >= Degrees:                       [True, True, True, True, True, True, True, True, False, False]

The check fails for the last two atoms, both oxygens carrying an explicit hydrogen.

Should GetHashedAtomPairFingerprint be changed so that explicit hydrogens are added before this check is performed, so that standardised molecules no longer trigger the error?

System info:
Python 3.6.3
RDKit 2017.09.03
standardiser 0.1.9

Kind Regards,
Rebecca Mackenzie
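
P.S. The per-atom comparison above can be replayed without RDKit installed. The following is only a minimal sketch: the lists are copied from the values reported in this message, and the helper name `failing_atoms` is a hypothetical choice for illustration. Like the table, it applies the comparison to every atom and ignores the aromaticity and hybridisation tests that precede the check in AtomPairs.cpp.

```python
# Values copied from the per-atom listing in this message.
symbols = ['C', 'O', 'C', 'C', 'C', 'C', 'C', 'C', 'O', 'O']
explicit_valence = [4, 2, 4, 3, 3, 4, 3, 3, 1, 1]
explicit_hs = [0, 0, 0, 1, 1, 0, 1, 1, 1, 1]
degrees = [3, 1, 3, 2, 2, 3, 2, 2, 1, 1]


def failing_atoms(explicit_valence, explicit_hs, degrees):
    """Return indices where (explicit valence - explicit Hs) < degree."""
    failing = []
    for idx, (val, n_hs, deg) in enumerate(
            zip(explicit_valence, explicit_hs, degrees)):
        if val - n_hs < deg:
            failing.append(idx)
    return failing


failing = failing_atoms(explicit_valence, explicit_hs, degrees)
for idx in failing:
    print(idx, symbols[idx])
```

With the values above, this reports atoms 8 and 9, the two oxygens that carry an explicit hydrogen.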