I was surprised that, using topological fingerprints, the Tanimoto
similarity between benzene and toluene is only 0.32.
Examining the fingerprint bits, I can see why: toluene sets many more
bits than benzene. But I don't understand why so many paths are
repeated for toluene.
To my way of thinking, paths that trace the same types of atoms should
not be considered different, and therefore should not set new bits.
Am I missing something?
Here is my sample code:
from rdkit import Chem
from rdkit.Chem import RDKFingerprint
from rdkit import DataStructs

smiles = ['c1ccccc1', 'Cc1ccccc1']  # benzene, toluene

# Build one topological (RDKit) fingerprint per molecule.
fps = list()
for smi in smiles:
    mol = Chem.MolFromSmiles(smi)
    fps.append(RDKFingerprint(mol))
    #fps.append(RDKFingerprint(mol, 1, 7, 1024, 3, True, 0.0, 1024))

# Print the 1-based positions of the bits that are set in each fingerprint.
for fp in fps:
    #print(fp.ToBitString())
    i = 0
    bitlist = list()
    for bit in fp:
        i += 1
        if bit:
            bitlist.append(i)
    print(bitlist)

# FingerprintSimilarity uses the Tanimoto metric by default.
print(DataStructs.FingerprintSimilarity(fps[0], fps[1]))
The output I get is:
[12, 18, 57, 72, 180, 199, 558, 590, 712, 858, 990, 999, 1221, 1277,
1446, 1582, 1639, 1787, 1829, 1879, 1914, 1952, 1986, 2021]
[12, 18, 57, 72, 123, 180, 199, 215, 242, 255, 301, 324, 361, 447,
518, 526, 558, 570, 590, 595, 610, 693, 703, 712, 745, 778, 857, 858,
891, 896, 927, 933, 961, 964, 968, 990, 999, 1012, 1022, 1047, 1065,
1090, 1100, 1108, 1134, 1172, 1188, 1221, 1228, 1243, 1268, 1277,
1287, 1297, 1306, 1345, 1446, 1503, 1514, 1538, 1582, 1593, 1622,
1626, 1639, 1665, 1691, 1787, 1829, 1873, 1879, 1914, 1952, 1986,
2021]
0.32
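
As a sanity check on the 0.32, the value can be reproduced from the two
bit lists above without RDKit. This is only a sketch: it assumes the
usual Tanimoto definition (number of shared on-bits divided by the
number of distinct on-bits in either fingerprint), and the variable
names are illustrative, not part of any RDKit API. Every benzene bit
also appears in the toluene list, so the ratio comes out as 24/75.

```python
# On-bit positions copied from the output above (1-based).
benzene_bits = [
    12, 18, 57, 72, 180, 199, 558, 590, 712, 858, 990, 999, 1221, 1277,
    1446, 1582, 1639, 1787, 1829, 1879, 1914, 1952, 1986, 2021,
]
toluene_bits = [
    12, 18, 57, 72, 123, 180, 199, 215, 242, 255, 301, 324, 361, 447,
    518, 526, 558, 570, 590, 595, 610, 693, 703, 712, 745, 778, 857, 858,
    891, 896, 927, 933, 961, 964, 968, 990, 999, 1012, 1022, 1047, 1065,
    1090, 1100, 1108, 1134, 1172, 1188, 1221, 1228, 1243, 1268, 1277,
    1287, 1297, 1306, 1345, 1446, 1503, 1514, 1538, 1582, 1593, 1622,
    1626, 1639, 1665, 1691, 1787, 1829, 1873, 1879, 1914, 1952, 1986,
    2021,
]

# Bits set in both fingerprints, and bits set in at least one of them.
shared = set(benzene_bits) & set(toluene_bits)
union = set(benzene_bits) | set(toluene_bits)

# Tanimoto = |A & B| / |A | B| (assumed standard definition).
tanimoto = float(len(shared)) / len(union)

print(len(benzene_bits), len(toluene_bits), len(shared), len(union))
print(tanimoto)
```

So the low similarity comes entirely from the 51 extra bits that
toluene sets; benzene contributes no bits that toluene lacks.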
_______________________________________________
Rdkit-discuss mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss