Hi all,

  RDKit implements Tanimoto similarity for count fingerprints. I only last week 
realized there's been a change in what "Tanimoto similarity" means for count 
fingerprints, and RDKit seems to be the reason for the shift. I'm curious to 
know the history.

* Tanimoto #1 is Σaᵢbᵢ/(Σaᵢ²+Σbᵢ²-Σaᵢbᵢ), that is, it interprets count 
fingerprints as vectors

The oldest citation I have is Bawden, "Browsing and Clustering of Chemical 
Structures" on p147 of "Chemical structures" (1988) from the first ICCS.

A more accessible citation is Willett, "Chemical Similarity Searching" JCICS 
(1998) 38, 983-996 available at 
https://web.archive.org/web/20040218213916/http://www-personal.engin.umich.edu:80/~wildd/che697/willett98.pdf
 . See page 987, the "formula for continuous values" under "Tanimoto 
Coefficient".

My literature search shows it was the main definition for almost 30 years.
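
As a concrete illustration of Tanimoto #1, here is a minimal pure-Python sketch. The sparse dict-of-counts representation, the function name, the example fingerprints, and the value returned for two empty fingerprints are all assumptions made for the example; none of it is taken from RDKit or from the papers cited above.

```python
def tanimoto_vector(a, b):
    """Tanimoto #1: sum(a_i*b_i) / (sum(a_i^2) + sum(b_i^2) - sum(a_i*b_i)).

    `a` and `b` are sparse count fingerprints stored as {feature_id: count}.
    """
    # Only features present in both fingerprints contribute to the dot product.
    dot = sum(count * b[key] for key, count in a.items() if key in b)
    aa = sum(count * count for count in a.values())
    bb = sum(count * count for count in b.values())
    denom = aa + bb - dot
    # Returning 0.0 for two empty fingerprints is an arbitrary choice (assumption).
    return dot / denom if denom else 0.0


# Hypothetical count fingerprints, made up for the example.
fp_a = {1: 3, 2: 1, 5: 2}
fp_b = {1: 1, 2: 1, 7: 4}

# dot = 3*1 + 1*1 = 4, sum(a^2) = 14, sum(b^2) = 18, so 4 / (14 + 18 - 4) = 1/7
print(tanimoto_vector(fp_a, fp_b))
```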

* Tanimoto #2 is Σmin(aᵢ,bᵢ)/Σmax(aᵢ,bᵢ), that is, what Wikipedia calls the 
"weighted Jaccard similarity."

This is what RDKit uses. It was committed to Code/DataStructs/SparseIntVect.h 
on 2009-Jun-18, as part of adding Tversky similarity, and a couple of years 
after adding Dice similarity.
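
For comparison, a minimal pure-Python sketch of Tanimoto #2. This is not RDKit's code; the dict-of-counts representation, the function name, the example fingerprints, and the handling of two empty fingerprints are assumptions made for the example.

```python
def tanimoto_minmax(a, b):
    """Tanimoto #2: sum(min(a_i, b_i)) / sum(max(a_i, b_i)).

    `a` and `b` are sparse count fingerprints stored as {feature_id: count};
    a feature missing from a fingerprint has count 0.
    """
    keys = set(a) | set(b)
    num = sum(min(a.get(key, 0), b.get(key, 0)) for key in keys)
    den = sum(max(a.get(key, 0), b.get(key, 0)) for key in keys)
    # Returning 0.0 for two empty fingerprints is an arbitrary choice (assumption).
    return num / den if den else 0.0


# Hypothetical count fingerprints, made up for the example.
fp_a = {1: 3, 2: 1, 5: 2}
fp_b = {1: 1, 2: 1, 7: 4}

# sum of mins = 1 + 1 + 0 + 0 = 2, sum of maxes = 3 + 1 + 2 + 4 = 10, so 2/10
print(tanimoto_minmax(fp_a, fp_b))

# With only 0/1 counts this reduces to the usual binary Tanimoto:
# one shared feature out of three distinct features.
print(tanimoto_minmax({1: 1, 2: 1}, {2: 1, 3: 1}))
```

For this pair Tanimoto #2 gives 0.2, while the Tanimoto #1 formula gives 4/28, or about 0.143, so the two definitions only agree in general when every count is 0 or 1.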

I believe that as a result of RDKit's popularity, recent papers have taken to 
describing this as, for example, "the counted Tanimoto similarity", as in 
https://jcheminf.biomedcentral.com/articles/10.1186/s13321-025-01081-6 ("also 
known as the multiset coefficient calculation").

Does anyone here know how RDKit came to be the way it is?


In my literature search, I believe the similarity function for Tanimoto #2 was 
first proposed by Henry Allan Gleason, "Some Applications of the Quadrat 
Method", Bulletin of the Torrey Botanical Club, Vol. 47, No. 1 (Jan., 1920), 
pp. 21-33, starting on page 31 where he proposes adding species abundance to 
Jaccard's similarity. See 
https://archive.org/details/jstor-2480223/page/n11/mode/2up 

Some people (and https://en.wikipedia.org/wiki/Jaccard_index) refer to this as 
Ruzicka similarity, from Ruzicka (1958), but on the Mastodon discussion at 
https://mstdn.science/@molecule/115142680945701031 you'll see that wim 
(@[email protected]) got a copy of the relevant part of Ruzicka's paper, 
and it appears to be identical to Gleason's extension to Jaccard similarity -- 
not even in the cool looking min/max formulation as attributed in, eg, 
https://archive.org/details/dictionaryofdist0000deza/mode/2up?q=Ruzicka .

The first paper to apply Tanimoto #2 to fingerprints appears to be 
Swamidass et al., "Kernels for small molecules and the prediction 
of mutagenicity, toxicity and anti-cancer activity", Bioinformatics, Volume 21, 
Issue suppl_1, June 2005, Pages i359–i368, 
https://doi.org/10.1093/bioinformatics/bti1055 where they call it the "MinMax" 
kernel and explicitly compare it to Tanimoto #1.

Some papers since then refer to Tanimoto #2 as MinMax.

Now, I was able to find a use of (1-Tanimoto #2) as a similarity measure 
("measure" used in its mathematical meaning) in Thomas Ott, Albert Kern, Ansgar 
Schuffenhauer, Maxim Popov, Pierre Acklin, Edgar Jacoby, and Ruedi Stoop, 
"Sequential Superparamagnetic Clustering for Unbiased Classification of 
High-Dimensional Chemical Data", J. Chem. Inf. Comput. Sci. 2004, 44, 1358-1364 
available from https://tilde.ini.uzh.ch/users/tott/public_html/jcheminf.pdf but 
it is unnamed -- and a measure, not a similarity.

That makes me quite curious about how RDKit ended up the way it is.

To be clear, I prefer the similarity function given in #2 over that of #1, 
though I think having two "Tanimoto" definitions is going to be confusing. If 
only the Sheffield folks back in the 1980s had known. But hey, that's how we 
ended up with "Tanimoto" instead of "Jaccard". :)

Best regards,

                                Andrew
                                [email protected]

P.S.
  If anyone knows of an older citation, please let me know. There aren't good 
search tools for finding this formula, so it's a lot of tedious manual work.



_______________________________________________
Rdkit-discuss mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
