Here's another example of why it's important to have a clear goal when 
collecting such a list.

One of the entries someone added to the spreadsheet is:
  Tanimoto, Taffee T. (17 Nov 1958).
  "An Elementary Mathematical theory of Classification and Prediction".
  Internal IBM Technical Report. 1957 (8?).

I'm going to argue that it's not useful.


It's likely on the list because it's the reason we call the Tanimoto "Tanimoto".

However, I don't think it's worthwhile to point to that citation. Here's the 
history as I know it:

The first paper to really use the Tanimoto for similarity search was:

Willett, P.; Winterman, V.; Bawden, D. Implementation of Nearest-Neighbor 
Searching in an Online Chemical Structure Search System. Journal of Chemical 
Information and Computer Sciences 1986, 26 (1), 36–41. 
https://doi.org/10.1021/ci00049a008.

Others quickly picked up on it, because 1) it was easy to do - Willett told me 
that one of the first external implementations took an afternoon to write, 
and 2) bitstrings were already present because everyone already had 
pre-computed MACCS keys.
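
To give a sense of why an afternoon was enough, here is a minimal sketch of a 
Tanimoto calculation on bitstrings. It is only an illustration, not the code 
from any of the systems mentioned above: the function name `tanimoto`, the use 
of Python ints as bitstrings, the choice to score two all-zero fingerprints as 
0.0, and the example bit patterns are all assumptions.

```python
def tanimoto(fp_a: int, fp_b: int) -> float:
    """Tanimoto similarity between two fingerprints stored as Python ints."""
    # Number of bits set in both fingerprints.
    in_common = bin(fp_a & fp_b).count("1")
    # Number of bits set in at least one fingerprint.
    in_either = bin(fp_a | fp_b).count("1")
    # Assumed convention: two all-zero fingerprints score 0.0.
    if in_either == 0:
        return 0.0
    return in_common / in_either


# Two made-up 8-bit fingerprints (hypothetical, not real MACCS keys).
query = 0b10110010
target = 0b10010011
print(tanimoto(query, target))  # 3 bits in common / 5 bits in either -> 0.6
```

A nearest-neighbor search is then a matter of scoring the query against every 
stored bitstring and keeping the highest-scoring entries.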

The choice of Tanimoto was based on a comparison of several different schemes, 
in:

Willett, P.; Winterman, V. A Comparison of Some Measures for the Determination 
of Inter-Molecular Structural Similarity Measures of Inter-Molecular Structural 
Similarity. Quant. Struct.-Act. Relat. 1986, 5 (1), 18–25. 
https://doi.org/10.1002/qsar.19860050105.

(That's the one which should properly be cited as demonstrating that the 
Tanimoto was at least as effective as the others, and easiest to implement, so 
was chosen. The two papers were jointly published, and reference each other "in 
press".)

However, you'll notice that neither paper cites Tanimoto. Instead, they cite 
earlier work by Adamson and Bush. These are:

Adamson, G. W.; Bush, J. A. A Method for the Automatic Classification of 
Chemical Structures. Information Storage and Retrieval 1973, 9 (10), 561–568. 
https://doi.org/10.1016/0020-0271(73)90059-4.

Adamson, G. W.; Bush, J. A. A Comparison of the Performance of Some Similarity 
and Dissimilarity Measures in the Automatic Classification of Chemical 
Structures. Journal of Chemical Information and Computer Sciences 1975, 15 (1), 
55–58. https://doi.org/10.1021/ci60001a016.

The 1975 paper cites David J. Rogers, Taffee T. Tanimoto, A Computer Program 
for Classifying Plants, Science, 21 Oct 1960 1115-1118. 
https://science.sciencemag.org/content/132/3434/1115

More specifically, it's on p56 of the paper, starting on the last sentence of 
the first column, going to the top of the second column:

   "Several coefficients have been proposed based on this criterion, 
[8-10,14-16] and some of these were used in the classification of the 
anesthetics."

The 1973 paper neither cites Tanimoto nor uses a Tanimoto similarity.

So it appears that Adamson et al. investigated bitstrings using other 
comparison methods, while Willett et al. were the first to investigate the 
Tanimoto.

For several years after Willett et al. there are few or no citations to Tanimoto 
(1958) or to Rogers and Tanimoto (1960). As an example of one of the indirect 
citations, see:

Grethe, G.; Moock, T. E. Similarity Searching in REACCS. A New Tool for the 
Synthetic Chemist. J. Chem. Inf. Comput. Sci. 1990, 30 (4), 511–520. 
https://doi.org/10.1021/ci00068a025.

where the Tanimoto is citation (9): "Ref 1; p54", where reference 1 from the 
same paper is: Willett, P. Similarity and Clustering in Chemical Information 
Systems; Research Studies Press: Letchworth, Hertfordshire, England, 1987.


Now, go back to the citation that's currently in the spreadsheet:
  Tanimoto, Taffee T. (17 Nov 1958).
  "An Elementary Mathematical theory of Classification and Prediction".
  Internal IBM Technical Report. 1957 (8?).

What does "*Internal* IBM Technical Report" mean?!

Wikipedia used to describe this as "unavailable". I pointed out that it is 
available through WorldCat, and that I got a copy from SUB Göttingen:
  
https://en.wikipedia.org/w/index.php?title=Jaccard_index&diff=704793261&oldid=688763411

It's at http://dalkescientific.com/tanimoto.pdf for the really curious.

I can't figure out why anyone would refer a student to 1) an internal 
publication, which 2) is so hard to get, especially given that 3) the actual 
cheminformatics literature references a 1960 Science publication which is a 
further refinement of the internal report.

My guess is that it's one of those citations that everyone passes around, but 
which no one has actually read.

(If Tanimoto's 1958 internal report is a good citation, then I have a copy of 
the internal National Bureau of Standards publication by Ray and Kirsch from 
1956, which predates their widely-cited 1957 Science publication:

Ray, Louis and Russell A. Kirsch. The Use of Automatic Data Processing Systems 
in the Retrieval of Technical Information; National Bureau of Standards Report 
5115, 1956

I had to get that from a used book dealer.)


But wait, I'm not done yet.

The Tanimoto we use is the same as the Jaccard similarity, so perhaps we should 
point students to that instead?

The citation is:

  Jaccard, Paul. "Étude comparative de la distribution florale dans une portion 
des Alpes et des Jura." Bull Soc Vaudoise Sci Nat 37 (1901): 547-579.

However, are students supposed to know French to read it? Or should we point to 
the English translation at:

THE DISTRIBUTION OF THE FLORA IN THE ALPINE ZONE.
Paul Jaccard
First published: February 1912
https://doi.org/10.1111/j.1469-8137.1912.tb05611.x
https://nph.onlinelibrary.wiley.com/doi/abs/10.1111/j.1469-8137.1912.tb05611.x

(The cheminformatics literature has at least one paper which cites the original 
French, and at least one paper which cites the English translation.)

In any case, there's really no connection between those papers and 
cheminformatics, other than for those interested in tracing the concept.

That's why I think the Willett et al. paper(s) are all that a student really 
needs to read for the relevant history, while someone like me would like to 
read and document the more complete history.


                                Andrew
                                da...@dalkescientific.com



