Here's another example of how it's important to know the clear goal of collecting such a list.
One of the entries someone added to the spreadsheet is:

  Tanimoto, Taffee T. (17 Nov 1958). "An Elementary Mathematical theory of Classification and Prediction". Internal IBM Technical Report. 1957 (8?).

I'm going to argue that it's not useful. It is likely present because it's the reason we call the Tanimoto "Tanimoto". However, I don't think it's worthwhile to point to that citation.

Here's the history as I know it. The first paper to really use the Tanimoto for similarity search was:

  Willett, P.; Winterman, V.; Bawden, D. Implementation of Nearest-Neighbor Searching in an Online Chemical Structure Search System. Journal of Chemical Information and Computer Sciences 1986, 26 (1), 36–41. https://doi.org/10.1021/ci00049a008.

Others quickly picked up on it, because 1) it was easy to do - Willett told me that one of the first external implementations took an afternoon to write, and 2) bitstrings were already present because everyone already had pre-computed MACCS keys.

The choice of Tanimoto was based on a comparison of several different schemes, in:

  Willett, P.; Winterman, V. A Comparison of Some Measures for the Determination of Inter-Molecular Structural Similarity Measures of Inter-Molecular Structural Similarity. Quant. Struct.-Act. Relat. 1986, 5 (1), 18–25. https://doi.org/10.1002/qsar.19860050105.

(That's the one which should properly be quoted as demonstrating that the Tanimoto was at least as effective as the others, and the easiest to implement, so was chosen. The two papers were jointly published, and reference each other as "in press".)

However, you'll notice that neither paper cites Tanimoto. Instead, they cite earlier work by Adamson and Bush. These are:

  Adamson, G. W.; Bush, J. A. A Method for the Automatic Classification of Chemical Structures. Information Storage and Retrieval 1973, 9 (10), 561–568. https://doi.org/10.1016/0020-0271(73)90059-4.

  Adamson, G. W.; Bush, J. A. A Comparison of the Performance of Some Similarity and Dissimilarity Measures in the Automatic Classification of Chemical Structures. Journal of Chemical Information and Computer Sciences 1975, 15 (1), 55–58. https://doi.org/10.1021/ci60001a016.

The 1975 paper cites:

  David J. Rogers, Taffee T. Tanimoto, A Computer Program for Classifying Plants, Science, 21 Oct 1960 1115-1118. https://science.sciencemag.org/content/132/3434/1115

More specifically, the citation is on p56 of the paper, starting with the last sentence of the first column and continuing at the top of the second column:

  "Several coefficients have been proposed based on this criterion, [8-10,14-16] and some of these were used in the classification of the anesthetics."

The 1973 paper neither cites Tanimoto nor uses a Tanimoto similarity.

So it appears that Adamson et al. investigated bitstrings using other comparison methods, while Willett et al. were the first to investigate the Tanimoto.

For several years after Willett et al. there are few or no citations to Tanimoto (1958) or to Rogers and Tanimoto (1960). As an example of one of the indirect citations, see:

  Grethe, G.; Moock, T. E. Similarity Searching in REACCS. A New Tool for the Synthetic Chemist. J. Chem. Inf. Comput. Sci. 1990, 30 (4), 511–520. https://doi.org/10.1021/ci00068a025.

where the Tanimoto is citation (9): "Ref 1; p54", and reference 1 from the same paper is:

  Willett, P. Similarity and Clustering in Chemical Information Systems; Research Studies Press: Letchworth, Hertfordshire, England, 1987

Now, go back to the citation that's currently in the spreadsheet:

  Tanimoto, Taffee T. (17 Nov 1958). "An Elementary Mathematical theory of Classification and Prediction". Internal IBM Technical Report. 1957 (8?).

What does "*Internal* IBM Technical Report" mean?! Wikipedia used to describe this as "unavailable".
I pointed out that it is available through WorldCat, and I got a copy from SUB Göttingen:

  https://en.wikipedia.org/w/index.php?title=Jaccard_index&diff=704793261&oldid=688763411

It's at http://dalkescientific.com/tanimoto.pdf for the really curious.

I can't figure out why anyone would refer a student to 1) an internal publication, where 2) it's so hard to get, especially given 3) the actual cheminformatics literature references a 1960 Science publication which is a further refinement of the internal report.

My guess is that it's one of those citations that everyone passes around, but which no one has actually read.

(If Tanimoto's 1958 internal report is a good citation, then I have a copy of the internal National Bureau of Standards publication by Ray and Kirsch from 1956, which predates their widely-cited 1957 Science publication:

  Ray, Louis and Russell A. Kirsch. The Use of Automatic Data Processing Systems in the Retrieval of Technical Information; National Bureau of Standards Report 5115, 1956

I had to get that from a used book dealer.)

But wait, I'm not done yet. The Tanimoto we use is the same as the Jaccard similarity, so perhaps we should point students to that instead? The citation is:

  Jaccard, Paul. "Étude comparative de la distribution florale dans une portion des Alpes et des Jura." Bull Soc Vaudoise Sci Nat 37 (1901): 547-579.

However, are students supposed to know French to read it? Or should we point to the English translation at:

  THE DISTRIBUTION OF THE FLORA IN THE ALPINE ZONE. Paul Jaccard. First published: February 1912
  https://doi.org/10.1111/j.1469-8137.1912.tb05611.x
  https://nph.onlinelibrary.wiley.com/doi/abs/10.1111/j.1469-8137.1912.tb05611.x

(The cheminformatics literature has at least one paper which cites the original French, and at least one paper which cites the English translation.)

In any case, there's really no connection between those papers and cheminformatics, other than for those interested in tracing the concept.
That's why I think the Willett et al. paper(s) are all that a student really needs to read for the relevant history, while someone like me would like to read and document the more complete history.

Andrew
da...@dalkescientific.com
_______________________________________________
Blueobelisk-discuss mailing list
Blueobelisk-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/blueobelisk-discuss