Regarding Tanimoto, credit should go to the original source, whatever language it is written in. From memory, I believe that is "La distribution des flores dans la zone alpine".
On Wed, 4 Sep 2019, 20:21 Andrew Dalke, <da...@dalkescientific.com> wrote:
> Here's another example of how it's important to know the clear goal of
> collecting such a list.
>
> One of the entries someone added to the spreadsheet is:
> Tanimoto, Taffee T. (17 Nov 1958).
> "An Elementary Mathematical theory of Classification and Prediction".
> Internal IBM Technical Report. 1957 (8?).
>
> I'm going to argue that it's not useful.
>
> This is likely present because it's the reason why we call Tanimoto
> "Tanimoto".
>
> However, I don't think it's worthwhile to point to that citation. Here's
> the history as I know it:
>
> The first paper to really use the Tanimoto for similarity search was:
>
> Willett, P.; Winterman, V.; Bawden, D. Implementation of Nearest-Neighbor
> Searching in an Online Chemical Structure Search System. Journal of
> Chemical Information and Computer Sciences 1986, 26 (1), 36–41.
> https://doi.org/10.1021/ci00049a008.
>
> Others quickly picked up on it, because 1) it was easy to do - Willett
> told me that one of the first external implementations took an afternoon
> to write, and 2) bitstrings were already present because everyone already
> had pre-computed MACCS keys.
>
> The choice of Tanimoto was based on a comparison of several different
> schemes, in:
>
> Willett, P.; Winterman, V. A Comparison of Some Measures for the
> Determination of Inter-Molecular Structural Similarity Measures of
> Inter-Molecular Structural Similarity. Quant. Struct.-Act. Relat. 1986, 5
> (1), 18–25. https://doi.org/10.1002/qsar.19860050105.
>
> (That's the one which should properly be quoted as demonstrating that the
> Tanimoto was at least as effective as the others, and easiest to implement,
> so was chosen. The two papers were jointly published, and reference each
> other "in press".)
>
> However, you'll notice that neither paper cites Tanimoto. Instead, they
> cite earlier work by Adamson and Bush. These are:
>
> Adamson, G. W.; Bush, J. A. A Method for the Automatic Classification of
> Chemical Structures. Information Storage and Retrieval 1973, 9 (10),
> 561–568. https://doi.org/10.1016/0020-0271(73)90059-4.
>
> Adamson, G. W.; Bush, J. A. A Comparison of the Performance of Some
> Similarity and Dissimilarity Measures in the Automatic Classification of
> Chemical Structures. Journal of Chemical Information and Computer Sciences
> 1975, 15 (1), 55–58. https://doi.org/10.1021/ci60001a016.
>
> The 1975 paper cites David J. Rogers, Taffee T. Tanimoto, A Computer
> Program for Classifying Plants, Science, 21 Oct 1960 1115-1118.
> https://science.sciencemag.org/content/132/3434/1115
>
> More specifically, it's on p56 of the paper, starting on the last sentence
> of the first column, going to the top of the second column:
>
> "Several coefficients have been proposed based on this criterion,
> [8-10,14-16] and some of these were used in the classification of the
> anesthetics."
>
> The 1973 paper neither cites Tanimoto nor uses a Tanimoto similarity.
>
> So it appears that Adamson et al. investigated bitstrings using other
> comparison methods, while Willett et al. were the first to investigate the
> Tanimoto.
>
> For several years after Willett et al. there are few or no citations to
> Tanimoto (1958) or to Rogers and Tanimoto (1960). As an example of one of
> the indirect citations, see:
>
> Grethe, G.; Moock, T. E. Similarity Searching in REACCS. A New Tool for
> the Synthetic Chemist. J. Chem. Inf. Comput. Sci. 1990, 30 (4), 511–520.
> https://doi.org/10.1021/ci00068a025.
>
> where the Tanimoto is citation (9): "Ref 1; p54", where reference 1 from
> the same paper is: Willett, P. Similarity and Clustering in Chemical
> Information Systems; Research Studies Press: Letchworth, Hertfordshire,
> England, 1987
>
> Now, go back to the citation that's currently in the spreadsheet:
> Tanimoto, Taffee T. (17 Nov 1958).
> "An Elementary Mathematical theory of Classification and Prediction".
> Internal IBM Technical Report. 1957 (8?).
>
> What does "*Internal* IBM Technical Report" mean?!
>
> Wikipedia used to describe this as "unavailable". I pointed out that it is
> available through WorldCat, and I got a copy from SUB Göttingen:
>
> https://en.wikipedia.org/w/index.php?title=Jaccard_index&diff=704793261&oldid=688763411
>
> It's at http://dalkescientific.com/tanimoto.pdf for the really curious.
>
> I can't figure out why anyone would refer a student to 1) an internal
> publication, which 2) is so hard to get, especially given that 3) the actual
> cheminformatics literature references a 1960 Science publication which is a
> further refinement of the internal report.
>
> My guess is that it's one of those citations that everyone passes around,
> but which no one has actually read.
>
> (If Tanimoto's 1958 internal report is a good citation, then I have a copy
> of the internal National Bureau of Standards publication by Ray and Kirsch
> from 1956, which predates their widely-cited 1957 Science publication:
>
> Ray, Louis and Russell A. Kirsch. The Use of Automatic Data Processing
> Systems in the Retrieval of Technical Information; National Bureau of
> Standards Report 5115, 1956
>
> I had to get that from a used book dealer.)
>
> But wait, I'm not done yet.
>
> The Tanimoto we use is the same as the Jaccard similarity, so perhaps we
> should point students to that instead?
>
> The citation is:
>
> Jaccard, Paul. "Étude comparative de la distribution florale dans une
> portion des Alpes et des Jura." Bull Soc Vaudoise Sci Nat 37 (1901):
> 547-579.
>
> However, are students supposed to know French to read it? Or should we
> point to the English translation at:
>
> THE DISTRIBUTION OF THE FLORA IN THE ALPINE ZONE.
> Paul Jaccard
> First published: February 1912
> https://doi.org/10.1111/j.1469-8137.1912.tb05611.x
>
> https://nph.onlinelibrary.wiley.com/doi/abs/10.1111/j.1469-8137.1912.tb05611.x
>
> (The cheminformatics literature has at least one paper which cites the
> original French, and at least one paper which cites the English
> translation.)
>
> In any case, there's really no connection between those papers and
> cheminformatics, other than for those interested in tracing the concept.
>
> That's why I think the Willett et al. paper(s) are all that a student
> really needs to read for the relevant history, while someone like me would
> like to read and document the more complete history.
>
> Andrew
> da...@dalkescientific.com
>
> _______________________________________________
> Blueobelisk-discuss mailing list
> Blueobelisk-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/blueobelisk-discuss