> On Sep 3, 2019, at 17:48, Craig James <cja...@emolecules.com> wrote:
> A good resource for this could be John Bradshaw and David Livingston's
> chapter, "A History of the Development of Data Mining in Pharmaceutical
> Research".

On Sep 3, 2019, at 18:18, Kevin Maik Jablonka <kevin.jablo...@epfl.ch> wrote:
> Also, there is an ongoing 'bucket list papers' project:
> https://www.medchemica.com/bucket-list/ which might be relevant.

I think the problem with Christoph's list, and with the pointers from Craig and Kevin, is the question - what's the goal? "Teaching and documentation purposes" is certainly too broad.

As I pointed out in private email to Christoph, the seminal papers just on the topic of substructure search include:

 - introduced in Zatopleg, based on constraints, and not implemented, nor is it easy to find. But it was definitely influential in the 1950s!
 - substructure search implemented by Opler, but not in the literature
 - first implemented in Ray and Kirsch, where they had to (re)invent the concept which a few years later was named 'backtracking'. This was so horribly expensive that the project had the code name "H-Bomb" because it was something they hoped they would never have to use.
 - Zatopleg search implemented at BASF. It was still very expensive.
 - Sussenguth invented techniques to make it more feasible, as his PhD thesis. It's hard to follow, and there are failure cases.
 - Ullmann algorithm (fixes those failure cases)
 - VF2 paper (improves on Ullmann for chemical graphs). RDKit and CDK use VF2. Likely other toolkits too. RDKit and CDK blogged about the observed improvements,
 - then more formally published in Ehrlich and Rarey

For teaching purposes, though, I would point to John Barnard's review paper rather than to the seminal papers:

  Barnard, J. M. Substructure Searching Methods: Old and New. J. Chem. Inf. Comput. Sci. 1993, 33 (4), 532–538. https://doi.org/10.1021/ci00014a001.

and probably Ehrlich and Rarey as a follow-up. Barnard's paper is more comprehensive than my summary above, though he didn't list the hard-to-find Zatopleg connection.

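To make the backtracking idea mentioned above concrete, here is a minimal sketch in Python. It is not a reconstruction of the Ray and Kirsch program, nor of Ullmann or VF2, which add pruning on top of the same basic idea. The graph representation (a list of element symbols plus a set of bonded atom-index pairs) and the name `find_substructure` are assumptions made purely for illustration, and bond orders are ignored.

```python
def find_substructure(query, target):
    """Return one mapping {query_atom: target_atom}, or None if no match.

    Each molecule is (elements, bonds): a list of element symbols and a
    set of frozenset({i, j}) atom-index pairs. Bond orders are ignored.
    """
    q_elements, q_bonds = query
    t_elements, t_bonds = target
    mapping = {}   # query atom index -> target atom index
    used = set()   # target atoms already assigned

    def consistent(q_atom, t_atom):
        # Elements must agree, and every bond from q_atom to an
        # already-mapped query atom must exist between their images.
        if q_elements[q_atom] != t_elements[t_atom]:
            return False
        for other_q, other_t in mapping.items():
            if (frozenset((q_atom, other_q)) in q_bonds
                    and frozenset((t_atom, other_t)) not in t_bonds):
                return False
        return True

    def extend(q_atom):
        if q_atom == len(q_elements):
            return True  # every query atom has been placed
        for t_atom in range(len(t_elements)):
            if t_atom not in used and consistent(q_atom, t_atom):
                mapping[q_atom] = t_atom
                used.add(t_atom)
                if extend(q_atom + 1):
                    return True
                # Dead end: undo the assignment and try the next candidate.
                del mapping[q_atom]
                used.remove(t_atom)
        return False

    return dict(mapping) if extend(0) else None


# Hypothetical example: heavy atoms of ethanol, C-C-O.
ethanol = (["C", "C", "O"], {frozenset((0, 1)), frozenset((1, 2))})
c_o = (["C", "O"], {frozenset((0, 1))})
c_n = (["C", "N"], {frozenset((0, 1))})
match = find_substructure(c_o, ethanol)
no_match = find_substructure(c_n, ethanol)
```

In the C-O query, the search first tries the terminal carbon, finds no oxygen bonded to it, backs up, and then succeeds with the second carbon. The worst case is exponential in the number of atoms, which fits the "horribly expensive" description of the early implementations.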
I looked at John Bradshaw and David Livingston's chapter, and note that they (correctly) describe it as the "authors’ personal experiences in the development of chemistry data mining technologies since the early 1970s".

That means, for example, that their treatment of (edge-notched) punched cards is incomplete. E.g., some edge-notched punched cards had multiple layers, allowing trinary or better searches, not just Boolean searches. Also, I'm pretty sure that some organizations used machine-sorted interior punched cards in the 1940s and 1950s to search "all of the company compound database" even before electronic databases. Both were rare, and not part of their personal experience.

Let me be clear - Bradshaw's writings on the topic were a strong influence on getting me interested in this historical topic! I'm instead highlighting the question - what is the point of this list of references?

As another example from their chapter, they discuss WLN, which is effectively irrelevant to modern cheminformatics but was very important from the 1950s to the 1980s. Any in-depth understanding of the topic would need to start with Dyson notation, which was carbon chain-based as it was meant to reflect the IUPAC nomenclature, but encoded for punched cards. I believe Wiswesser was influenced by Dyson notation. Dyson notation went on to become an IUPAC standard in the 1960s, but that was its last gasp of breath as effectively everyone moved either to WLN or canonical connection tables.

I'm also confused about the parenthetical comment "Here the four sections of the WLN have been separated by spaces (which does not happen in a regular WLN string)" as WLN most assuredly used spaces.
See for example:

  Wiswesser
  'The Empty Column' Revisited - A Chemical Notation that Appeared with Computer Languages in 1950
  http://www.dtic.mil/docs/citations/AD0706152
  https://archive.org/details/DTIC_AD0706152

Finally, I think they underestimate the capabilities of pre-1970s (before-their-time) technology, as some of the things they mention as dating from MACCS were available in the earlier CAS, CIDS and BASF systems. However, CIDS and the BASF work rarely appear in modern histories, in part because they were developed by the US military (CIDS) - not in the scientific literature - or in German (BASF).

For example, CIDS used a simple nomenclature, derived from the postfix syntax described in the papers of Hiz and Eisman (two different papers). It was about as simple as SMILES ... but apparently did not influence the community at large, for reasons I'm still trying to understand.

On Sep 3, 2019, at 18:18, Kevin Maik Jablonka <kevin.jablo...@epfl.ch> wrote:
> Also, there is an ongoing 'bucket list papers' project:
> https://www.medchemica.com/bucket-list/ which might be relevant.

I don't agree with the ECFP fingerprint description. It is not "one of the first papers describing a method to represent chemical structures in a computer as a unique fingerprint".

There are many fingerprint papers earlier than 2010, and ECFP fingerprints are not unique. The paper even says it stops after a fixed number of Morgan iterations, rather than finding unique atom labels (up to symmetry), and "Note that the relationship between fingerprint features and the substructures may not always be one-to-one, that is, different substructural representations may share the same identifier (and, more rarely, different identifiers may represent the same underlying substructural representation)".

Or, the subtitle is "The next step in representing molecules as a single number" but ECFP fingerprints aren't single numbers - as the paper says, it generates an array of identifiers.

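The two points above (a fixed number of Morgan-style iterations, and an array of identifiers rather than a single number) can be illustrated with a toy sketch. This is not the published ECFP algorithm: the real method uses richer atom invariants, bond information, its own hash function, and duplicate-structure removal. The function name `circular_identifiers`, the neighbor-list input format, and the use of CRC32 as the hash are all assumptions made only for illustration.

```python
import zlib


def circular_identifiers(elements, neighbors, iterations=2):
    """Collect identifiers from a fixed number of Morgan-style rounds.

    elements: list of element symbols.
    neighbors: list of neighbor-index lists, one per atom.
    """
    def h(value):
        # Deterministic 32-bit hash. Distinct inputs can collide, so
        # identifiers are not guaranteed to be one-to-one with substructures.
        return zlib.crc32(repr(value).encode("utf-8"))

    current = [h(symbol) for symbol in elements]
    collected = set(current)
    for round_number in range(1, iterations + 1):
        # Each atom's new identifier depends on its previous identifier
        # and the sorted previous identifiers of its neighbors.
        current = [
            h((round_number, current[atom],
               sorted(current[n] for n in neighbors[atom])))
            for atom in range(len(elements))
        ]
        collected.update(current)
    # The result is a list of identifiers, not a single number.
    return sorted(collected)


# Hypothetical examples: heavy atoms of ethanol (C-C-O) and pentane.
ethanol_ids = circular_identifiers(["C", "C", "O"], [[1], [0, 2], [1]])
pentane_neighbors = [[1], [0, 2], [1, 3], [2, 4], [3]]
pentane_round0 = circular_identifiers(["C"] * 5, pentane_neighbors, iterations=0)
```

With `iterations=0` every carbon in pentane gets the same identifier, so stopping after a fixed number of rounds does not produce unique atom labels. The ethanol result is a list of several identifiers.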
Plus, I think the concept of what are now called circular fingerprints was first explored by Penny back in the 1960s (and referenced, for example, in Willett's PhD thesis), as well as in DARCS's "FRELs" - though I find the DARCS papers to be impenetrable.

Do students really want to know all of this detail? No. Should it be documented better, and more completely? Yes. Is there any money for it? Not at all. And there are only a few people who read through the really old literature, much less the grey literature like the CIDS Army publications.

Cheers,

				Andrew
				da...@dalkescientific.com

_______________________________________________
Blueobelisk-discuss mailing list
Blueobelisk-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/blueobelisk-discuss