Re: [BlueObelisk-discuss] Compiling a List of Seminal Papers in Cheminformatics

Andrew Dalke Wed, 04 Sep 2019 04:46:00 -0700

> On Sep 3, 2019, at 17:48, Craig James <cja...@emolecules.com> wrote:
>  A good resource for this could be John Bradshaw and David Livingston's 
> chapter, "A History of the Development of Data Mining in Pharmaceutical 
> Research".

On Sep 3, 2019, at 18:18, Kevin Maik Jablonka <kevin.jablo...@epfl.ch> wrote:
> Also, there is an ongoing 'bucket list papers' project: 
> https://www.medchemica.com/bucket-list/ which might be relevant.

I think the problem with Christoph's list, and with the pointers from Craig and 
Kevin, is the question - what's the goal?

"Teaching and documentation purposes" is certainly too broad. As I pointed out 
in private email to Christoph, the seminal papers just on the topic of 
substructure search include:

- introduced in Zatopleg, based on constraints, and not implemented,
   nor is it easy to find. But it was definitely influential in the 1950s!
- substructure search implemented by Opler, but not in the literature
- first implemented in Ray and Kirsch, where they had to (re)invent
   the concept which a few years later was named 'backtracking'. This
   was so horribly expensive that the project had the code name "H-Bomb"
   because it was something they hoped they would never have to use.
- Zatopleg search implemented at BASF.
    It was still very expensive.
- Sussenguth invented techniques to make it more feasible, as his PhD
    thesis. It's hard to follow, and there are failure cases.
- Ullmann algorithm (fixes those failure cases)
- VF2 paper (improves on Ullmann for chemical graphs)
    RDKit and CDK use VF2. Likely other toolkits too.
    RDKit and CDK blogged about the observed improvements,
    which were then more formally published in Ehrlich and Rarey
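
To make the 'backtracking' idea in the list above concrete, here is a minimal 
sketch in Python. It is not the Ray and Kirsch, Ullmann, or VF2 algorithm, only 
plain depth-first backtracking with no pruning heuristics. The function name, 
the atom/bond representation, and the decision to ignore bond orders, 
aromaticity and hydrogens are all assumptions made for illustration.

```python
def find_substructure(pattern_atoms, pattern_bonds, target_atoms, target_bonds):
    """Return one mapping {pattern atom index: target atom index}, or None.

    Atoms are lists of element symbols; bonds are lists of (i, j) index pairs.
    Hypothetical helper for illustration; bond orders are ignored.
    """
    def adjacency(n_atoms, bonds):
        adj = {i: set() for i in range(n_atoms)}
        for a, b in bonds:
            adj[a].add(b)
            adj[b].add(a)
        return adj

    p_adj = adjacency(len(pattern_atoms), pattern_bonds)
    t_adj = adjacency(len(target_atoms), target_bonds)
    mapping = {}

    def extend(p):
        # All pattern atoms have been placed: a match was found.
        if p == len(pattern_atoms):
            return True
        for t in range(len(target_atoms)):
            if t in mapping.values():
                continue  # target atom already used
            if pattern_atoms[p] != target_atoms[t]:
                continue  # element mismatch
            # Every already-mapped pattern neighbor must be bonded to t.
            if any(q in mapping and mapping[q] not in t_adj[t]
                   for q in p_adj[p]):
                continue
            mapping[p] = t
            if extend(p + 1):
                return True
            del mapping[p]  # dead end: undo the choice and try the next atom
        return False

    return dict(mapping) if extend(0) else None


# Ethanol heavy atoms C-C-O; search for the C-O fragment.
ethanol_atoms = ["C", "C", "O"]
ethanol_bonds = [(0, 1), (1, 2)]
match = find_substructure(["C", "O"], [(0, 1)], ethanol_atoms, ethanol_bonds)
```

The search first tries mapping the pattern carbon to atom 0, finds no bonded 
oxygen, undoes that choice, and succeeds with atom 1. In the worst case the 
number of choices to undo grows exponentially, which is the cost the later 
methods try to reduce.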

For teaching purposes, I would instead point to John Barnard's review 
paper, rather than the seminal papers:

Barnard, J. M. Substructure Searching Methods: Old and New. J. Chem. Inf. 
Comput. Sci. 1993, 33 (4), 532–538. https://doi.org/10.1021/ci00014a001.

and probably Ehrlich and Rarey as a follow-up.

Barnard's paper is more comprehensive than my summary above, though he didn't 
list the hard-to-find Zatopleg connection.

I looked at John Bradshaw and David Livingstone's chapter, and note that they 
(correctly) describe it as the "authors’ personal experiences in the 
development of chemistry data mining technologies since the early 1970s".

That means, for example, that their treatment of (edge-notched) punched cards 
is incomplete. E.g., some edge-notched punched cards had multiple layers, 
allowing trinary or better searches, not just Boolean searches. Also, I'm 
pretty sure that some organizations used machine-sorted interior punched cards 
in the 1940s and 1950s to search "all of the company compound database" even 
before electronic databases. Both were rare, and not part of their personal 
experience.

Let me be clear - Bradshaw's writings on the topic were a strong influence on 
getting me interested in this historical topic! I'm instead highlighting the 
question - what is the point of this list of references?

As another example from their chapter, they discuss WLN, which is effectively 
irrelevant to modern cheminformatics but was very important from the 1950s to 
the 1980s. Any in-depth understanding of the topic would need to start with 
Dyson notation, which was carbon chain-based as it was meant to reflect the 
IUPAC nomenclature, but encoded for punched cards. I believe Wiswesser was 
influenced by Dyson notation. Dyson notation went on to become an IUPAC 
standard in the 1960s, but that was its last gasp of breath as effectively 
everyone moved either to WLN or canonical connection tables.

I'm also confused about the parenthetical comment "Here the four sections of 
the WLN have been separated by spaces (which does not happen in a regular WLN 
string)" as WLN most assuredly used spaces. See for example:

Wiswesser 'The Empty Column' Revisited - A Chemical Notation that Appeared with 
Computer Languages in 1950
http://www.dtic.mil/docs/citations/AD0706152
https://archive.org/details/DTIC_AD0706152

Finally, I think they underestimate the capabilities of pre-1970s 
(before-their-time) technology, as some of the things they mention as dating 
from MACCS were available in the earlier CAS, CIDS and BASF systems. However, 
CIDS and the BASF work rarely appear in modern history, in part because they 
were developed by the US military (CIDS) - not in the scientific literature - 
or in German (BASF). For example, CIDS used a simple nomenclature, derived from 
the postfix syntax described in the papers of Hiz and Eisman (two different 
papers). It was about as simple as SMILES ... but apparently did not influence 
the community at large, for reasons I'm still trying to understand.

On Sep 3, 2019, at 18:18, Kevin Maik Jablonka <kevin.jablo...@epfl.ch> wrote:
> Also, there is an ongoing 'bucket list papers' project: 
> https://www.medchemica.com/bucket-list/ which might be relevant.

I don't agree with the ECFP fingerprint description. It is not "one of the 
first papers describing a method to represent chemical structures in a computer 
as a unique fingerprint". There are many fingerprint papers earlier than 2010, 
and ECFP fingerprints are not unique. The paper even says it stops after a fixed 
number of Morgan iterations, rather than finding unique atom labels (up to 
symmetry), and "Note that the relationship between fingerprint features and the 
substructures may not always be one-to-one, that is, different substructural 
representations may share the same identifier (and, more rarely, different 
identifiers may represent the same underlying substructural representation)"

Or, the subtitle is "The next step in representing molecules as a single 
number" but ECFP fingerprints aren't single numbers - as the paper says, it 
generates an array of identifiers.
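
As a rough illustration of both points (a fixed number of Morgan-style 
iterations, and an array of identifiers rather than a single number), here is a 
minimal Python sketch. It is not the published ECFP algorithm: the starting 
atom invariants (element symbol and heavy-atom degree), the use of CRC32 as the 
hash, and the lack of duplicate-substructure removal are all simplifying 
assumptions, and the function name is hypothetical.

```python
import zlib


def circular_identifiers(atoms, bonds, iterations=2):
    """Collect atom-environment identifiers over a fixed number of rounds.

    Atoms are element symbols; bonds are (i, j) index pairs.
    Simplified sketch, not the ECFP procedure from the paper.
    """
    adj = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        adj[a].append(b)
        adj[b].append(a)

    def h(value):
        # Deterministic 32-bit hash; distinct environments can collide.
        return zlib.crc32(repr(value).encode("ascii"))

    # Round 0: one identifier per atom from its own properties.
    current = [h((symbol, len(adj[i]))) for i, symbol in enumerate(atoms)]
    features = set(current)
    for _ in range(iterations):
        # Each round folds in the neighbors' identifiers from the last round.
        current = [
            h((current[i], tuple(sorted(current[j] for j in adj[i]))))
            for i in range(len(atoms))
        ]
        features.update(current)
    return sorted(features)


# Ethanol heavy atoms, numbered two different ways.
ids_a = circular_identifiers(["C", "C", "O"], [(0, 1), (1, 2)])
ids_b = circular_identifiers(["O", "C", "C"], [(0, 1), (1, 2)])
```

The result is a list of several integers, it does not depend on how the atoms 
are numbered, and the loop stops after `iterations` rounds whether or not every 
atom has received a distinct label. Because each identifier is a 32-bit hash, 
two different environments can in principle share one.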

Plus, I think the concept of what are now called circular fingerprints was 
first explored by Penny back in the 1960s (and referenced, for example, in 
Willett's PhD thesis), as well as in DARCS's "FRELs" - though I find the DARCS 
papers to be impenetrable.

Do students really want to know all of this detail?

No.

Should it be documented better, and more completely?

Yes.

Is there any money for it? Not at all. And there are only a few people who read 
through the really old literature, much less the grey literature like the CIDS 
Army publications.

Cheers,

                                Andrew
                                da...@dalkescientific.com

