Hi Suliman,

On Dec 19, 2021, at 05:51, Suliman Sharif <sharifsulim...@gmail.com> wrote:

>> When was the current state of machine representation figured out?
>
> I would say the 1980s was after the invention of SMILES where they used
> something somewhat "readable", they got it started and now we continue is my
> thought there.
As additional factors to think about: 1980s SMILES didn't handle chirality or isotopes. Those were added in the 1990s. Computer databases in the 1970s, like MACCS, could already store those. Indeed, the lack of stereochemistry support in WLN was one of the factors which led to its decline around 1980. (Stereochemistry was given in human-readable notes.)

What could SMILES handle that MACCS's connection tables couldn't in 1980? I don't think the 1980s SMILES representation exceeds that of MCC (mechanical chemical code, https://pubs.acs.org/doi/pdf/10.1021/c160027a002 ), which includes isotopes but not stereochemistry.

>> Does cheminformatics include its roots in library science? Or are those now
>> different fields?
>
> I like chem + informatic because it's one character shorter and in my opinion
> sounds cooler. I mean you could say it's around the time we started
> constructing the IUPAC language trying to turn what's going on chemistry wish
> to a language representation and it's a part of library science. But anything
> is a part of a library science since we all record scientific information in
> some format.

I don't think I expressed my question well enough. The current "Journal of Chemical Information and Modeling" was previously the "Journal of Chemical Information and Computer Sciences", which was previously the "Journal of Chemical Documentation". Before J. Chem. Doc., papers were published in American Documentation or the Journal of Chemical Education. The word "Documentation" is used in those earlier journals because documentation science is the precursor to information science, coming out of the work of Otlet and La Fontaine. See the Traité de Documentation (1934) and their work on the Mundaneum (1910). "Documentation" was the hot topic in the mid-20th century.
Chemistry was one of the biggest data sets around (after legal cases), and much of the field we now call "cheminformatics" arose during the post-war era as a way to mechanize documentation management, first through punched cards and then through computers. Terms like "chemical descriptor" come directly out of this era, and the same researcher who coined both "descriptor" and "chemical descriptor" also coined the term "information retrieval", for an ACS conference.

So I don't mean the abstract "we all record scientific information in some format"; I mean the historic evolution of this field as a branch of library science, with practitioners who worked in libraries and published articles on how to manage their collections. (E.g., "The Charter: A "Must" for Effective Information System Planning and Design", http://dx.doi.org/10.1021/c160012a004 : "It is the product of research work by information center managers, information system supervisors, technical report file custodians, and others who undertook information storage and retrieval efforts".)

On the other hand, cheminformatics can also be interpreted as the field which (among other things) uses methods of chemical information handling originally developed for documentation management in order to model chemical behavior. That's the "... and Modeling" of JCIM. Someone can have a successful career in that aspect of cheminformatics without knowing anything about the connections to library science. Which means a book about cheminformatics has to decide what "cheminformatics" means, hence my question.

> Maybe we should teach IUPAC first again,

Again, what is your purpose? What topics do you de-emphasize in order to teach more about IUPAC? And from what I hear, IUPAC nomenclature has recently changed.

> Check out Morgan's paper and some slides I made from that paper in teaching.

I have read Morgan's paper.
Amusingly, the ACS included it in the final report of the NSF-funded work they did to develop and expand a computer-based Chemical Registry System, which means it's not behind a paywall: https://eric.ed.gov/?id=ED032214 , Appendix D, starting on PDF page 134.

I also looked at your text at https://sharifsuliman1.medium.com/understanding-morgan-f70186b172f6 . Since the slides are a bit ambiguous about a few concepts, here are some other things to consider:

"Well to do that he first decided he needed to come up with a rank ordering system, a way to sequentially at atoms in some sort of list for example for acetone:"

He didn't come up with a rank ordering system. He came up with a unique rank ordering. A non-unique rank ordering was in use in, e.g., Ray and Kirsch's 1957 computer substructure search implementation, and in Mooers' 1951 theoretical description.

"He chose to implement an old method of a Search Tree"

I think you should point out that these concepts were new at the time.

"Morgan decided the information would be stored in a series of 5 lists"

One of the things that makes that paper difficult to understand is how it uses the compact connection table, which is a representation I think no one uses these days. Those 5 lists are part of that specific representation, but not essential to the algorithm. This representation came from Gluck's work at Du Pont ("A Chemical Structure Storage and Search System Developed at Du Pont", https://pubs.acs.org/doi/pdf/10.1021/c160016a008 , presented 1964, published 1965).

Now, Gluck also had a canonicalization method, described in that paper as: "The atom numbers in the bond columns are the newly assigned rank positions. The two Atoms No. 4 have different atom ranks associated with their single bonds. The iterative procedure which follows the initial ordering breaks ties according to the magnitudes of the atoms to which the tied atoms are bonded. ...
This iterative process of reordering according to the new rank of the atoms in the bond columns continues until all atoms are uniquely ranked, in which case the compound is in canonical form, or until no further reordering is possible while ties still remain."

You can see the ties to the Morgan approach; Gluck then went to work at CAS with Morgan. The main problem is that Gluck's algorithm wasn't actually canonical. In "A Collection of Algorithms for Searching Chemical Compound Structure Analogs", at https://archive.org/details/DTIC_AD0460819/page/n19/mode/2up , you can see Lehman's counter-example showing how the algorithm failed. The Morgan algorithm resolved that problem.

"Essentially what you can do is start with a Radius of 0 around the atom."

I'm concerned that you've mixed up the "Morgan invariant", as it's described for ECFP-like fingerprints, with the algorithm that Morgan described in the paper. If you look at your radius=2 example, you'll see that 17 = 3*3 + (3+3+2): that is, the invariant for the initial carbon, squared, plus the sum of the invariants for the atoms at R=2 away. It no longer includes the R=1 invariants. You can see that even if the neighboring -OH has an initial invariant of 1,000, that value won't be part of the initial carbon's invariant.

Instead, for purposes of teaching I would start with Penny codes, from the paper immediately following Morgan's in the same issue, at https://pubs.acs.org/doi/pdf/10.1021/c160017a019 . On page 11 of that same "A Collection of Algorithms for Searching Chemical Compound Structure Analogs" link you can read about Penny codes: "Penny, in a recent paper, recognizes correctly that atom and bonding considerations alone are in some cases inadequate for distinguishing compounds. His method is concerned with enumerating the simple connectivity in the neighborhood of each atom."
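To make the contrast concrete, here is a minimal sketch of the iterative extended-connectivity refinement that Morgan's paper describes: start each atom at its degree, repeatedly replace each value with the sum of its neighbors' values, and stop once the number of distinct values stops increasing. (The adjacency dicts are made-up examples, and a real implementation also folds in atom/bond properties and a tie-breaking pass; this is just the core loop.)

```python
def morgan_ec_classes(adjacency):
    """adjacency: dict of atom index -> list of neighbor indices.
    Returns the extended-connectivity values from the last round that
    increased the number of distinct values."""
    # Round 0: each atom's invariant is its degree.
    ec = {a: len(nbrs) for a, nbrs in adjacency.items()}
    n_classes = len(set(ec.values()))
    while True:
        # Next round: each atom gets the sum of its neighbors' current values.
        new_ec = {a: sum(ec[n] for n in nbrs) for a, nbrs in adjacency.items()}
        new_n = len(set(new_ec.values()))
        if new_n <= n_classes:  # refinement stopped distinguishing more atoms
            return ec
        ec, n_classes = new_ec, new_n

# Pentane's carbon skeleton, C0-C1-C2-C3-C4:
pentane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(morgan_ec_classes(pentane))  # -> {0: 2, 1: 3, 2: 4, 3: 3, 4: 2}
```

Note that the degree-based round 0 gives only two classes for pentane; the center carbon is separated from the others only after one round of refinement. Atoms that are symmetric by connectivity stay tied, as they should; Morgan's full algorithm then uses these classes to pick a canonical atom numbering.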
As Penny states, "it is a unique expression of the atomic network within the immediate neighborhood of the subject atom and is an attribute of the atom as much as its chemical identity". Page 12 then goes into more detail. You'll see these are much more in line with your description. I personally think RDKit's "Morgan" fingerprint should really be called the "Penny" fingerprint, but I know that's a predilection of mine.

> It's weird to me that data structures is not a core requirement for
> cheminformatic folk.

Like all interdisciplinary fields, cheminformatics uses only a subset of the larger topic of "data structures", and it has some specialized needs not covered by normal introductory classes. I have a CS degree. Data structures as taught by computer scientists include many topics I have not yet needed in cheminformatics: I've never needed to care about B-tree implementations, or red-black balanced binary trees, and I don't think I've even had to use Dijkstra's algorithm, which is pretty molecular-graph-adjacent. Meanwhile, intro data structure classes don't teach substructure isomorphism algorithms, and I think Bloom filters (conceptually related to molecular fingerprints) are also a more advanced topic. On the other hand, I have used concepts I learned in automata theory.

So while I completely support the idea that a cheminformatics textbook should include a deeper treatment of graph theory than, say, the 5 pages Gasteiger gives in his textbook, I also completely support the idea that a semester-long general-purpose programming course, plus a semester-long data structures course, isn't appropriate.

Cheers,

Andrew
da...@dalkescientific.com

_______________________________________________
Blueobelisk-discuss mailing list
Blueobelisk-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/blueobelisk-discuss