Re: [Fis] Chemical information: a field of fuzzy contours ?

Igor Gurevich Wed, 21 Sep 2011 03:23:38 -0700

2011/9/16 Michel Petitjean <petitjean.chi...@gmail.com>:
> Chemical information: a field of fuzzy contours ?
> -------------------------------------------------
>
> Before turning to chemistry, I would recall some facts that I noticed
> on the FIS forum:
> although many people consider that a unifying definition of
> information science is possible (to be constructed),
> a number of other people consider that there are many concepts of
> information which are not necessarily
> the facets of a unique concept, so that it could be better to speak
> about "information scienceS",
> and not about "information science".
> I can read on http://en.wikipedia.org/wiki/Information_science
> << Information science is an interdisciplinary science primarily
> concerned with the
> analysis, collection, classification, manipulation, storage, retrieval
> and dissemination of information. >>
> and some fewer lines above:
> << Information Science consists of having the knowledge and
> understanding on how to collect, classify, manipulate, store, retrieve
> and disseminate any type of information. >>
> Clearly, collecting, storing, and retrieving information lead us to think
> that we must deal with databases.
> The question "where is information" is neglected, although answering
> it is enlightening:
> no doubt much information is stored in data banks.
> There are strong connections of Information Science(s) with Data
> Mining (DM) and Knowledge Discovery in Databases (KDD).
>
> Is the situation clearer in chemistry ?
>
> Undoubtedly there is a field of chemical information.
>
> The ACS (American Chemical Society) has a Division of Chemical
> Information (CINF),
> named as such in 1975, but which in fact goes back to 1943
> (http://www.acscinf.org/).
> CINF is active and organizes various meetings which can be found on the web.
> Visit also http://www.libsci.sc.edu/bob/chemnet/chchron.htm, an
> informative website.
>
> The ACS publishes the "Journal of Chemical Information and Modeling"
> renamed so in 2005
> after having been named "Journal of Chemical Information and Computer
> Sciences" from 1975 to 2004,
> itself being the continuation of the "Journal of Chemical
> Documentation" from 1961 to 1974.
> In fact, it is the same journal (one volume per year), which turned to
> chemical information the same year that CINF received its current name.
>
> Interestingly, still in 1975, the main cheminformatics lab in France
> (in fact the only one in France at that time) was renamed.
> The old name was LCOP ("Laboratoire de Chimie Organique Physique"),
> and the new name was ITODYS, still in use,
> meaning until 2001: "Institut de TOpologie et de DYnamique des
> Systemes". This name, which can be understood in English due
> to the close similarity between the French and the English words, was
> partly due to the existence of a distance in the molecular graphs
> (this distance is the smallest number of chemical bonds separating two
> atoms), and as is known, a distance induces a topology:
> it clearly acknowledged the cheminformatics aspects of the research
> performed in the lab.
>
> Chemical Information Science, which is sometimes named Chemical Informatics
> (http://www.indiana.edu/~cheminfo/acs800/soced_wash.html)
> can be reasonably considered to be a part of the Cheminformatics field.
> This latter is defined on Wikipedia
> (http://en.wikipedia.org/wiki/Cheminformatics):
> "Chemoinformatics is the mixing of those information resources to
> transform data into information and
> information into knowledge for the intended purpose of making better
> decisions faster in the area of
> drug lead identification and optimization".
> This definition, dated from 1998, clearly acknowledges the extraction
> of information from data,
> but it is restrictive since it discards all pioneering works about
> computerization of chemical databases,
> including structural formulas coding and structural motifs retrieval,
> which historically cannot be denied
> to be the core of the cheminformatics field.
>
> Now let me write a few more lines about the history of cheminformatics in France,
> which is a bit funny but sheds light on the debate about the definition of the
> field of chemical information.
> The French pioneer was Jacques-Emile Dubois (1920-2005), founder of
> the LCOP and of the ITODYS,
> who published his first cheminformatics paper in 1966. One of his main
> ideas was to use the concept
> of concentric layers in the molecular graphs: the nodes are the atoms
> and the edges are the bonds,
> the neighbours of a node constitute the first concentric layer around this node,
> the next neighbours constitute the second layer, and so on.
> This concept was known to mathematicians such as Cayley and Polya.
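
The concentric-layer idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the adjacency-list representation, the name concentric_layers, and the example (the five carbons of 2-methylbutane, numbered arbitrarily) are hypothetical choices, and nothing here is taken from the DARC code or the DARC system. The sketch uses a breadth-first search, so the layer index of an atom is the smallest number of bonds separating it from the focus atom.

```python
from collections import deque


def concentric_layers(adjacency, focus):
    """Group atoms into concentric layers around a focus atom.

    adjacency maps each atom label to the labels of its bonded neighbours.
    Layer k holds the atoms whose bond-count distance from the focus is k.
    """
    distance = {focus: 0}
    queue = deque([focus])
    while queue:
        atom = queue.popleft()
        for neighbour in adjacency[atom]:
            if neighbour not in distance:  # first visit gives the shortest path
                distance[neighbour] = distance[atom] + 1
                queue.append(neighbour)

    layers = {}
    for atom, d in distance.items():
        layers.setdefault(d, []).append(atom)
    return [sorted(layers[d]) for d in sorted(layers)]


# Hypothetical example: carbon skeleton of 2-methylbutane,
# bonds 1-2, 2-3, 3-4 and 2-5 (hydrogens omitted).
SKELETON = {1: [2], 2: [1, 3, 5], 3: [2, 4], 4: [3], 5: [2]}

if __name__ == "__main__":
    for k, layer in enumerate(concentric_layers(SKELETON, focus=2)):
        print(f"layer {k}: {layer}")
```

With atom 2 as focus, layer 1 holds its three neighbours and layer 2 the remaining carbon; choosing another focus atom gives a different layering of the same graph.
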
> Here, the challenge was to explain to experimental chemists that in a
> number of applications, such as QSAR
> (Quantitative Structure-Activity Relationship), the use of sets of two
> concentric layers around focus atoms
> may be more efficient than the usually taught approaches based on
> skeletons and substituents.
> Dubois also thought that this concept could help to retrieve rapidly a
> chemical motif in a large
> database of structural formulas. An efficient solution of this rapid
> retrieval problem was found
> by Roger Attias at the end of the 70's, and was known under the name
> "DARC system".
> In fact, DARC was initially the name of a linear code of structural
> formulas derived from the
> work of Dubois at the end of the 60's, but the DARC code was never
> part of the DARC system.
> However, the DARC system is still used at the Questel Company for
> storage of Markush formulas
> (i.e., generic formulas covering a combinatorial list of structural formulas)
> and retrieval of Markush formulas and motifs in patent databases:
> this is a highly technical subfield of cheminformatics.
> It is of interest to mention that Attias had followed both chemistry courses
> and computer science courses, thus demonstrating that pluridisciplinarity
> is important,
> a fact which is commonly admitted now, but which was not so well
> accepted in the 70's.
> In the 80's, the cheminformatics field was mainly promoted in France
> by Dubois, but,
> due to various difficulties in communicating with experimentalists, the
> field was progressively neglected
> and was no longer supported at the end of the 90's by the CNRS (Centre
> National de la Recherche Scientifique),
> the main public organization funding the ITODYS in addition to the
> University Paris 7.
> The situation was so bad that the field, first named "Informatique
> Chimique" (i.e., it is computer sciences
> applied to chemistry), was in the 90's renamed "Chimie Informatique"
> (i.e., it is chemistry with
> computer sciences aspects), and at the beginning of the 2000's these
> terms were completely prohibited
> and the cheminformatics activities were called just "modeling".
> At the same time (January 2001), the meaning of ITODYS was changed to
> "Interfaces, Traitements, Organisation et Dynamique des Systemes": no
> more reference to cheminformatics.
> In fact, the cheminformatics activities completely disappeared from
> the ITODYS and from the Chemistry Department of
> the University Paris 7 when I left it in 2007: all other
> cheminformaticians of the lab had migrated, retired, or died.
>
> The main problem encountered was: where to classify cheminformatics ?
> This classification is of crucial importance for evaluation and funding.
> At the beginning cheminformatics fell under Organic Chemistry (Dubois
> was primarily an Organic Chemist),
> and at the end it fell under Quantum Chemistry, due among other reasons
> to some programming aspects
> that were viewed from afar as identical in both fields.
> But most Organic Chemists did not care about programming and most
> French Quantum Chemists paid little attention
> to molecular graphs and related stuff.
> In all cases, it was assumed that cheminformatics clearly dealt only
> with chemistry, and anyway was
> considered as a tiny field. Most cheminformaticians were isolated
> within some chemistry labs,
> for which no key word recalling any cheminformatics activity was
> likely to appear on official lab documents.
>
> At the end of 2006, I learned that a group of structural
> bioinformaticians were much interested in chemoinformatics.
> They dealt with virtual screening, ligand based techniques, etc.,
> and they had the same problems of visibility of their field as the
> cheminformaticians.
> Then, in 2007, they attended a cheminformatics meeting in Strasbourg
> organized by Alexandre Varnek, a quantum chemist who turned to
> cheminformatics.
> There were also several members of private companies such as Novartis,
> Sanofi-Aventis, Servier.
> A very positive result emerged from the meeting: the SFCi was created
> (Societe Francaise de Chemoinformatique,
> http://www.cheminformatique.fr/).
> The Board, containing 9 members, is a mix of chemists and structural
> bioinformaticians, and also a mix of academics and members of private
> companies.
> All these people have an interest in cheminformatics.
> This creation shows how the two communities of cheminformaticians
> merged: those coming from chemistry, and those coming from structural
> bioinformatics.
> Technical remark: for years, among the cheminformaticians, the
> chemists dealt mostly with molecules represented by graphs,
> whereas the structural bioinformaticians dealt mostly with molecules
> represented by sets of points in 3D space
> (this is a simplified view of what is really used, of course).
> In 2010, the SFCi got financial support from the INSERM (a public
> organization funding medical research), and in 2011 the CNRS,
> which in the past banished even the words defining cheminformatics,
> is reintroducing the field via a GDR (Groupement de Recherche).
> Note that cheminformatics was reintroduced in Paris 7 University in
> 2009, but in the Life Sciences Department
> (at the MTi, an INSERM funded Lab), not in the Chemistry Department.
> Cheminformatics is an expanding pluridisciplinary field, but it is
> the victim of an old view of how the disciplines should be classified.
> Some experimentalists even consider that what is done by computer is
> just technical help, and in no way could be a science:
> after all, their children are playing with some PC at home, so these
> experimentalists (fortunately few) may deduce that working on a computer
> should be easy.
>
> Since the creation of the SFCi in 2007, annual meetings and other
> cheminformatics events have been successfully organized in France,
> and almost each time we discuss the definition of the field,
> without being able to agree on a common one !!
> For many years, the specificity of cheminformatics was the handling
> of molecular graphs, but owing to the story above,
> it is just a part of the field. Thus, on various occasions, I proposed
> to characterize cheminformatics via the handling
> (includes retrieval, etc.) of medium or large collections of
> molecules, i.e. the chemical database aspect is the main specificity of
> cheminformatics.
> It is in agreement with what is suggested in Wikipedia, and it fully
> matches both the pioneering works in the field and the current research
> activities.
> Cheminformaticians are often asked by chemists to say where
> the frontiers of their field lie.
> So, on the occasion of some talks in front of various committees, I
> proposed the following:
> * Difference between cheminformatics and modeling: cheminformatics is
> a subfield of modeling.
> In fact, any time somebody writes a math formula or even a chemical
> formula, it is modeling.
> * Difference with "theoretical chemistry" (i.e., quantum chemistry,
> molecular mechanics, etc.):
> theoretical chemists consider only a limited number of molecules at a time.
> * Difference with chemometrics: chemometrics does not involve the
> molecule by itself; at best, it involves some of its properties,
> but it does not involve the molecule as a whole.
> Here, I mention that the definition found on Wikipedia
> (http://en.wikipedia.org/wiki/Chemometrics) makes chemical information
> strongly connected with chemometrics: "Chemometrics is the science
> of extracting information from chemical systems by data-driven means".
> If the extracted information is considered to be "chemical
> information", then a consequence of this definition
> is that chemometrics should be a subfield of cheminformatics. That
> consequence may shock some chemometricians.
> But the content of the Wikipedia webpage focusses on methods which do
> not refer to the molecules themselves:
> they rather refer to associated chemical data, such as
> physico-chemical or biological properties.
> Note here that data analysis methods and data mining techniques are
> rather inadequate when the primary data
> is a graph or a whole set of points, or both, or something more complex.
> Some examples illustrating this inadequacy are:
> having a database of one million different graphs, can we display
> it graphically ?
> Can we exhibit a mean graph ? Can we evaluate the dispersion about the
> mean graph (if defined) ?
> Can we exhibit "extreme" graphs (apart from single nodes) ? Can we
> correlate graphs with numbers ?
> Clearly, we need specific methods to answer these questions, and to
> their analogs in the 3D case.
>
> A typical cheminformatics class of problems which at first glance
> seems unrelated to chemical databases
> is the assignment of stereoisomerism. E.g., flagging as R or S the
> configuration of an asymmetric carbon,
> flagging as E or Z the stereoisomerism induced by a C=C double bond, etc.
> In theory, such problems are solved by the application of the Cahn,
> Ingold and Prelog (CIP) priority rules of substituents,
> which are commonly taught to undergraduate chemistry students.
> Inexperienced chemists may think that it is easy, and most of the time it is.
> But it becomes a very hard cheminformatics problem when we face a
> chemical database for which
> the computer programme does not "see" the molecules: because the
> programme cannot know what molecules
> it will read, it should work in any case, at least deciding when a
> stereo flag can be computed
> and when it can't. It involves a deep understanding of the coding of
> the molecules,
> including mesomerism and tautomerism problems, etc. Even deciding about
> aromaticity can be difficult
> in some cases. In fact, the CIP rules are based on the assignment of
> priority to atoms,
> a problem closely related to that of defining a canonical numbering
> of the atoms, which is very hard to solve.
> Again, in simple cases, it is easily done (e.g. for nomenclature purposes),
> but think of the difficulties faced at CAS (Chemical Abstracts
> Service) to computerize all molecules
> described in the literature (until now, more than 60 million were recorded).
> Even defining what is a chemical compound is a difficult challenge: it
> is why the CAS Registry Number (RN) was created.
> Storing structural formulas on computer is the historical core problem
> of cheminformatics,
> and obviously it is in a context of database building. The main
> question is: when something
> is stored on the computer, what do the resulting records mean ?
> E.g., what does a given bond value mean ?
> Can a given bond value be understood without looking at some adjacent
> bond values and other data ?
> What does the content of the records mean ? In fact what we needed was to
> store primary chemical information,
> and for that we needed to build rigorous coding rules, so that the
> chemist reading the records can
> understand what the content of these records really means.
>
> For those who are inexperienced with databases in general, I consider
> one of the simplest examples commonly encountered:
> build a directory of names, say, of some colleagues. But some women
> have two names, some people have surnames
> (the "first name/family name" structure is culture and country
> dependent), Russian people may have various English translations of
> their name,
> Chinese people too, etc. If you add to that the physical addresses,
> building the directory can be a real nightmare:
> has it ever happened that you were unhappy with a form in which your own data
> could not fit what was expected ? Probably yes.
> When the computerized database is read, it is clear that its content
> should be understood at the information level:
> what exactly each field of an entry in the directory means. Without
> this understanding, wrong conclusions can be
> derived when reading the entries, and complex information to be
> extracted can be biased, if not erroneous.
> Now imagine the difficulties with complex databases such as structural
> chemical databases !
> Remark: once the data are recorded in the database, you can reformat the
> database as many times as you like,
> but it is too late to change the meaning of its content: the latter
> should be carefully thought out before recording any data.
> Building and updating a database is generally expensive, and deciding
> about the meaning of its content to be stored is a crucial step.
> Alas, ambiguities and problems are often discovered too late.
> I may say that focussing on formatting and structure of databases is
> part of computer sciences,
> whereas focussing on the meaning of what is stored is rather part of
> information science.
>
> Now we return to chemistry. Cheminformatics is related to
> chemical databases, and chemical information is to be extracted from
> these databases.
> Recall that information is not to be confused with data, and that
> extracting information from data can be complex, but not always.
> I would just mention one example of what could be considered to be a
> piece of chemical information, and that I called the parity phenomenon.
> In 1990, I published in J. Chem. Inf. Comput. Sci. the distribution of
> the number of carbons per compound in a database
> of 3,424,428 compounds. This database contained most compounds
> recorded by CAS until July 1978.
> It appeared that the even values were systematically "preferred" to
> the odd values
> (http://petitjeanmichel.free.fr/itoweb.petitjean.graphs.html#PARITY).
> In fact it is not due to a coherent action of all chemists over the
> world: the main part of the explanation
> (published in the 1990 paper) relies on graph theory, although the
> database emerged from the human activities.
> In 1996 and 1997, the phenomenon was rediscovered in the Beilstein
> database, giving rise to four notes in Nature,
> none of them mentioning the explanation (the original 1990 paper was not cited).
> Here comes my question: in the example above, is the chemical
> information the parity phenomenon by itself,
> or is the chemical information its explanation, or is it something else ?
> Anyway, this chemical information emerged from the chemical database,
> and couldn't be retrieved without it.
>
> Chemical information science is a subfield of information science
> dealing with molecules,
> and there is a close relation of information science(s) with
> databases, in general. If not, where is information ?
> Maybe in our heads, and not only can it be communicated (e.g.,
> teaching), but it can also be stored (not trivial to do properly).
> For centuries it was stored in books. Now, it is in computer
> databases, giving us a faster access to an enormous richness.
> We just need people and time to investigate all that: an exciting task !
>
> All web sites cited in the text above were accessed on 12 September 2011.
>
> Michel Petitjean
> MTi, INSERM UMR-S 973, University Paris 7,
> 35 rue Helene Brion, 75205 Paris Cedex 13, France.
> Phone: +331 5727 8434; Fax: +331 5727 8372
> E-mail: petitjean.chi...@gmail.com (preferred),
> michel.petitj...@univ-paris-diderot.fr
> http://petitjeanmichel.free.fr/itoweb.petitjean.graphs.html
>
> _______________________________________________
> fis mailing list
> fis@listas.unizar.es
> https://webmail.unizar.es/cgi-bin/mailman/listinfo/fis
>

Dear Colleagues,

Chemical information is not a field of fuzzy contours!

In 1968, A.D. Ursul, on the basis of philosophical considerations, gave the
single, unified, unique definition of information: «…Information
expresses property of a substance which is general. … Concept of
information is information reflexion as objective-real property of
objects not dependent on the subject lifeless and wildlife, a society,
and property of knowledge, thinking …, thus, is proper both to
material, and ideal. It is applicable and to the substance
performance, and to the consciousness performance. If objective
information can be considered as property of substance the ideal,
subjective information is reflexion of objective, material
information… V.M. Glushkov characterises the information as
heterogeneity in energy distribution (or substances) in space and in
time … The information exists so far as there are the material bodies
and, therefore, the heterogeneities created by it». (Ursul A.D. Nature
of the Information: the Philosophical essay. Moscow. Political
publishing house. 1968. 288 p.). “Information is heterogeneity, stable
for some definite time. Regardless of the nature of heterogeneity,
would be it letters, words, phrases or - elementary particles, atoms,
molecules, or - people, groups, societies, etc”. (Gurevich I.M. Law of
informatics - a basis for research and design of complex communication
and management systems. (In Russian). «Ecos». Moscow. 60 p.).
If we use different definitions of information, we will obtain
assessments and results that cannot be compared and are impossible to
generalize.
The measure of the degree of heterogeneity or information is Shannon's
information entropy (Shannon, 1948) and other information
characteristics (information divergence, joint entropy, communication
information, differential information capacity).
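
As a minimal illustration of the first of these measures, Shannon's entropy of a collection of symbols can be computed as in the following Python sketch. The function name shannon_entropy and the example (the element symbols of the nine atoms of ethanol, C2H6O) are assumptions chosen for illustration only; this is not the procedure used in the works cited here to estimate the volume of information in a molecule.

```python
import math
from collections import Counter


def shannon_entropy(symbols):
    """Return the Shannon entropy, in bits per symbol, of a sequence.

    H = sum over distinct symbols of p * log2(1 / p), where p is the
    relative frequency of the symbol in the sequence.
    """
    counts = Counter(symbols)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("entropy of an empty sequence is undefined")
    return sum((n / total) * math.log2(total / n) for n in counts.values())


if __name__ == "__main__":
    # A homogeneous sequence carries no heterogeneity: entropy is zero.
    print(shannon_entropy("AAAA"))
    # Two equally frequent symbols: one bit per symbol.
    print(shannon_entropy("ABAB"))
    # Hypothetical example: element symbols of the atoms of ethanol.
    ethanol_atoms = ["C"] * 2 + ["H"] * 6 + ["O"]
    print(shannon_entropy(ethanol_atoms))
```

A sequence made of a single repeated symbol gives zero, in line with the reading of entropy as a measure of heterogeneity.
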
The proposed definition and the information characteristics can
describe information (heterogeneity) of any nature.
The rationale of this definition of information can be based on the
laws of development (evolution) of the Universe.
From 13.7 ± 0.13 billion years ago to 3.8 billion years ago (about 10
billion years) there was no life in the Universe. There was
information only in the form of physical and chemical heterogeneities.
Its existence does not depend on the existence of an Observer (Díaz &
Pérez-Montoro, 2010, "Is information a sufficient basis for cognition?",
Part 2). The heterogeneities (elementary particles, atoms, molecules,
…) possess certain information (and physical) characteristics,
properties (information properties of the first order), in particular
they contain certain volume of information. Interaction of
heterogeneities leads to change of their information characteristics.
The Observer appeared (very approximately) a few billion years ago and gave
new properties (information properties of the second order) to
information in the form of biological heterogeneities, created by life:
Perception (3.8 billion years ago), Memory, Formation (Creation),
Consciousness, Thinking (200 million years ago), Imagination, Mind,
Intelligence, Knowledge, Cognition, Representation, Content, Meaning,
Value, ... (a few million years ago), …
Informatics is the Science of Information. The new synthetic
discipline which unites physics and information theory was
given the name «Physical Informatics». Physical Informatics is the
science which uses information methods to study natural
systems. Physical Informatics is the modern Science of Information in
Physical and Chemical Systems, including Quantum Informatics, and is
the basis of Informatics of the Living Systems. Chemical Informatics
is part of Physical Informatics.
Among the main questions of Physical Informatics are:
- Estimates of volume of information in the physical, chemical and
biological systems (fundamental and elementary particles, atoms,
molecules, gases, liquids, solids, stars, black holes, ..., RNA, DNA,
cells, viruses, organisms, ..., the Universe). For example, volume of
information in a molecule is the sum of volumes of information in the
atoms and the volume of information in the structure of a molecule.
- Informational constraints on the formation, development,
interconversion of the fundamental and elementary particles, atoms,
molecules, gases, liquids, solids, stars, black holes, ..., RNA, DNA,
cells, viruses, organisms, ...
- Fundamental limitations on memory capacity and productivity of
information systems.
You can see the first results of Physical Informatics (Gurevich I.M.
«The information characteristics of physical systems». (In Russian).
«Cypress». Sevastopol. 2010. 260 p.
http://www.ipiran.ru/gurevich/Gurevich_info_charasteristics_rus_book.pdf
http://www.ipiran.ru/publications/publications/gurevich/informatic/index.asp.
http://glossarium.bitrum.unileon.es/system/app/pages/search?q=Information+as+heterogeneity&scope=search-site).
Using the single, unified, unique definition of information, the
properties of information, and the first results of Physical Informatics, we
can create the science of information – informatics, which also describes
chemical systems.

Best Wishes,
Igor Gurevich
