On Tue, Apr 24, 2012 at 8:22 AM, Jörg Kurt Wegner <joergkurtweg...@gmail.com> wrote:
> Third, I would highly recommend that we replace the tautomerization
> framework with an alternative solution, e.g. the SMIRKS enumeration
> from Markus Sitzmann. The SMIRKS patterns are part of his publication
>
> Article (sin10)
> Sitzmann, M.; Ihlenfeldt, W.-D. & Nicklaus, M. C.
> Tautomerism in large databases
> J Comput Aided Mol Des, 2010, 24, 521-551
> DOI 10.1007/s10822-010-9346-4
> PMID 20512400

For those of you who haven't read it, here's a link:

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2886898/

This is a very nice paper, and the methodology they've developed is
described very clearly. However, their SMIRKS rules are (by their own
admission) somewhat unrealistic for real-life use:

"The price for our comprehensive approach is that we may, in some
cases, tautomerically equate structures with each other that have such
a high energy barrier for interconversion that they are in reality
separate, stable compounds that do not interconvert even long-term."

More importantly, by adopting such a broad definition of tautomers,
they end up with a sort of combinatorial explosion of results. While
this is interesting from a research point of view (their results are
very impressive), in a real cheminformatics system this is probably
too expensive and only gives marginally better results than a more
restricted set of SMIRKS.

If OpenBabel adopts this approach, we should provide a way for the
user to select which SMIRKS to use (maybe a user-editable data file
with the SMIRKS).

Jörg Kurt Wegner wrote:
> In other words, as defined in the SMIRKS and ranking rules, we need
> just a recursive execution, store the unique canonical SMILES, rank
> them, and take the highest scoring as tautomeric SMILES.

The one aspect of the algorithms described by Sitzmann et al. that I
didn't care for is their overall scoring. It's a nice technique and
probably works well (in the sense that it produces a reliable
canonical tautomer), but it ignores the fact that most tautomers have
a preferred form (perhaps it's the most "real" form, or the most
stable at normal pH and temperature, or perhaps it's just an aesthetic
choice). Sitzmann's rules (see Table 2:
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2886898/table/Tab2/) don't
seem to take this into account.
They are based on the overall properties of a molecule and don't even
consider the specific SMIRKS that actually matched the molecule.

Suppose instead that you defined the right-hand side of each SMIRKS as
a "preferred" form. You could then start the ranking process by simply
counting how many preferred forms each tautomer contains. My guess is
that in most cases this would immediately point to the preferred
tautomer. But there would still be "ties" that need to be broken.

Another thing that bothers me about Sitzmann's method is that after
the rules (from Table 2) are applied, any remaining ambiguity is
resolved arbitrarily:

"If more than one tautomer gets the maximum scoring, the tautomer with
the largest hash code value is, quite arbitrarily from a structural
point of view, selected as the canonical tautomer form."

This certainly works, but it makes the result heavily dependent on the
specific hash algorithm. It seems to me that a better approach would
be to base the selection on the actual SMILES. For example, from the
tautomers with equal scores, one could simply sort the canonical
SMILES lexically and select the first one from the sorted list. If we
ever get to the point of writing a paper about how tautomers are
normalized in OpenBabel, this would be much easier to document.

Craig

_______________________________________________
OpenBabel-Devel mailing list
OpenBabel-Devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/openbabel-devel
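
A minimal sketch of the ranking idea described above: count the
"preferred" forms in each tautomer, keep the top scorers, and break
ties by sorting the canonical SMILES lexically. It is only an
illustration, not OpenBabel code. The function name
pick_canonical_tautomer is hypothetical, the counts are assumed to
have been produced already by matching the right-hand sides of the
SMIRKS, and the SMILES strings and counts in the example are made up
and are not necessarily what OpenBabel would emit as canonical SMILES.

```python
from typing import Dict


def pick_canonical_tautomer(preferred_counts: Dict[str, int]) -> str:
    """Choose one tautomer from {canonical SMILES: preferred-form count}.

    Hypothetical helper: the counts are assumed to come from matching
    the "preferred" (right-hand) side of each SMIRKS rule elsewhere.
    """
    if not preferred_counts:
        raise ValueError("no tautomers supplied")

    # Step 1: the highest number of preferred forms wins.
    best = max(preferred_counts.values())
    tied = [smi for smi, n in preferred_counts.items() if n == best]

    # Step 2: break ties by plain lexical order of the SMILES string.
    # Python compares str values by code point, so the result does not
    # depend on locale or on any hash function.
    return sorted(tied)[0]


if __name__ == "__main__":
    # Illustrative input only; the counts are invented for the example.
    tautomers = {"Oc1ccccn1": 1, "O=c1cccc[nH]1": 1}
    print(pick_canonical_tautomer(tautomers))
```

Because the tie-break depends only on the SMILES text, the choice is
reproducible as long as the canonical SMILES themselves are stable,
which is the dependency that would need to be documented.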