On 10/12/11 7:53 AM, Chakravarthy Marella wrote:
> Hello,
>
> I am working on the problem of comparing SMILES strings based on
> alignment. In my research, I came across the following problem:
>
> Let's say there are two SMILES strings,
>
> SMILES 1: CCCCCCCCCCCCCC(=O)NC(C(CCCCCCCCCCC)O)CO
> SMILES 2: NC(C(CCCCCCCCCCC)O)CO
>
> and we want to see how similar those two SMILES strings are. If they
> are similar, is there any fragment (or sub-structure) that is common?
>
> First, I removed redundancy by converting the above SMILES to unique
> SMILES, using OB's canonical SMILES algorithm:
>
> SMILES 1 (Unique): CCCCCCCCCCCCCC(=O)NC(C(CCCCCCCCCCC)O)CO
> SMILES 2 (Unique): CCCCCCCCCCCC(C(CO)N)O
>
> If the above SMILES are aligned, I get the following:
>
> CCCCCCCCCCCCCC(=O)NC(C(CCCCCCCCCCC-)--O)-CO----
> -----------------------CCCCCCCCCCCC-(C--(CO)N)O
>
> However, as you may notice, SMILES 2 is nothing but one half of SMILES 1.
> I will just add a few empty spaces before SMILES 2 to illustrate this:
>
> SMILES 1: CCCCCCCCCCCCCC(=O)NC(C(CCCCCCCCCCC)O)CO
> SMILES 2:                   NC(C(CCCCCCCCCCC)O)CO
>
> The reason why the common fragment "NC(C(CCCCCCCCCCC)O)CO" does not find
> its place in the first alignment is the underlying canonical
> algorithm.

This is not a valid use of SMILES or canonical SMILES. It's purely a
coincidence that one SMILES is a substring of the other. There is
nothing in the canonicalizer that guarantees this. In fact, it is
theoretically impossible to ensure that the canonical SMILES of two
related structures will have common substrings. If that were possible,
you could use plain string matching on SMILES as a substructure search,
which would give a fast solution to subgraph isomorphism (substructure
searching), a problem proven to be NP-complete. For more about this, see:

  http://www.emolecules.com/doc/cheminformatics-101-substructure-search.php

> I would like to know if there is a way to generate SMILES
> (programmatically), such that, for any given pair of SMILES strings,
> common fragments find back their place after alignment?
>
> In other words, I am looking for a SMILES generator algorithm, which
> always returns "NC(C(CCCCCCCCCCC)O)CO" instead of "CCCCCCCCCCCC(C(CO)N)O"
> in the above example?
>
> Any suggestions on how to do this, or pointers to previous work, would be
> gratefully acknowledged.

It can't be done. Using NP-completeness theory, you can demonstrate that
it's mathematically impossible.

See Noel's previous reply for a better approach to this problem.

Craig

>
> TIA
> Varthy
>
>
> ------------------------------------------------------------------------------
> All the data continuously generated in your IT infrastructure contains a
> definitive record of customers, application performance, security
> threats, fraudulent activity and more. Splunk takes this data and makes
> sense of it. Business sense. IT sense. Common sense.
> http://p.sf.net/sfu/splunk-d2d-oct
> _______________________________________________
> OpenBabel-discuss mailing list
> OpenBabel-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/openbabel-discuss
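
To make the point of the reply concrete, here is a minimal Python sketch that contrasts substring matching on the SMILES text with matching on the molecular graph. It is only an illustration under stated assumptions: `parse_simple_smiles` and `has_substructure` are hypothetical helpers written for this example (they are not Open Babel functions, and this is not necessarily the approach from Noel's reply), the parser handles only the tiny SMILES subset used in this thread (atoms C, N, O, branches, single and double bonds, no rings), and hydrogen counts are ignored when matching.

```python
# SMILES strings taken from the thread above.
SMILES_1 = "CCCCCCCCCCCCCC(=O)NC(C(CCCCCCCCCCC)O)CO"
FRAGMENT_AS_WRITTEN = "NC(C(CCCCCCCCCCC)O)CO"
FRAGMENT_CANONICAL = "CCCCCCCCCCCC(C(CO)N)O"


def parse_simple_smiles(smiles):
    """Parse a tiny SMILES subset into (atoms, bonds).

    Assumption: only C/N/O atoms, '(' ')' branches and '-'/'=' bonds.
    Rings, charges, aromaticity and stereo are NOT supported.
    """
    atoms = []   # element symbol for each atom index
    bonds = {}   # frozenset({i, j}) -> bond order
    stack = []   # atoms to return to when a branch closes
    prev = None  # atom the next atom will be bonded to
    order = 1    # order of the next bond
    for ch in smiles:
        if ch in "CNO":
            idx = len(atoms)
            atoms.append(ch)
            if prev is not None:
                bonds[frozenset((prev, idx))] = order
            prev, order = idx, 1
        elif ch == "=":
            order = 2
        elif ch == "-":
            order = 1
        elif ch == "(":
            stack.append(prev)
        elif ch == ")":
            prev = stack.pop()
        else:
            raise ValueError(f"unsupported SMILES character: {ch!r}")
    return atoms, bonds


def has_substructure(fragment, molecule):
    """Backtracking subgraph match: elements and bond orders must agree."""
    f_atoms, f_bonds = fragment
    m_atoms, m_bonds = molecule
    # Neighbours of each fragment atom, with the required bond order.
    f_adj = [dict() for _ in f_atoms]
    for pair, order in f_bonds.items():
        a, b = tuple(pair)
        f_adj[a][b] = order
        f_adj[b][a] = order

    mapping = []  # mapping[k] = molecule atom assigned to fragment atom k

    def extend():
        k = len(mapping)
        if k == len(f_atoms):
            return True
        for cand, element in enumerate(m_atoms):
            if element != f_atoms[k] or cand in mapping:
                continue
            # Every bond to an already placed fragment atom must exist
            # in the molecule with the same order.
            if all(m_bonds.get(frozenset((mapping[j], cand))) == order
                   for j, order in f_adj[k].items() if j < k):
                mapping.append(cand)
                if extend():
                    return True
                mapping.pop()
        return False

    return extend()


if __name__ == "__main__":
    molecule = parse_simple_smiles(SMILES_1)
    for text in (FRAGMENT_AS_WRITTEN, FRAGMENT_CANONICAL):
        print(text,
              "| substring:", text in SMILES_1,
              "| graph match:", has_substructure(parse_simple_smiles(text), molecule))
```

Both spellings of the fragment describe the same graph, so the graph match succeeds for each, while the substring test succeeds only for the spelling that happens to line up with SMILES 1. The backtracking search is fine for small molecules like these but is exponential in the worst case, which is the cost the NP-completeness argument refers to.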