On 10/12/11 7:53 AM, Chakravarthy Marella wrote:
> Hello,
>
> I am working on the problem of comparing SMILES strings based on
> alignment. In my research, I came across the following problem:
>
> Let's say there are two SMILES strings,
>
> SMILES 1: CCCCCCCCCCCCCC(=O)NC(C(CCCCCCCCCCC)O)CO
> SMILES 2: NC(C(CCCCCCCCCCC)O)CO
>
> and we want to see how similar those two SMILES strings are. If they
> are similar, is there any fragment (or substructure) that is common?
>
> First, I removed redundancy by converting the above SMILES to unique
> SMILES, using OB's canonical SMILES algorithm:
>
> SMILES 1 (Unique): CCCCCCCCCCCCCC(=O)NC(C(CCCCCCCCCCC)O)CO
> SMILES 2 (Unique): CCCCCCCCCCCC(C(CO)N)O
>
> If the above SMILES are aligned, I get the following:
>
> CCCCCCCCCCCCCC(=O)NC(C(CCCCCCCCCCC-)--O)-CO----
> -----------------------CCCCCCCCCCCC-(C--(CO)N)O
>
> However, as you may notice, SMILES 2 is nothing but one half of SMILES 1.
> I will just add a few empty spaces before SMILES 2 to illustrate this:
>
> SMILES 1: CCCCCCCCCCCCCC(=O)NC(C(CCCCCCCCCCC)O)CO
> SMILES 2:                   NC(C(CCCCCCCCCCC)O)CO
>
> The reason the common fragment "NC(C(CCCCCCCCCCC)O)CO" does not find
> its place in the first alignment is the underlying canonicalization
> algorithm.

This is not a valid use of SMILES or canonical SMILES. It's purely a
coincidence that your original SMILES 2 is a substring of SMILES 1. There is
nothing in the canonicalizer that guarantees this.

In fact, it is theoretically impossible to ensure that the canonical SMILES of
two related structures will have common substrings. If that were possible, you
could use plain string matching on SMILES as a substructure search, which
would contradict the proven NP-completeness of subgraph isomorphism
(substructure searching). For more about this, see:

   http://www.emolecules.com/doc/cheminformatics-101-substructure-search.php
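
To make the point concrete, here is a minimal pure-Python sketch that compares
the molecules as graphs instead of as strings. It is only an illustration: the
parser handles a toy subset of SMILES (the atoms C, N and O, "=" bonds and
parenthesized branches, with no rings, aromaticity or bracket atoms), and the
names parse_simple_smiles and is_substructure are hypothetical helpers written
for this example, not part of Open Babel. The matcher is a naive backtracking
search, which is fine for small trees like these but is not how a real toolkit
does it.

```python
def parse_simple_smiles(smiles):
    """Parse a toy SMILES subset into (atoms, bonds).

    Assumption: only C/N/O atoms, '=' bonds and '(' ')' branches appear.
    bonds maps atom index -> {neighbor index: bond order}.
    """
    atoms, bonds = [], {}
    prev, order, stack = None, 1, []
    for ch in smiles:
        if ch == "(":
            stack.append(prev)          # remember the branch point
        elif ch == ")":
            prev = stack.pop()          # return to the branch point
        elif ch == "=":
            order = 2                   # next bond is a double bond
        elif ch in "CNO":
            idx = len(atoms)
            atoms.append(ch)
            bonds[idx] = {}
            if prev is not None:
                bonds[prev][idx] = order
                bonds[idx][prev] = order
            prev, order = idx, 1
        else:
            raise ValueError("unsupported SMILES character: %r" % ch)
    return atoms, bonds


def is_substructure(query, target):
    """Return True if the query graph can be embedded in the target graph.

    Atoms must have the same element and every query bond must map onto a
    target bond of the same order. Hydrogens are ignored.
    """
    q_atoms, q_bonds = query
    t_atoms, t_bonds = target
    mapping = {}                        # query atom index -> target atom index

    def extend(q):
        if q == len(q_atoms):
            return True                 # every query atom has been placed
        for t in range(len(t_atoms)):
            if t in mapping.values() or q_atoms[q] != t_atoms[t]:
                continue
            # Bonds to already-placed query atoms must exist in the target.
            if all(t_bonds[t].get(mapping[n]) == o
                   for n, o in q_bonds[q].items() if n in mapping):
                mapping[q] = t
                if extend(q + 1):
                    return True
                del mapping[q]          # undo and try the next candidate
        return False

    return extend(0)


# The three strings from the question.
SMILES_1 = "CCCCCCCCCCCCCC(=O)NC(C(CCCCCCCCCCC)O)CO"
SMILES_2_INPUT = "NC(C(CCCCCCCCCCC)O)CO"
SMILES_2_CANONICAL = "CCCCCCCCCCCC(C(CO)N)O"

mol_1 = parse_simple_smiles(SMILES_1)
for text in (SMILES_2_INPUT, SMILES_2_CANONICAL):
    print(text,
          "| substring:", text in SMILES_1,
          "| substructure:", is_substructure(parse_simple_smiles(text), mol_1))
```

Both spellings of SMILES 2 describe the same graph, so the graph match
succeeds for both, while the substring test succeeds only for the spelling
that happens to share an atom order with SMILES 1. In practice you would use
the toolkit's own substructure matcher rather than a hand-rolled one.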

> I would like to know if there is a way to generate SMILES
> (programmatically) such that, for any given pair of SMILES strings,
> common fragments find their place again after alignment?
>
> In other words, I am looking for a SMILES generator algorithm that
> always returns "NC(C(CCCCCCCCCCC)O)CO" instead of "CCCCCCCCCCCC(C(CO)N)O"
> in the above example.
>
> Any suggestions on how to do this, or pointers to previous work, would be
> gratefully acknowledged.

It can't be done. Using NP-completeness theory, you can demonstrate that it's
mathematically impossible.

See Noel's previous reply for a better approach to this problem.

Craig

>
> TIA
> Varthy
>
>
> ------------------------------------------------------------------------------
> All the data continuously generated in your IT infrastructure contains a
> definitive record of customers, application performance, security
> threats, fraudulent activity and more. Splunk takes this data and makes
> sense of it. Business sense. IT sense. Common sense.
> http://p.sf.net/sfu/splunk-d2d-oct
> _______________________________________________
> OpenBabel-discuss mailing list
> OpenBabel-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/openbabel-discuss
>


