Re: [Rdkit-discuss] Lynch-Willett algorithm

Mikolaj Kowalik Tue, 31 Dec 2013 17:20:39 -0800

On Tue, 31 Dec 2013 06:58:12 +0100
Greg Landrum <greg.land...@gmail.com> wrote:


> > LW algorithm is based on Morgan algorithm:
> >
> > 1) Initially, each atom is assigned an index i.e. an integer $V^{1}_{i}$
> > derived from its type and bond pattern (I know, it's rather vague but I'm
> > practically quoting the sources. Using original Morgan approach, node
> > connectivity, will do.)
> >
> > 2) Higher order indices
> > \[
> >     V^{n}_{i} = 2 * V^{n - 1}_{i} + \sum_{k} V^{n-1}_{k}
> > \]
> > where summation is over all atoms adjacent to atom $i$, are calculated
> > simultaneously for reactant and product until there is NO such pair for
> > which $V^{n}_{r} = V^{n}_{p}$ ($r$ enumerates atoms of the reactant, $p$
> > atoms of the product).
> >
> > 3) All atom pairs, for which $V^{n - 1}_{r} = V^{n - 1}_{p}$ are
> > considered to be in centers of identical circular substructures of radius
> > $(n - 2)$ bonds. Atoms belonging to those substructures are deleted from
> > the reactant and product.
> >
> > 4) The above process is repeated until all substructures common to the
> > reactant and product are removed.
> >
> > The algorithm's idea seems to be relatively straightforward, but I am
> > stuck on the first example from Ref. [1]. Authors consider a reaction
> > which can be written as
> >
> >     CC(=O)CC(C)C(CC#N)C(=O)N>>CC(=O)CC(C)C(CC#N)C#N
> >
> > The goal is to identify the change C(=O)N>>C#N.
> >
> > After applying steps 1) through 3) for the first time we end up with an
> > intermediate reaction diagram
> >
> >     CC#N.C(=O)N>>CC#N.C#N
> >
> > No problem here, my current implementation of LW algorithm arrives to this
> > point as expected. But according to step 4), we should 'rinse and repeat'
> > to eliminate the remaining common fragment 'CC#N'.
> >
> > And here is exactly my problem. I do not see how the algorithm's stopping
> > condition can be reached if there are two identical, yet disconnected
> > fragments. As I see it, whatever initial values indices we assign to the
> > remaining atoms, the respective atoms on common fragment 'CC#N' will have
> > *always* the same indices if we use the increment formula above. Thus,
> > algorithm will enter an infinite loop, and actually does.
> 
> If you "rinse" by re-calculating the atom invariants *based on the new bond
> patterns*, I think the algorithm should, during the second pass, identify
> that the two CC#N fragments are identical and remove them from the diagram.
> This should leave you with the intended result.

I think I do exactly as you wrote. I remove relevant atoms and reset initial
values of indices based on their current bond pattern (step 1) and continue
calculating higher order indices as described in step 2, see table below,
where, after Funatsu et al. Tetrahedron Computer Methodology, 1, 53--69 (1988)
\[
    V^{1}_{i} = 10 * \text{atomic number} + \text{number of adjacent bonds}
\]
instead of just number of adjacent bonds as in Morgan algorithm.

n |    C    C    N    O    C    N |    C    C    N    C    N | k
----------------------------------------------------------------
1 |   61   62   71   81   62   71 |   61   62   71   61   71 | 4
2 |  184  256  204  224  276  204 |  184  256  204  193  335 | 7
3 |  624  900  664  724  980  684 |  624  900  664  721 1391 | 8
4 | 2148 3088 2228 2428 3368 2348 | 2148 3088 2228 2833 5615 | 8

But still, unless the disjoint fragments starts to "see" each other
somehow, I don't know how the algorithm's stopping condition can be reached
as atoms belonging to CC#N fragments will have the same indices for any $n$.


> It may be that you're expressing step 2 incorrectly. There's a compact
> explanation of the algorithm on page 564 of the review from Chen et. al
> that has the key stopping point that you may be missing in step 6: if the
> number of unique atom classes does not increase from one iteration to the
> next, terminate the iterations.

As far as I understand the explanation on page 564 describes the original
*Morgan* algorithm (the heading may be misleading). The actual Lynch-Willett
procedure is described on the next page and this requirement is not mentioned
at all, at least explicitly.

Though, this approach (checking either LW *or* Morgan stopping condition)
could work in this case. The last column in the table above indicates the
total number of unique indices. So after stopping at $n = 3$, we would have
the match radius $r \equiv n - 2 = 1$ bond and established equivalences
between all atoms of the CC#N fragments.

However, I'm afraid it may be a stretch. Unless I'm misinterpreting
something, the original paper by Lynch and Willett indicates that after
second "iteration" match radius is equal 3 and we should get the equivalence
(equal values of indices) only between nitrogen atoms of CC#N fragments. The
first "iteration" proceeds exactly "by the book", so those discrepancies in
the second are bit concerning.

Anyway, I really appreciate your input Greg and if you or anyone else has any
other suggestions I will be happy to hear them. 

Best,

-- 
Mikołaj Kowalik

------------------------------------------------------------------------------
Rapidly troubleshoot problems before they affect your business. Most IT 
organizations don't have a clear picture of how application performance 
affects their revenue. With AppDynamics, you get 100% visibility into your 
Java,.NET, & PHP application. Start your 15-day FREE TRIAL of AppDynamics Pro!
http://pubads.g.doubleclick.net/gampad/clk?id=84349831&iu=/4140/ostg.clktrk
_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss

Re: [Rdkit-discuss] Lynch-Willett algorithm

Reply via email to