On Tue, 31 Dec 2013 06:58:12 +0100 Greg Landrum <greg.land...@gmail.com> wrote:
> > LW algorithm is based on Morgan algorithm: > > > > 1) Initially, each atom is assigned an index i.e. an integer $V^{1}_{i}$ > > derived from its type and bond pattern (I know, it's rather vague but I'm > > practically quoting the sources. Using original Morgan approach, node > > connectivity, will do.) > > > > 2) Higher order indices > > \[ > > V^{n}_{i} = 2 * V^{n - 1}_{i} + \sum_{k} V^{n-1}_{k} > > \] > > where summation is over all atoms adjacent to atom $i$, are calculated > > simultaneously for reactant and product until there is NO such pair for > > which $V^{n}_{r} = V^{n}_{p}$ ($r$ enumerates atoms of the reactant, $p$ > > atoms of the product). > > > > 3) All atom pairs, for which $V^{n - 1}_{r} = V^{n - 1}_{p}$ are > > considered to be in centers of identical circular substructures of radius > > $(n - 2)$ bonds. Atoms belonging to those substructures are deleted from > > the reactant and product. > > > > 4) The above process is repeated until all substructures common to the > > reactant and product are removed. > > > > The algorithm's idea seems to be relatively straightforward, but I am > > stuck on the first example from Ref. [1]. Authors consider a reaction > > which can be written as > > > > CC(=O)CC(C)C(CC#N)C(=O)N>>CC(=O)CC(C)C(CC#N)C#N > > > > The goal is to identify the change C(=O)N>>C#N. > > > > After applying steps 1) through 3) for the first time we end up with an > > intermediate reaction diagram > > > > CC#N.C(=O)N>>CC#N.C#N > > > > No problem here, my current implementation of LW algorithm arrives to this > > point as expected. But according to step 4), we should 'rinse and repeat' > > to eliminate the remaining common fragment 'CC#N'. > > > > And here is exactly my problem. I do not see how the algorithm's stopping > > condition can be reached if there are two identical, yet disconnected > > fragments. As I see it, whatever initial values indices we assign to the > > remaining atoms, the respective atoms on common fragment 'CC#N' will have > > *always* the same indices if we use the increment formula above. Thus, > > algorithm will enter an infinite loop, and actually does. > > If you "rinse" by re-calculating the atom invariants *based on the new bond > patterns*, I think the algorithm should, during the second pass, identify > that the two CC#N fragments are identical and remove them from the diagram. > This should leave you with the intended result. I think I do exactly as you wrote. I remove relevant atoms and reset initial values of indices based on their current bond pattern (step 1) and continue calculating higher order indices as described in step 2, see table below, where, after Funatsu et al. Tetrahedron Computer Methodology, 1, 53--69 (1988) \[ V^{1}_{i} = 10 * \text{atomic number} + \text{number of adjacent bonds} \] instead of just number of adjacent bonds as in Morgan algorithm. n | C C N O C N | C C N C N | k ---------------------------------------------------------------- 1 | 61 62 71 81 62 71 | 61 62 71 61 71 | 4 2 | 184 256 204 224 276 204 | 184 256 204 193 335 | 7 3 | 624 900 664 724 980 684 | 624 900 664 721 1391 | 8 4 | 2148 3088 2228 2428 3368 2348 | 2148 3088 2228 2833 5615 | 8 But still, unless the disjoint fragments starts to "see" each other somehow, I don't know how the algorithm's stopping condition can be reached as atoms belonging to CC#N fragments will have the same indices for any $n$. > It may be that you're expressing step 2 incorrectly. There's a compact > explanation of the algorithm on page 564 of the review from Chen et. al > that has the key stopping point that you may be missing in step 6: if the > number of unique atom classes does not increase from one iteration to the > next, terminate the iterations. As far as I understand the explanation on page 564 describes the original *Morgan* algorithm (the heading may be misleading). The actual Lynch-Willett procedure is described on the next page and this requirement is not mentioned at all, at least explicitly. Though, this approach (checking either LW *or* Morgan stopping condition) could work in this case. The last column in the table above indicates the total number of unique indices. So after stopping at $n = 3$, we would have the match radius $r \equiv n - 2 = 1$ bond and established equivalences between all atoms of the CC#N fragments. However, I'm afraid it may be a stretch. Unless I'm misinterpreting something, the original paper by Lynch and Willett indicates that after second "iteration" match radius is equal 3 and we should get the equivalence (equal values of indices) only between nitrogen atoms of CC#N fragments. The first "iteration" proceeds exactly "by the book", so those discrepancies in the second are bit concerning. Anyway, I really appreciate your input Greg and if you or anyone else has any other suggestions I will be happy to hear them. Best, -- MikoĊaj Kowalik ------------------------------------------------------------------------------ Rapidly troubleshoot problems before they affect your business. Most IT organizations don't have a clear picture of how application performance affects their revenue. With AppDynamics, you get 100% visibility into your Java,.NET, & PHP application. Start your 15-day FREE TRIAL of AppDynamics Pro! http://pubads.g.doubleclick.net/gampad/clk?id=84349831&iu=/4140/ostg.clktrk _______________________________________________ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss