On 21/08/2019 17:34, Andrew Dalke wrote:
On Aug 21, 2019, at 03:42, Francois Berenger <mli...@ligand.eu> wrote:
Unless rdkit has something, I think graph edit distance is the kind
of thing for which you have to rely on a good graph library.

Do you know of any (non-chemical) graph library which can handle edits
involving the breaking of aromatic bonds in a chemically correct way?
I do not.

Also, maybe the string edit distance between the two canonical smiles is a good enough proxy.

This attempt of mine now, to experiment with graph edit distance, came
out of a conversation I had last week with someone using string edit
distance. I expressed doubt on how "good" the "good enough" was, but
was unable to give any concrete details.

I earlier wrote:
For chain bonds, and non-aromatic bonds, it's easy to delete the bond
and add the correct number of hydrogens to either side.

Similarly, for many chain edits, the string edit distance is a decent
proxy, as you say.

However, has the goodness ever been characterized? Along with a
description of how to minimize the problems with string edit distance?
Some of the obvious ones are:

1) Chirality and stereochemistry

L-alanine and D-alanine have string edit distances of 4 and 5,
respectively, to alanine with unspecified chirality.

  N[C@H](C)C(=O)O
  N[C@@H](C)C(=O)O
  NC(C)C(=O)O

This does not seem reasonable. A similar issue occurs with double bond
stereochemistry, like F/C=C/F vs. FC=CF.

2) Isotopes

Same issue: CN vs. [14CH3]N.

3) Overlapping element symbols

c1ccccc1C and c1ccccc1Cl have an edit distance of 1
c1ccccc1C and c1ccccc1Br have an edit distance of 2

There is no chemical sense for those to have different distances.
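
The numbers in #1-3 can be reproduced with a plain character-level
Levenshtein distance. The following is only a minimal sketch: the
function name `levenshtein` is mine, unit costs for insert, delete and
substitute are an assumption, and the SMILES strings are taken as
written above rather than re-canonicalized.

```python
def levenshtein(a, b):
    """Edit distance with unit-cost insert, delete and substitute."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x from a
                curr[j - 1] + 1,         # insert y into a
                prev[j - 1] + (x != y),  # substitute x by y, or match
            ))
        prev = curr
    return prev[-1]


pairs = [
    ("NC(C)C(=O)O", "N[C@H](C)C(=O)O"),   # unspecified vs. @ chirality
    ("NC(C)C(=O)O", "N[C@@H](C)C(=O)O"),  # unspecified vs. @@ chirality
    ("FC=CF", "F/C=C/F"),                 # double bond stereochemistry
    ("CN", "[14CH3]N"),                   # isotope label
    ("c1ccccc1C", "c1ccccc1Cl"),          # C vs. Cl
    ("c1ccccc1C", "c1ccccc1Br"),          # C vs. Br
]
for a, b in pairs:
    print(a, b, levenshtein(a, b))
```

The distances track how many characters the SMILES syntax needs for a
feature, not how large the chemical change is.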

I can think of ways to mitigate some of the effects of #1-3.

If you want to push this hack further, it seems that some string
tokenization would be useful. Then the string edit distance is run
on lists of tokens instead of the original strings (maybe that's what you
call a substitution matrix).

In particular, a substitution matrix (or conversion to pharmacophore
reduced graphs) can improve #3.
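
As a rough illustration of running the edit distance on tokens rather
than characters, here is a minimal sketch. The token pattern is a
simplified assumption of mine (bracket atoms, Cl, Br and %nn ring
closures as single tokens, everything else one character per token),
not a full SMILES grammar, and the names `tokenize` and
`token_edit_distance` are hypothetical. All edits cost 1; a real
substitution matrix would replace the `x != y` term with a lookup.

```python
import re

# Assumed, simplified SMILES token pattern: a bracket atom, a two-letter
# halogen, a two-digit ring closure, or any other single character.
TOKEN_RE = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def tokenize(smiles):
    """Split a SMILES string into a list of tokens."""
    return TOKEN_RE.findall(smiles)


def token_edit_distance(smiles_a, smiles_b):
    """Unit-cost Levenshtein distance computed over token lists."""
    a, b = tokenize(smiles_a), tokenize(smiles_b)
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete token x
                curr[j - 1] + 1,         # insert token y
                prev[j - 1] + (x != y),  # substitute, or match if equal
            ))
        prev = curr
    return prev[-1]


print(tokenize("c1ccccc1Cl"))
print(token_edit_distance("c1ccccc1C", "c1ccccc1Cl"))
print(token_edit_distance("c1ccccc1C", "c1ccccc1Br"))
print(token_edit_distance("NC(C)C(=O)O", "N[C@@H](C)C(=O)O"))
```

With this tokenization the Cl and Br cases both become a single
substitution, and a bracket atom such as [C@@H] counts as one token, so
the chirality examples from #1 also drop to a distance of 1 (though not
to 0).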

4) Sensitivity to canonicalization order

Depending on the canonicalization method, the following two structures
either have a string edit distance of 1 or 4, while the graph edit
distance is 1.

Chem.CanonSmiles("PCCN")
'NCCP'
Chem.CanonSmiles("CCN")
'CCN'


5) Difficulty in handling ring formation in a meaningful way

Chem.CanonSmiles("C1=CC=CC=C1")
'c1ccccc1'
Chem.CanonSmiles("C=CC=CC=C")
'C=CC=CC=C'

There are no shared string symbols, so the string edit distance is 9,
yet the bond edit distance is only 1.

Yes, hacks don't bring you very far, usually. :)

Regards,
F.


_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
