Re: [Rdkit-discuss] aromatic bonds and graph edit distance

2019-08-21 Thread Francois Berenger

On 21/08/2019 17:34, Andrew Dalke wrote:

On Aug 21, 2019, at 03:42, Francois Berenger  wrote:

Unless rdkit has something, I think graph edit distance is the kind
of things for which you have to rely on a good graph library.


Do you know of any (non-chemical) graph library which can handle edits
involving the breaking of aromatic bonds in a chemically correct way?
I do not.

Also, maybe the string edit distance between the two canonical smiles 
is a good enough proxy.


This attempt of mine now, to experiment with graph edit distance, came
out of a conversation I had last week with someone using string edit
distance. I expressed doubt on how "good" the "good enough" was, but
was unable to give any concrete details.

I earlier wrote:

For chain bonds, and non-aromatic bonds, it's easy to delete the bond
and add the correct number of hydrogens to either side.


Similarly, for many chain edits, the string edit distance is a decent
proxy, as you say.

However, has the goodness ever been characterized? Along with a
description of how to minimize the problems with string edit distance?
Some of the obvious ones are:

1) Chirality and stereochemistry

L-alanine and D-alanine have a graph edit distance to alanine with
unspecified chirality are 4 and 5, respectively.

  N[C@H](C)C(=O)O
  N[C@@H](C)C(=O)O
  NC(C)C(=O)O

This does not seem reasonable. A similar issue occurs with double bond
sterochemistry, like F/C=C/F vs. FC=CF.

2) Isotopes

Same issue: CN vs. [14CH3]N.

3) Overlapping element symbols

c1c1C and c1c1Cl have an edit distance of 1
c1c1C and c1c1Br have an edit distance of 2

There is no chemical sense for those to have different distances.

I can think of ways to mitigate some of the effects of #1-3.


If you want to push this hack further, it seems that some string
tokenization would be useful. Then the string edit distance is run
on lists of tokens instead of the original strings (maybe that's what 
you

call a substitution matrix).


In
particularly, a substitution matrix (or conversion to pharmacophore
reduced graphs) can improve #3.

4) Sensitivity to canonicalization order

Depending on the canonicalization method, the following two structures
either have a string edit distance of 1 or 4, while the graph edit
distance is 1.


Chem.CanonSmiles("PCCN")

'NCCP'

Chem.CanonSmiles("CCN")

'CCN'


5) difficulty in handling ring formation in a meaningful way


Chem.CanonSmiles("C1=CC=CC=C1")

'c1c1'

Chem.CanonSmiles("C=CC=CC=C")

'C=CC=CC=C'

There are no shared string synbols, so the string edit distance is 9,
yet the bond edit distance is only 1.


Yes, hacks don't bring you very far, usually. :)

Regards,
F.


___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] aromatic bonds and graph edit distance

2019-08-21 Thread Greg Landrum
I don't really understand the details of what you're trying to do (and
don't currently have the time to dig into it anyway), but perhaps I can
help with at least part of this:

On Tue, Aug 20, 2019 at 1:08 PM Andrew Dalke 
wrote:

>
> Problem is, I don't know how to figure out if a given aromatic bond must
> be a "-" or "=", or can be both.
>

I think it's relatively easy to figure out if a particular aromatic bond
can only be a single bond; at least for most cases:

Here are the cases where the aromatic bonds from an atom must be single
bonds:
1) has an explicit double bond  - overs exocyclic bonds and bonds in
non-aromatic rings
2) is O, Se, or Te and charge ==0  - covers c1cocc1, c1c[se]ccc1,
c1c[oH+]ccc1 (not sure if that's actually useful)
3) is N or P and charge==0 and numExplicitHs=1.

This ignores weird cases like radicals, C-, [oH+] etc.

I think other aromatic bonds can be either single or double (unless they're
adjacent to one of these, in which case they have to be double, but that's
a "higher order effect").

Is that useful at all?
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] aromatic bonds and graph edit distance

2019-08-21 Thread Andrew Dalke
Hi Jameed,

  I don't think your approach will work, which means I likely didn't explain 
myself well enough.

Let's say I start with:

   Cc1cc2c2c(=O)o1 - 
https://cactus.nci.nih.gov/chemical/structure/Cc1cc2c2c(=O)o1/image

I want to break the aromatic bond between the aromatic 'c' (also double-bonded 
to the O) and the aromatic 'o'.

This is possible because a Kekule interpretation of that molecule is: 

>>> from rdkit import Chem
>>> mol = Chem.MolFromSmiles("Cc1cc2c2c(=O)o1")
>>> Chem.Kekulize(mol)
>>> Chem.MolToSmiles(mol)
'Cc1cc2c2c(=O)o1'
>>> Chem.MolToSmiles(mol, kekuleSmiles=True)
'CC1=CC2=CC=CC=C2C(=O)O1'

and I could break it by simply inserting a "." in the right spot, like this:

  CC1=CC2=CC=CC=C2C(=O).O1

>>> mol2 = Chem.MolFromSmiles("CC1=CC2=CC=CC=C2C(=O).O1")
>>> Chem.MolToSmiles(mol2)
'CC(O)=Cc1c1C=O'

The resulting structure still has an aromatic ring, but not all of the atoms 
which were flagged as aromatic are still aromatic.

With the approach you outlined, I don't know how to restored the modified 
structure so I can generate the final SMILES I expect.

The Kekulize() approach finds a valid Kekule interpretation, but there may be 
multiple interpretations, allowing both a single and double bond for a given 
bond position. I think I want to generate both possibilities.

So far the only solution I've come up with - and it's a real brute-force one - 
is to do complete enumeration, along the lines of the code after my signature.


% python get_kekule_bonds.py 'Cc1cc2c2c(=O)o1'
bondfragatoms   allowed
1   c:c 1,2 DOUBLE
2   c:c 2,3 SINGLE
3   c:c 3,4 SINGLE DOUBLE
4   c:c 4,5 SINGLE DOUBLE
5   c:c 5,6 SINGLE DOUBLE
6   c:c 6,7 SINGLE DOUBLE
7   c:c 7,8 SINGLE DOUBLE
8   c:c 8,9 SINGLE
10  c:o 9,11SINGLE
11  c:o 11,1SINGLE
12  c:c 8,3 SINGLE DOUBLE

(To be usable for what I want, I also need to figure out where those bond terms 
are in the Kekeule SMILES, so I can either replace it with a "." or remove the 
ring closure, then re-perceive.)

This can work for what I'm looking for, since I have <=14 heavies, but I'm 
hoping for something rather less crude.

Andrew
da...@dalkescientific.com


# List possible Kekule bond assignments for aromatic bonds in a SMILES
from rdkit import Chem
import itertools
import sys

def get_kekule_bonds(mol):
"""mol -> possible bond assignment table

Returns a table mapping bond index (for aromatic bonds) to
possible Kekule bond assignments for that bond.
"""
# Figure out which bonds are aromatic
aromatic_bond_indices = {}
for bond in mol.GetBonds():
if bond.GetIsAromatic():
a1, a2 = bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()
if a1 > a2:
a1, a2 = a2, a1
aromatic_bond_indices[(a1, a2)] = bond.GetIdx()

if not aromatic_bond_indices:
return {}

# Look for assignments which result in the same Kekeule SMILES
# (with aromatic bonds indicated as ':')
expected_kekule_smiles = Chem.MolToSmiles(mol, kekuleSmiles=True)

# Initialize the table of bond assignments
possible_bond_assignments = dict((idx, set()) for idx in 
aromatic_bond_indices.values())

# Figure out how to get from the canonical Kekule SMILES output order
# back to the moleule order.
s = mol.GetProp("_smilesAtomOutputOrder")
assert s[0] == "[" and s[-2:] == ",]", s
invert_order = dict(enumerate(map(int, s[1:-2].split(","

# Figure out the places where I can replace a ":" with a "-" or "="
parts = expected_kekule_smiles.split(":")
N = len(parts)-1
assert N == len(aromatic_bond_indices)
terms = [None] * (2*N+1)
terms[::2] = parts

# Generate all possible bond assignments (generates 2**N assignments)
for bond_assignment in itertools.product( *(("-=",)*N) ):
# Replace the places where the ":" used to be (or rather, the previous 
assignment)
# and interpret as a new SMILES
terms[1::2] = bond_assignment
new_kekule_smiles = "".join(terms)
new_mol = Chem.MolFromSmiles(new_kekule_smiles)
if new_mol is None:
continue

# Was there a match?
canonical_kekule_smiles = Chem.MolToSmiles(new_mol, kekuleSmiles=True)
if canonical_kekule_smiles != expected_kekule_smiles:
continue

# Yes! That means we can figure out what the actual bond assignments
# were. Re-process the Kekule SMILES, but don't do aromaticity 
perception.
raw_mol = Chem.MolFromSmiles(new_kekule_smiles, sanitize=False)
for bond in raw_mol.GetBonds():
# Need to map back to the original bonds
original_a1 = invert_order[bond.GetBeginAtomIdx()]
original_a2 = 

Re: [Rdkit-discuss] aromatic bonds and graph edit distance

2019-08-21 Thread Andrew Dalke


On Aug 21, 2019, at 03:42, Francois Berenger  wrote:
> Unless rdkit has something, I think graph edit distance is the kind
> of things for which you have to rely on a good graph library.

Do you know of any (non-chemical) graph library which can handle edits 
involving the breaking of aromatic bonds in a chemically correct way? I do not.

> Also, maybe the string edit distance between the two canonical smiles is a 
> good enough proxy.

This attempt of mine now, to experiment with graph edit distance, came out of a 
conversation I had last week with someone using string edit distance. I 
expressed doubt on how "good" the "good enough" was, but was unable to give any 
concrete details.

I earlier wrote:
>> For chain bonds, and non-aromatic bonds, it's easy to delete the bond
>> and add the correct number of hydrogens to either side.

Similarly, for many chain edits, the string edit distance is a decent proxy, as 
you say.

However, has the goodness ever been characterized? Along with a description of 
how to minimize the problems with string edit distance? Some of the obvious 
ones are:

1) Chirality and stereochemistry

L-alanine and D-alanine have a graph edit distance to alanine with unspecified 
chirality are 4 and 5, respectively. 

  N[C@H](C)C(=O)O
  N[C@@H](C)C(=O)O
  NC(C)C(=O)O

This does not seem reasonable. A similar issue occurs with double bond 
sterochemistry, like F/C=C/F vs. FC=CF.

2) Isotopes

Same issue: CN vs. [14CH3]N.

3) Overlapping element symbols

c1c1C and c1c1Cl have an edit distance of 1
c1c1C and c1c1Br have an edit distance of 2

There is no chemical sense for those to have different distances.

I can think of ways to mitigate some of the effects of #1-3. In particularly, a 
substitution matrix (or conversion to pharmacophore reduced graphs) can improve 
#3.

4) Sensitivity to canonicalization order

Depending on the canonicalization method, the following two structures either 
have a string edit distance of 1 or 4, while the graph edit distance is 1.

>>> Chem.CanonSmiles("PCCN")
'NCCP'
>>> Chem.CanonSmiles("CCN")
'CCN'


5) difficulty in handling ring formation in a meaningful way

>>> Chem.CanonSmiles("C1=CC=CC=C1")
'c1c1'
>>> Chem.CanonSmiles("C=CC=CC=C")
'C=CC=CC=C'

There are no shared string synbols, so the string edit distance is 9, yet the 
bond edit distance is only 1.

It is this last issue that I am particularly concerned with, leading me to ask 
about how to handle aromatic bonds when computing the graph edit distance.


Andrew
da...@dalkescientific.com




___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] aromatic bonds and graph edit distance

2019-08-21 Thread Jameed Hussain
Hi Andrew,

You could substitute the aromatic atoms into non-organic atoms (eg. aromatic 
carbon to calcium [Ca], aromatic nitrogen to sodium [Na] and so on with single 
bonds between them), and use the atom deleting procedure as normal. The 
structures should still be valid so should be able to use the usual rdkit 
functions on them.

Thanks
Jameed



From: Andrew Dalke 
Sent: 20 August 2019 21:06
To: RDKit Discuss 
Subject: [Rdkit-discuss] aromatic bonds and graph edit distance

Hi all,

  Someone asked me recently about finding the graph edit distance of two small 
(<= 14 atom) fragments.

I figured this was something that could be brute forced. Following SmallWorld's 
example at https://cisrg.shef.ac.uk/shef2016/talks/oral13.pdf , given a 
fragment, incrementally delete terminals (except the "*" connection point 
atom), and ring bonds.

For chain bonds, and non-aromatic bonds, it's easy to delete the bond and add 
the correct number of hydrogens to either side.

But, what should I do when I cut an aromatic bond?

For something like the first "co" in "c1cocn1", I want the result to be 
C=CN=CO. That's because the "o" can only be "-O-" in Kekule form.

For something like "c1cnncn1", breaking on the "nn", I think I would like to 
get both 'N=CC=NC=N' and 'NC=CN=CN' because the "nn" can be a single or a 
double bond, depending on the Kekule representation, as in:

>>> Chem.CanonSmiles("C-1=N-N=C-C=N-1")
'c1cnncn1'
>>> Chem.CanonSmiles("C-1=N.N=C-C=N-1")
'N=CC=NC=N'

>>> Chem.CanonSmiles("C=1-N=N-C=C-N=1")
'c1cnncn1'
>>> Chem.CanonSmiles("C=1-N-[HH].[HH]N-C=C-N=1")
'NC=CN=CN'

Problem is, I don't know how to figure out if a given aromatic bond must be a 
"-" or "=", or can be both.

(Well, I could brute-force enumerae all 2**n possible aromatic bond 
assignments, then canonicalize, and see if both assignments are possible for a 
given bond.)

As a non-chemist, I also ask if I'm even on a chemically meaningful track.


Andrew
da...@dalkescientific.com




___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss

Disclaimer

This electronic mail and its attachments are intended solely for the person(s) 
to whom they are addressed and contain information which is  confidential or 
otherwise protected from disclosure, except for the purpose for which they are 
intended. Dissemination, distribution, or reproduction by anyone other than the 
intended recipients is prohibited and may be illegal. If you are not an 
intended recipient, please immediately inform the sender and return the 
electronic mail and its attachments and destroy any copies which may be in your 
possession. Dotmatics Limited screens electronic mails for viruses but does not 
warrant that this electronic mail is free of any viruses. Dotmatics Limited 
accepts no liability for any damage caused by any virus transmitted by this 
electronic mail. Dotmatics Limited is  registered in England & Wales No. 
5614524 with offices at The Old Monastery,  Windhill, Bishops Stortford, Herts, 
CM23 2ND, UK.
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] aromatic bonds and graph edit distance

2019-08-20 Thread Francois Berenger

On 21/08/2019 05:06, Andrew Dalke wrote:

Hi all,

  Someone asked me recently about finding the graph edit distance of
two small (<= 14 atom) fragments.

I figured this was something that could be brute forced. Following
SmallWorld's example at
https://cisrg.shef.ac.uk/shef2016/talks/oral13.pdf , given a fragment,
incrementally delete terminals (except the "*" connection point atom),
and ring bonds.


Unless rdkit has something, I think graph edit distance is the kind
of things for which you have to rely on a good graph library.

Also, maybe the string edit distance between the two canonical smiles is 
a good enough proxy.



For chain bonds, and non-aromatic bonds, it's easy to delete the bond
and add the correct number of hydrogens to either side.

But, what should I do when I cut an aromatic bond?

For something like the first "co" in "c1cocn1", I want the result to
be C=CN=CO. That's because the "o" can only be "-O-" in Kekule form.

For something like "c1cnncn1", breaking on the "nn", I think I would
like to get both 'N=CC=NC=N' and 'NC=CN=CN' because the "nn" can be a
single or a double bond, depending on the Kekule representation, as
in:


Chem.CanonSmiles("C-1=N-N=C-C=N-1")

'c1cnncn1'

Chem.CanonSmiles("C-1=N.N=C-C=N-1")

'N=CC=NC=N'


Chem.CanonSmiles("C=1-N=N-C=C-N=1")

'c1cnncn1'

Chem.CanonSmiles("C=1-N-[HH].[HH]N-C=C-N=1")

'NC=CN=CN'

Problem is, I don't know how to figure out if a given aromatic bond
must be a "-" or "=", or can be both.

(Well, I could brute-force enumerae all 2**n possible aromatic bond
assignments, then canonicalize, and see if both assignments are
possible for a given bond.)

As a non-chemist, I also ask if I'm even on a chemically meaningful 
track.



Andrew
da...@dalkescientific.com




___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss



___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


[Rdkit-discuss] aromatic bonds and graph edit distance

2019-08-20 Thread Andrew Dalke
Hi all,

  Someone asked me recently about finding the graph edit distance of two small 
(<= 14 atom) fragments.

I figured this was something that could be brute forced. Following SmallWorld's 
example at https://cisrg.shef.ac.uk/shef2016/talks/oral13.pdf , given a 
fragment, incrementally delete terminals (except the "*" connection point 
atom), and ring bonds.

For chain bonds, and non-aromatic bonds, it's easy to delete the bond and add 
the correct number of hydrogens to either side.

But, what should I do when I cut an aromatic bond?

For something like the first "co" in "c1cocn1", I want the result to be 
C=CN=CO. That's because the "o" can only be "-O-" in Kekule form.

For something like "c1cnncn1", breaking on the "nn", I think I would like to 
get both 'N=CC=NC=N' and 'NC=CN=CN' because the "nn" can be a single or a 
double bond, depending on the Kekule representation, as in:

>>> Chem.CanonSmiles("C-1=N-N=C-C=N-1")
'c1cnncn1'
>>> Chem.CanonSmiles("C-1=N.N=C-C=N-1")
'N=CC=NC=N'

>>> Chem.CanonSmiles("C=1-N=N-C=C-N=1")
'c1cnncn1'
>>> Chem.CanonSmiles("C=1-N-[HH].[HH]N-C=C-N=1")
'NC=CN=CN'

Problem is, I don't know how to figure out if a given aromatic bond must be a 
"-" or "=", or can be both.

(Well, I could brute-force enumerae all 2**n possible aromatic bond 
assignments, then canonicalize, and see if both assignments are possible for a 
given bond.)

As a non-chemist, I also ask if I'm even on a chemically meaningful track.


Andrew
da...@dalkescientific.com




___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss