Re: [Rdkit-discuss] aromatic bonds and graph edit distance
On 21/08/2019 17:34, Andrew Dalke wrote: On Aug 21, 2019, at 03:42, Francois Berenger wrote: Unless rdkit has something, I think graph edit distance is the kind of things for which you have to rely on a good graph library. Do you know of any (non-chemical) graph library which can handle edits involving the breaking of aromatic bonds in a chemically correct way? I do not. Also, maybe the string edit distance between the two canonical smiles is a good enough proxy. This attempt of mine now, to experiment with graph edit distance, came out of a conversation I had last week with someone using string edit distance. I expressed doubt on how "good" the "good enough" was, but was unable to give any concrete details. I earlier wrote: For chain bonds, and non-aromatic bonds, it's easy to delete the bond and add the correct number of hydrogens to either side. Similarly, for many chain edits, the string edit distance is a decent proxy, as you say. However, has the goodness ever been characterized? Along with a description of how to minimize the problems with string edit distance? Some of the obvious ones are: 1) Chirality and stereochemistry L-alanine and D-alanine have a graph edit distance to alanine with unspecified chirality are 4 and 5, respectively. N[C@H](C)C(=O)O N[C@@H](C)C(=O)O NC(C)C(=O)O This does not seem reasonable. A similar issue occurs with double bond sterochemistry, like F/C=C/F vs. FC=CF. 2) Isotopes Same issue: CN vs. [14CH3]N. 3) Overlapping element symbols c1c1C and c1c1Cl have an edit distance of 1 c1c1C and c1c1Br have an edit distance of 2 There is no chemical sense for those to have different distances. I can think of ways to mitigate some of the effects of #1-3. If you want to push this hack further, it seems that some string tokenization would be useful. Then the string edit distance is run on lists of tokens instead of the original strings (maybe that's what you call a substitution matrix). In particularly, a substitution matrix (or conversion to pharmacophore reduced graphs) can improve #3. 4) Sensitivity to canonicalization order Depending on the canonicalization method, the following two structures either have a string edit distance of 1 or 4, while the graph edit distance is 1. Chem.CanonSmiles("PCCN") 'NCCP' Chem.CanonSmiles("CCN") 'CCN' 5) difficulty in handling ring formation in a meaningful way Chem.CanonSmiles("C1=CC=CC=C1") 'c1c1' Chem.CanonSmiles("C=CC=CC=C") 'C=CC=CC=C' There are no shared string synbols, so the string edit distance is 9, yet the bond edit distance is only 1. Yes, hacks don't bring you very far, usually. :) Regards, F. ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] aromatic bonds and graph edit distance
I don't really understand the details of what you're trying to do (and don't currently have the time to dig into it anyway), but perhaps I can help with at least part of this: On Tue, Aug 20, 2019 at 1:08 PM Andrew Dalke wrote: > > Problem is, I don't know how to figure out if a given aromatic bond must > be a "-" or "=", or can be both. > I think it's relatively easy to figure out if a particular aromatic bond can only be a single bond; at least for most cases: Here are the cases where the aromatic bonds from an atom must be single bonds: 1) has an explicit double bond - overs exocyclic bonds and bonds in non-aromatic rings 2) is O, Se, or Te and charge ==0 - covers c1cocc1, c1c[se]ccc1, c1c[oH+]ccc1 (not sure if that's actually useful) 3) is N or P and charge==0 and numExplicitHs=1. This ignores weird cases like radicals, C-, [oH+] etc. I think other aromatic bonds can be either single or double (unless they're adjacent to one of these, in which case they have to be double, but that's a "higher order effect"). Is that useful at all? ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] aromatic bonds and graph edit distance
Hi Jameed, I don't think your approach will work, which means I likely didn't explain myself well enough. Let's say I start with: Cc1cc2c2c(=O)o1 - https://cactus.nci.nih.gov/chemical/structure/Cc1cc2c2c(=O)o1/image I want to break the aromatic bond between the aromatic 'c' (also double-bonded to the O) and the aromatic 'o'. This is possible because a Kekule interpretation of that molecule is: >>> from rdkit import Chem >>> mol = Chem.MolFromSmiles("Cc1cc2c2c(=O)o1") >>> Chem.Kekulize(mol) >>> Chem.MolToSmiles(mol) 'Cc1cc2c2c(=O)o1' >>> Chem.MolToSmiles(mol, kekuleSmiles=True) 'CC1=CC2=CC=CC=C2C(=O)O1' and I could break it by simply inserting a "." in the right spot, like this: CC1=CC2=CC=CC=C2C(=O).O1 >>> mol2 = Chem.MolFromSmiles("CC1=CC2=CC=CC=C2C(=O).O1") >>> Chem.MolToSmiles(mol2) 'CC(O)=Cc1c1C=O' The resulting structure still has an aromatic ring, but not all of the atoms which were flagged as aromatic are still aromatic. With the approach you outlined, I don't know how to restored the modified structure so I can generate the final SMILES I expect. The Kekulize() approach finds a valid Kekule interpretation, but there may be multiple interpretations, allowing both a single and double bond for a given bond position. I think I want to generate both possibilities. So far the only solution I've come up with - and it's a real brute-force one - is to do complete enumeration, along the lines of the code after my signature. % python get_kekule_bonds.py 'Cc1cc2c2c(=O)o1' bondfragatoms allowed 1 c:c 1,2 DOUBLE 2 c:c 2,3 SINGLE 3 c:c 3,4 SINGLE DOUBLE 4 c:c 4,5 SINGLE DOUBLE 5 c:c 5,6 SINGLE DOUBLE 6 c:c 6,7 SINGLE DOUBLE 7 c:c 7,8 SINGLE DOUBLE 8 c:c 8,9 SINGLE 10 c:o 9,11SINGLE 11 c:o 11,1SINGLE 12 c:c 8,3 SINGLE DOUBLE (To be usable for what I want, I also need to figure out where those bond terms are in the Kekeule SMILES, so I can either replace it with a "." or remove the ring closure, then re-perceive.) This can work for what I'm looking for, since I have <=14 heavies, but I'm hoping for something rather less crude. Andrew da...@dalkescientific.com # List possible Kekule bond assignments for aromatic bonds in a SMILES from rdkit import Chem import itertools import sys def get_kekule_bonds(mol): """mol -> possible bond assignment table Returns a table mapping bond index (for aromatic bonds) to possible Kekule bond assignments for that bond. """ # Figure out which bonds are aromatic aromatic_bond_indices = {} for bond in mol.GetBonds(): if bond.GetIsAromatic(): a1, a2 = bond.GetBeginAtomIdx(), bond.GetEndAtomIdx() if a1 > a2: a1, a2 = a2, a1 aromatic_bond_indices[(a1, a2)] = bond.GetIdx() if not aromatic_bond_indices: return {} # Look for assignments which result in the same Kekeule SMILES # (with aromatic bonds indicated as ':') expected_kekule_smiles = Chem.MolToSmiles(mol, kekuleSmiles=True) # Initialize the table of bond assignments possible_bond_assignments = dict((idx, set()) for idx in aromatic_bond_indices.values()) # Figure out how to get from the canonical Kekule SMILES output order # back to the moleule order. s = mol.GetProp("_smilesAtomOutputOrder") assert s[0] == "[" and s[-2:] == ",]", s invert_order = dict(enumerate(map(int, s[1:-2].split("," # Figure out the places where I can replace a ":" with a "-" or "=" parts = expected_kekule_smiles.split(":") N = len(parts)-1 assert N == len(aromatic_bond_indices) terms = [None] * (2*N+1) terms[::2] = parts # Generate all possible bond assignments (generates 2**N assignments) for bond_assignment in itertools.product( *(("-=",)*N) ): # Replace the places where the ":" used to be (or rather, the previous assignment) # and interpret as a new SMILES terms[1::2] = bond_assignment new_kekule_smiles = "".join(terms) new_mol = Chem.MolFromSmiles(new_kekule_smiles) if new_mol is None: continue # Was there a match? canonical_kekule_smiles = Chem.MolToSmiles(new_mol, kekuleSmiles=True) if canonical_kekule_smiles != expected_kekule_smiles: continue # Yes! That means we can figure out what the actual bond assignments # were. Re-process the Kekule SMILES, but don't do aromaticity perception. raw_mol = Chem.MolFromSmiles(new_kekule_smiles, sanitize=False) for bond in raw_mol.GetBonds(): # Need to map back to the original bonds original_a1 = invert_order[bond.GetBeginAtomIdx()] original_a2 =
Re: [Rdkit-discuss] aromatic bonds and graph edit distance
On Aug 21, 2019, at 03:42, Francois Berenger wrote: > Unless rdkit has something, I think graph edit distance is the kind > of things for which you have to rely on a good graph library. Do you know of any (non-chemical) graph library which can handle edits involving the breaking of aromatic bonds in a chemically correct way? I do not. > Also, maybe the string edit distance between the two canonical smiles is a > good enough proxy. This attempt of mine now, to experiment with graph edit distance, came out of a conversation I had last week with someone using string edit distance. I expressed doubt on how "good" the "good enough" was, but was unable to give any concrete details. I earlier wrote: >> For chain bonds, and non-aromatic bonds, it's easy to delete the bond >> and add the correct number of hydrogens to either side. Similarly, for many chain edits, the string edit distance is a decent proxy, as you say. However, has the goodness ever been characterized? Along with a description of how to minimize the problems with string edit distance? Some of the obvious ones are: 1) Chirality and stereochemistry L-alanine and D-alanine have a graph edit distance to alanine with unspecified chirality are 4 and 5, respectively. N[C@H](C)C(=O)O N[C@@H](C)C(=O)O NC(C)C(=O)O This does not seem reasonable. A similar issue occurs with double bond sterochemistry, like F/C=C/F vs. FC=CF. 2) Isotopes Same issue: CN vs. [14CH3]N. 3) Overlapping element symbols c1c1C and c1c1Cl have an edit distance of 1 c1c1C and c1c1Br have an edit distance of 2 There is no chemical sense for those to have different distances. I can think of ways to mitigate some of the effects of #1-3. In particularly, a substitution matrix (or conversion to pharmacophore reduced graphs) can improve #3. 4) Sensitivity to canonicalization order Depending on the canonicalization method, the following two structures either have a string edit distance of 1 or 4, while the graph edit distance is 1. >>> Chem.CanonSmiles("PCCN") 'NCCP' >>> Chem.CanonSmiles("CCN") 'CCN' 5) difficulty in handling ring formation in a meaningful way >>> Chem.CanonSmiles("C1=CC=CC=C1") 'c1c1' >>> Chem.CanonSmiles("C=CC=CC=C") 'C=CC=CC=C' There are no shared string synbols, so the string edit distance is 9, yet the bond edit distance is only 1. It is this last issue that I am particularly concerned with, leading me to ask about how to handle aromatic bonds when computing the graph edit distance. Andrew da...@dalkescientific.com ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] aromatic bonds and graph edit distance
Hi Andrew, You could substitute the aromatic atoms into non-organic atoms (eg. aromatic carbon to calcium [Ca], aromatic nitrogen to sodium [Na] and so on with single bonds between them), and use the atom deleting procedure as normal. The structures should still be valid so should be able to use the usual rdkit functions on them. Thanks Jameed From: Andrew Dalke Sent: 20 August 2019 21:06 To: RDKit Discuss Subject: [Rdkit-discuss] aromatic bonds and graph edit distance Hi all, Someone asked me recently about finding the graph edit distance of two small (<= 14 atom) fragments. I figured this was something that could be brute forced. Following SmallWorld's example at https://cisrg.shef.ac.uk/shef2016/talks/oral13.pdf , given a fragment, incrementally delete terminals (except the "*" connection point atom), and ring bonds. For chain bonds, and non-aromatic bonds, it's easy to delete the bond and add the correct number of hydrogens to either side. But, what should I do when I cut an aromatic bond? For something like the first "co" in "c1cocn1", I want the result to be C=CN=CO. That's because the "o" can only be "-O-" in Kekule form. For something like "c1cnncn1", breaking on the "nn", I think I would like to get both 'N=CC=NC=N' and 'NC=CN=CN' because the "nn" can be a single or a double bond, depending on the Kekule representation, as in: >>> Chem.CanonSmiles("C-1=N-N=C-C=N-1") 'c1cnncn1' >>> Chem.CanonSmiles("C-1=N.N=C-C=N-1") 'N=CC=NC=N' >>> Chem.CanonSmiles("C=1-N=N-C=C-N=1") 'c1cnncn1' >>> Chem.CanonSmiles("C=1-N-[HH].[HH]N-C=C-N=1") 'NC=CN=CN' Problem is, I don't know how to figure out if a given aromatic bond must be a "-" or "=", or can be both. (Well, I could brute-force enumerae all 2**n possible aromatic bond assignments, then canonicalize, and see if both assignments are possible for a given bond.) As a non-chemist, I also ask if I'm even on a chemically meaningful track. Andrew da...@dalkescientific.com ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss Disclaimer This electronic mail and its attachments are intended solely for the person(s) to whom they are addressed and contain information which is confidential or otherwise protected from disclosure, except for the purpose for which they are intended. Dissemination, distribution, or reproduction by anyone other than the intended recipients is prohibited and may be illegal. If you are not an intended recipient, please immediately inform the sender and return the electronic mail and its attachments and destroy any copies which may be in your possession. Dotmatics Limited screens electronic mails for viruses but does not warrant that this electronic mail is free of any viruses. Dotmatics Limited accepts no liability for any damage caused by any virus transmitted by this electronic mail. Dotmatics Limited is registered in England & Wales No. 5614524 with offices at The Old Monastery, Windhill, Bishops Stortford, Herts, CM23 2ND, UK. ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] aromatic bonds and graph edit distance
On 21/08/2019 05:06, Andrew Dalke wrote: Hi all, Someone asked me recently about finding the graph edit distance of two small (<= 14 atom) fragments. I figured this was something that could be brute forced. Following SmallWorld's example at https://cisrg.shef.ac.uk/shef2016/talks/oral13.pdf , given a fragment, incrementally delete terminals (except the "*" connection point atom), and ring bonds. Unless rdkit has something, I think graph edit distance is the kind of things for which you have to rely on a good graph library. Also, maybe the string edit distance between the two canonical smiles is a good enough proxy. For chain bonds, and non-aromatic bonds, it's easy to delete the bond and add the correct number of hydrogens to either side. But, what should I do when I cut an aromatic bond? For something like the first "co" in "c1cocn1", I want the result to be C=CN=CO. That's because the "o" can only be "-O-" in Kekule form. For something like "c1cnncn1", breaking on the "nn", I think I would like to get both 'N=CC=NC=N' and 'NC=CN=CN' because the "nn" can be a single or a double bond, depending on the Kekule representation, as in: Chem.CanonSmiles("C-1=N-N=C-C=N-1") 'c1cnncn1' Chem.CanonSmiles("C-1=N.N=C-C=N-1") 'N=CC=NC=N' Chem.CanonSmiles("C=1-N=N-C=C-N=1") 'c1cnncn1' Chem.CanonSmiles("C=1-N-[HH].[HH]N-C=C-N=1") 'NC=CN=CN' Problem is, I don't know how to figure out if a given aromatic bond must be a "-" or "=", or can be both. (Well, I could brute-force enumerae all 2**n possible aromatic bond assignments, then canonicalize, and see if both assignments are possible for a given bond.) As a non-chemist, I also ask if I'm even on a chemically meaningful track. Andrew da...@dalkescientific.com ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
[Rdkit-discuss] aromatic bonds and graph edit distance
Hi all, Someone asked me recently about finding the graph edit distance of two small (<= 14 atom) fragments. I figured this was something that could be brute forced. Following SmallWorld's example at https://cisrg.shef.ac.uk/shef2016/talks/oral13.pdf , given a fragment, incrementally delete terminals (except the "*" connection point atom), and ring bonds. For chain bonds, and non-aromatic bonds, it's easy to delete the bond and add the correct number of hydrogens to either side. But, what should I do when I cut an aromatic bond? For something like the first "co" in "c1cocn1", I want the result to be C=CN=CO. That's because the "o" can only be "-O-" in Kekule form. For something like "c1cnncn1", breaking on the "nn", I think I would like to get both 'N=CC=NC=N' and 'NC=CN=CN' because the "nn" can be a single or a double bond, depending on the Kekule representation, as in: >>> Chem.CanonSmiles("C-1=N-N=C-C=N-1") 'c1cnncn1' >>> Chem.CanonSmiles("C-1=N.N=C-C=N-1") 'N=CC=NC=N' >>> Chem.CanonSmiles("C=1-N=N-C=C-N=1") 'c1cnncn1' >>> Chem.CanonSmiles("C=1-N-[HH].[HH]N-C=C-N=1") 'NC=CN=CN' Problem is, I don't know how to figure out if a given aromatic bond must be a "-" or "=", or can be both. (Well, I could brute-force enumerae all 2**n possible aromatic bond assignments, then canonicalize, and see if both assignments are possible for a given bond.) As a non-chemist, I also ask if I'm even on a chemically meaningful track. Andrew da...@dalkescientific.com ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss