Hi Susan, I haven't looked at the Sgroup or unsaturation flags yet (I will try to do this later today), but a word on the aromaticity/kekulization.
One of the things which is going wrong here at the cartridge level is that qmol_from_ctab() is sanitizing the molecules it reads in. This is not correct. qmol_from_ctab is intended to produce a query and queries should not be sanitized. I will get that fixed and, assuming it doesn't turn up other problems, we should be able to get that into the next patch release. Here's that bug report: https://github.com/rdkit/rdkit/issues/4787 There's another potential issue with how bonds from CTABs are parsed, inspired by your first message, which we're discussing here: https://github.com/rdkit/rdkit/issues/4785 Thanks for the very detailed descriptions of the problem! -greg On Fri, Dec 10, 2021 at 11:11 AM Susan Leung <susanhle...@gmail.com> wrote: > Hi Paolo, > > > Thanks very much for filing the bug and for offering the Python > preprocessing solution. > > > I actually have a few more CTABs that are not valid. One of which raises a > kekulisation error and like the previous non-ring aromaton atom problem, > there is an alternative SMARTS query that can be written. I suspect that > you might suggest some Python preprocessing for this and converting to > SMARTS? > > > The other two errors, I don’t think are to do with sanitization but > possibly due to the way the CTAB is read. One has multiple atoms with the > unsaturated flag turned on. The other has a SGROUP defined. > > Please see below, where I tried to summarize and I again attach a ipynb if > it helps. > > > Thanks very much, > > > Susan > > > Example 2: Kekulization > > I want a query CTAB to match the following two tautomers. > > sm1 = 'Cc1c[nH]nc1C' > > sm2 = 'Cc1cn[nH]c1C' > > > > I would like to use the following CTAB as the query but it is not valid , > I’m guessing it’s because of kekulization, which is the error produced when > doing MolFromMolBlock: > > ctab = """ > > ACCLDraw12082111532D > > > > 7 7 0 0 0 0 0 0 0 0999 V2000 > > 9.2840 -12.1344 0.0000 C 0 0 3 0 0 0 0 0 0 0 0 0 > > 10.2367 -11.4422 0.0000 C 0 0 3 0 0 0 0 0 0 0 0 0 > > 9.8729 -10.3217 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0 > > 8.6950 -10.3217 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0 > > 8.3309 -11.4422 0.0000 C 0 0 3 0 0 0 0 0 0 0 0 0 > > 7.1932 -11.7471 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 > > 9.2840 -13.3122 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 > > 2 1 4 0 0 0 0 > > 2 3 4 0 0 0 0 > > 3 4 4 0 0 0 0 > > 4 5 4 0 0 0 0 > > 5 1 4 0 0 0 0 > > 5 6 1 0 0 0 0 > > 7 1 1 0 0 0 0 > > M END > > """ > > select is_valid_ctab('{ctab}') > > > > Returns False > > I can make an alternative valid CTAB with a hydrogen on one of the > nitrogens that is valid, but then it doesn’t match both sm1 and sm2. > > ctab_fixed = """ > > ACCLDraw12082111272D > > > > 8 8 0 0 0 0 0 0 0 0999 V2000 > > 11.6590 -11.9469 0.0000 C 0 0 3 0 0 0 0 0 0 0 0 0 > > 12.6117 -11.2547 0.0000 C 0 0 3 0 0 0 0 0 0 0 0 0 > > 12.2479 -10.1342 0.0000 N 0 0 3 0 0 0 0 0 0 0 0 0 > > 11.0700 -10.1342 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0 > > 10.7059 -11.2547 0.0000 C 0 0 3 0 0 0 0 0 0 0 0 0 > > 9.5682 -11.5596 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 > > 11.6590 -13.1247 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 > > 12.8368 -9.1142 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0 > > 2 1 4 0 0 0 0 > > 2 3 4 0 0 0 0 > > 3 4 4 0 0 0 0 > > 4 5 4 0 0 0 0 > > 5 1 4 0 0 0 0 > > 5 6 1 0 0 0 0 > > 7 1 1 0 0 0 0 > > 3 8 1 0 0 0 0 > > M END > > """ > > > > select mol_from_smiles('{sm1}') @> qmol_from_ctab('{ctab_fixed} > > > > Returns True > > select mol_from_smiles('{sm2}') @> qmol_from_ctab('{ctab_fixed}') > > > > Returns False > > > > However, I can make a qmol from SMARTS that can match with both: > > alt_smarts = '[#6]1(:[#6]:[#7]:[#7]:[#6]:1-[#6])-[#6]' > > > > select mol_from_smiles('{sm1}') @> qmol_from_smarts('{alt_smarts}') > > > > select mol_from_smiles('{sm2}') @> qmol_from_smarts('{alt_smarts}' > > > > Return True for both. > > _______ > > Example 3 Unsaturated : > > How does RDKit handle the M UNS line? It can’t seem to handle this CTAB > for example, where multiple atoms (atom 8 and atom 9) have the unsaturated > flag on. > > ctab_og = """ > > ACCLDraw12082113482D > > > > 9 9 0 0 0 0 0 0 0 0999 V2000 > > 2.6030 -22.3675 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 > > 3.6258 -21.7773 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 > > 4.6447 -22.3669 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 > > 4.6447 -23.5480 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 > > 3.6283 -24.1374 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 > > 2.6030 -23.5533 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 > > 3.6258 -20.5962 0.0000 N 0 0 3 0 0 0 0 0 0 0 0 0 > > 2.6029 -20.0057 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 > > 4.6486 -20.0057 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 > > 2 1 2 0 0 0 0 > > 3 2 1 0 0 0 0 > > 4 3 2 0 0 0 0 > > 5 4 1 0 0 0 0 > > 1 6 1 0 0 0 0 > > 6 5 2 0 0 0 0 > > 2 7 1 0 0 0 0 > > 7 8 1 0 0 0 0 > > 7 9 1 0 0 0 0 > > M UNS 2 8 1 9 1 > > M END > > """ > > ctab_fixed = """ > > ACCLDraw12082113482D > > > > 9 9 0 0 0 0 0 0 0 0999 V2000 > > 2.6030 -22.3675 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 > > 3.6258 -21.7773 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 > > 4.6447 -22.3669 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 > > 4.6447 -23.5480 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 > > 3.6283 -24.1374 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 > > 2.6030 -23.5533 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 > > 3.6258 -20.5962 0.0000 N 0 0 3 0 0 0 0 0 0 0 0 0 > > 2.6029 -20.0057 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 > > 4.6486 -20.0057 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 > > 2 1 2 0 0 0 0 > > 3 2 1 0 0 0 0 > > 4 3 2 0 0 0 0 > > 5 4 1 0 0 0 0 > > 1 6 1 0 0 0 0 > > 6 5 2 0 0 0 0 > > 2 7 1 0 0 0 0 > > 7 8 1 0 0 0 0 > > 7 9 1 0 0 0 0 > > M UNS 1 8 1 > > M END > > > > """ > > > > select is_valid_ctab('{ctab_og}') > > select is_valid_ctab('{ctab_fixed}') > > Returns False and True respectively > > MolFromMolBlock(ctab_og) gives warning Value 9 is not supported as an > unsaturation query (only 0 and 1 are allowed). line: 23 > > *Example 4 – Sgroup* : > > I can’t work out why RDKit cannot accept this CTAB. I have other cases > that have Sgroups so I know RDKit can handle them. If I switch the order of > LABEL and CSTATE in the SGROUP block, it is valid. And if I remove the > SGROUP block entirely it also reads fine. > > ctab_og = """ > > ACCLDraw12082117492D > > > > 0 0 0 0 0 999 V3000 > > M V30 BEGIN CTAB > > M V30 COUNTS 9 9 1 0 0 > > M V30 BEGIN ATOM > > M V30 1 C 27.8908 -28.8436 0 0 > > M V30 2 C 27.2981 -29.8716 0 0 > > M V30 3 C 29.072 -28.8436 0 0 > > M V30 4 C 27.8937 -30.8881 0 0 > > M V30 5 C 29.0691 -30.8881 0 0 > > M V30 6 C 29.6604 -29.8644 0 0 > > M V30 7 N 29.6604 -31.9103 0 0 CHG=1 > > M V30 8 O 29.0691 -32.934 0 0 > > M V30 9 O 30.8416 -31.9103 0 0 CHG=-1 > > M V30 END ATOM > > M V30 BEGIN BOND > > M V30 1 1 1 2 > > M V30 2 2 3 1 > > M V30 3 2 2 4 > > M V30 4 1 4 5 > > M V30 5 1 6 3 > > M V30 6 2 6 5 > > M V30 7 1 5 7 > > M V30 8 2 7 8 > > M V30 9 1 7 9 > > M V30 END BOND > > M V30 BEGIN SGROUP > > M V30 1 SUP 0 ATOMS=(3 7 8 9) XBONDS=(1 7) LABEL="NO2" CSTATE=(4 7 > - > > M V30 -0.4100 0.7100 0.0000) > > M V30 END SGROUP > > M V30 END CTAB > > M END > > """ > > ctab_fixed = """ > > ACCLDraw12082117492D > > > > 0 0 0 0 0 999 V3000 > > M V30 BEGIN CTAB > > M V30 COUNTS 9 9 1 0 0 > > M V30 BEGIN ATOM > > M V30 1 C 27.8908 -28.8436 0 0 > > M V30 2 C 27.2981 -29.8716 0 0 > > M V30 3 C 29.072 -28.8436 0 0 > > M V30 4 C 27.8937 -30.8881 0 0 > > M V30 5 C 29.0691 -30.8881 0 0 > > M V30 6 C 29.6604 -29.8644 0 0 > > M V30 7 N 29.6604 -31.9103 0 0 CHG=1 > > M V30 8 O 29.0691 -32.934 0 0 > > M V30 9 O 30.8416 -31.9103 0 0 CHG=-1 > > M V30 END ATOM > > M V30 BEGIN BOND > > M V30 1 1 1 2 > > M V30 2 2 3 1 > > M V30 3 2 2 4 > > M V30 4 1 4 5 > > M V30 5 1 6 3 > > M V30 6 2 6 5 > > M V30 7 1 5 7 > > M V30 8 2 7 8 > > M V30 9 1 7 9 > > M V30 END BOND > > M V30 BEGIN SGROUP > > M V30 1 SUP 0 ATOMS=(3 7 8 9) XBONDS=(1 7) CSTATE=(4 7 - > > M V30 -0.4100 0.7100 0.0000) LABEL="NO2" > > M V30 END SGROUP > > M V30 END CTAB > > M END > > """ > > > > select is_valid_ctab('{ctab_og}') > > Returns False > > select is_valid_ctab('{ctab_fixed}') > > Returns True > > > On Thu, Dec 9, 2021 at 10:06 PM Paolo Tosco <paolo.tosco.m...@gmail.com> > wrote: > >> Hi Susan, >> >> that looks like a bug in the way the MDL query is parsed; I have filed it >> here: >> https://github.com/rdkit/rdkit/issues/4785 >> >> If you can afford doing some Python massaging to your CTAB queries and >> converting them to SMARTS before submitting them to PostgreSQL when they >> fail sanitization, the following should work: >> >> mol_og = Chem.MolFromMolBlock(ctab_og, sanitize=False) >> try: >> Chem.SanitizeMol(mol_og) >> cur = conn.cursor() >> cur.execute(f"""select mol_from_smiles('{sm1}') @> >> qmol_from_ctab('{ctab_og}')""") >> rows = cur.fetchall() >> print(rows) >> except Chem.AtomKekulizeException as e: >> if re.match(r"non-ring atom \d+ marked aromatic", str(e)): >> Chem.FastFindRings(mol_og) >> rwmol_og = Chem.RWMol(mol_og) >> for a in mol_og.GetAtoms(): >> if a.GetIsAromatic() and not a.IsInRing(): >> rwmol_og.ReplaceAtom(a.GetIdx(), >> rdqueries.IsAromaticQueryAtom()) >> try: >> Chem.SanitizeMol(rwmol_og) >> smarts_og = Chem.MolToSmarts(rwmol_og) >> cur = conn.cursor() >> cur.execute(f"""select mol_from_smiles('{sm1}') @> >> mol_from_smarts('{smarts_og}')""") >> rows = cur.fetchall() >> print(rows) >> except: >> ... >> >> HTH, cheers >> p. >> >> >> On Thu, Dec 9, 2021 at 5:32 PM Susan Leung <susanhle...@gmail.com> wrote: >> >>> Hi all, >>> >>> >>> >>> I am trying to do some substructure queries using the RDKit PostgreSQL >>> cartridge. Specifically, my queries substructure inputs are CTAB (not >>> SMARTS) so I would like to use qmol_from_ctab. However, I have some >>> problems with making valid query molecules with a few CTABs. >>> >>> >>> >>> In this query, I try to use a CTAB to make a query to search for aryl >>> boronate acid/ester. I can make an equivalent query using SMARTS but the >>> CTAB is not valid. >>> >>> >>> As far as I am aware, there's no warning message when using the SQL >>> functions, so I use MolFromMolBlock from python and get "non-ring atom >>> 0 marked aromatic" so I correct the aromatic bond type to double bond >>> and the CTAB can be read in (but that's not the query I want). I am >>> guessing that there may be additional validity checks / sanitization >>> steps when executing qmol_from_ctab vs qmol_from_smarts? As far as I can >>> see, there’s no flag in qmol_from_ctab. >>> >>> >>> >>> I describe the general problem below but also attach the ipynb (if it is >>> useful) that uses psycopg2 to do the SQL , leaving out the database >>> connection credentials. >>> >>> >>> >>> Many thanks, >>> >>> >>> >>> Susan >>> >>> __________________________________ >>> >>> >>> >>> For example, I want to match an aromatic boronic acid: >>> >>> sm1 = 'OB(O)c1ccccc1' >>> >>> >>> >>> But the following CTAB isn’t valid. MolFromFromBlock returns non-ring >>> atom marked aromatic error so I suspect it’s to do with that. Also changing >>> the bond marked aromatic ‘4’ to a double bond ‘2’ makes the ctab valid. >>> >>> ctab_og = """Boronate acid/ester(aryl) >>> >>> SciTegic12012112112D >>> >>> >>> >>> 5 4 0 0 0 0 999 V2000 >>> >>> 1.7243 -2.7324 0.0000 A 0 0 >>> >>> 2.7559 -2.1456 0.0000 C 0 0 >>> >>> 3.7808 -2.7324 0.0000 B 0 0 >>> >>> 4.8057 -2.1456 0.0000 O 0 0 >>> >>> 3.7808 -3.9190 0.0000 O 0 0 >>> >>> 1 2 4 0 0 1 0 >>> >>> 2 3 1 0 >>> >>> 3 4 1 0 >>> >>> 3 5 1 0 >>> >>> M END >>> >>> > <Name> >>> >>> Boronate acid/ester(aryl) >>> >>> >>> >>> """ >>> >>> ctab_fixed = """Boronate acid/ester(aryl) >>> >>> SciTegic12012112112D >>> >>> >>> >>> 5 4 0 0 0 0 999 V2000 >>> >>> 1.7243 -2.7324 0.0000 A 0 0 >>> >>> 2.7559 -2.1456 0.0000 C 0 0 >>> >>> 3.7808 -2.7324 0.0000 B 0 0 >>> >>> 4.8057 -2.1456 0.0000 O 0 0 >>> >>> 3.7808 -3.9190 0.0000 O 0 0 >>> >>> 1 2 2 0 0 1 0 >>> >>> 2 3 1 0 >>> >>> 3 4 1 0 >>> >>> 3 5 1 0 >>> >>> M END >>> >>> > <Name> >>> >>> Boronate acid/ester(aryl) >>> >>> >>> >>> """ >>> >>> select is_valid_ctab('{ctab_og}') >>> >>> >>> >>> Returns False >>> >>> select is_valid_ctab('{ctab_fixed}') >>> >>> Returns True >>> >>> >>> >>> However, I can make a qmol using SMARTS match sm1. Is there of making >>> the query CTAB valid so we don’t have to use SMARTS? >>> >>> select mol_from_smiles('{sm1}') @> qmol_from_ctab('{ctab_fixed}') >>> >>> >>> >>> Returns False >>> >>> >>> >>> select mol_from_smiles('{sm1}') @> qmol_from_smarts('{alt_smarts}') >>> >>> >>> >>> Returns True >>> _______________________________________________ >>> Rdkit-discuss mailing list >>> Rdkit-discuss@lists.sourceforge.net >>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss >>> >> _______________________________________________ > Rdkit-discuss mailing list > Rdkit-discuss@lists.sourceforge.net > https://lists.sourceforge.net/lists/listinfo/rdkit-discuss >
_______________________________________________ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss